Correct formulation of contrasts of interest regarding an RNA-Seq gene expression analysis with edgeR
1
0
Entering edit mode
fotakisg • 0
@fotakisg-17384
Last seen 2.1 years ago
Austria

Dear Bioconductor Support Site,

I'm currently working with an RNA-Seq dataset, which consists of 84 samples(Collaborative Cross mice genetic panel of normal, diabetic and obese phenotypes).  My code for the analysis used so far is the following:

head(sampleInfo) # the phenotype data information 

SampleName Sex Obesity Diabetes
1      DT_77   M   Obese Diabetic
2      DT_39   F   Obese   Normal
3     DT_145   M  Normal   Normal
4     DT_130   F   Obese   Normal
5     DT_103   M  Normal   Normal
6     DT_124   M   Obese Diabetic

Next, after the creation of DGEList object/filtering/normalization, for the construction of the design matrix, I considered two separate approaches,to find DE genes concerning the obesity phenotype, but taking into account diabetes, while blocking on the sex confounder:

1) First Approach

# Create the design matrix
condition1.group <- as.factor(dge$samples$Obesity)
condition2.group <- as.factor(dge$samples$Diabetes)
sex.group <- as.factor(dge$samples$Sex)
design <- model.matrix(~condition1.group : condition2.group + sex.group)

y3 <- estimateDisp(dge, design, robust=TRUE)

Error in glmFit.default(sely, design, offset = seloffset, dispersion = 0.05,  :
  Design matrix not of full rank.  The following coefficients not estimable:
 condition1.groupObese:condition2.groupNormal
 head(design)
  (Intercept) sex.groupM condition1.groupNormal:condition2.groupDiabetic condition1.groupObese:condition2.groupDiabetic
1           1          1                                               0                                              1
2           1          0                                               0                                              0
3           1          1                                               0                                              0
4           1          0                                               0                                              0
5           1          1                                               0                                              0
6           1          1                                               0                                              1
  condition1.groupNormal:condition2.groupNormal condition1.groupObese:condition2.groupNormal
1                                             0                                            0
2                                             0                                            1
3                                             1                                            0
4                                             0                                            1
5                                             1                                            0
6                                             0                                            0

fit <- glmQLFit(y3, design, robust=TRUE)

fit2 <- glmQLFTest(fit, coef=2)

2) Second Approach

mult_con <- factor(paste(condition1.group, condition2.group, sep = "."))
sample2 <- cbind(sampleinfo,Group=mult_con)
design <- model.matrix(~0+sample2$Group + sex.group)

head(design)
  sample2$GroupNormal.Diabetic sample2$GroupNormal.Normal sample2$GroupObese.Diabetic sample2$GroupObese.Normal sex.groupM
1                            0                          0                           1                         0          1
2                            0                          0                           0                         1          0
3                            0                          1                           0                         0          1
4                            0                          0                           0                         1          0
5                            0                          1                           0                         0          1
6                            0                          0                           1                         0          1

my.contrasts <- makeContrasts(OvsN = sample2$GroupObese.Normal- sample2$GroupNormal.Normal, ONvsOD = sample2$GroupObese.Normal-sample2$GroupObese.Diabetic, levels=design)

# Obese_Normal vs Obese_Diabetic example

qlf.ON.OD <- glmQLFTest(fit, contrast=my.contrasts[,"ONvsOD"])
de.ON.OD <- topTags(qlf.ON.OD,n=nrow(qlf.ON.OD),adjust.method="BH", p.value=0.05)
dat.ON.OD <- de.ON.OD$table

Thus, my questions are the following;

A) 1) For the 1rst approach: regarding the initial error of full Rank, should I remove manually some coefficients that are not of "interest " ? For example, the aforementioned "condition1.groupObese:condition2.groupNormal" ?

2) Moreover, is the formulation and interpretation of the interaction approach correct ? The coefficient condition1.groupNormal:condition2.groupDiabetic , essentially compares the Normal samples from the Obesity group, with the Diabetic samples from the Diabetes group, correct ?

B) Overall, which of the two approaches should I consider ? Should I expect significant differences between the two approaches for the same comparisons ?

Thank you in advance and excuse me for my long post !!

 

rnaseq edger design and contrast matrix multifactorial design • 1.0k views
ADD COMMENT
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 7 hours ago
The city by the bay

Your second model parametrization is much easier to use and interpret. The first four coefficients represent the average log-expression within each obesity-disease combination, while the last coefficient represents the log-fold change of males over females. You can then simply test for pairwise differences between combinations:

# Paraphrasing the column names to keep things simple:
con <- makeContrasts(GroupNormal.Diabetic - GroupNormal.Normal, levels=design)

... or if you want to see the average effect of obesity:

con <- makeContrasts((GroupNormal.Diabetic + GroupNormal.Normal)/2
    - (GroupObese.Diabetic + GroupObese.Normal)/2, levels=design)

... or you can test the obesity effect separately between disease and normal patients, and intersect the DE lists.

I won't bother trying to figure out what the coefficients mean in your first approach, as your second approach is much more usable. I will just mention that, if you were to remove the column corresponding to the unestimable coefficient, the first design matrix would be mathematically equivalent to that in your second approach. That is, I could get the second design matrix from linear combinations of the columns in the first design matrix. However, the second design matrix is much easier to interpret, so you might as well start with that.

ADD COMMENT
0
Entering edit mode

Dear Aaron,

Thank you for your reply and your input, especially for the average effect of obesity between patient and normal samples.

Best regards,
Fotakis Georgios

ADD REPLY

Login before adding your answer.

Traffic: 701 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6