Question

Differential gene expression for multi-factorial experiment using edgeR

0

Entering edit mode

akulkarn • 0

@akulkarn-11834

Last seen 6.0 years ago

New York, United States

Hello Bioconductor community,

I am trying to find differentially expressed genes, between placebo P and drug M, as well as within placebo and drug groups at different time points. My experimental design is multi-factorial where some subjects receive placebo P and others receive drug M. I have gene expression data from samples collected for both drug and placebo, at three time points (1,2,3). Following edgeR manual section 3.3.1, I have generated the following targets data frame.

> targets
   subjects treatment time
1         1         P    1
2         1         P    2
3         1         P    3
4         2         M    1
5         2         M    2
6         2         M    3
7         3         M    1
8         3         M    2
9         7         M    1
10        7         M    2
11        7         M    3
12       10         M    1
13       10         M    2
14       10         M    3
15       11         P    1
16       11         P    2
17       11         P    3
18       12         P    1
19       12         P    2
20       12         P    3

In order to identify DEGs between groups of my interest, I have made contrasts that I want to incorporate in the glmQLF line of code. However, I also want to carry out a paired-analysis mentioned in edgeR manual section 3.4.2 in order to adjust for baseline differences between the subjects. Hence, using a suggestion from this post (differential expressed genes, paired samples and multiple factors) I have built a design matrix using

design <- model.matrix(~0+ treatment:time + subjects)

The columns of this design matrix contain columns for subjects and treatmentM:time1, treatmentP:time1, treatmentM:time2, treatmentP:time2, treatmentM:time3 and treatmentP:time3. However, when I get the following error when I estimate gene dispersion using

y <- estimateDisp(gene_filt, design)
Error in glmFit.default(sely, design, offset = seloffset, dispersion = 0.05,  : 
  Design matrix not of full rank.  The following coefficients not estimable:
 treatmentM:time3 treatmentP:time3

I would really appreciate suggestions whereby I can compare between my conditions of treatment:time, while adjusting for differences between the subjects.

Thank you

edger differential gene expression multiple factor design • 1.5k views

ADD COMMENT • link updated 7.6 years ago by Aaron Lun ★ 28k • written 7.6 years ago by akulkarn • 0

1

Entering edit mode

See Section 3.5 of the edgeR User's Guide. In your experiment, treatment is a "between subject" factor while time is a "within subject" factor. Paired analyses are only for within subject factors.

ADD REPLY • link 7.6 years ago Gordon Smyth 52k

score 3 · Answer 1 · 2017-09-28

It's pretty much as it says, your design matrix is not of full rank. You can't estimate the treatment:time effects and account for the subject-level effects at the same time. Consider, for example, a gene that increases in expression in all P.1 samples compared to the M.1 samples. You could attribute this to a P vs M effect, or to a subject-specific effect that only affects samples 1, 11 and 12; there's no way to mathematically distinguish these two possibilities if subject is included as a factor in your design matrix.

There are multiple solutions to this problem. The most appropriate is to switch to voom with limma, and use duplicateCorrelation with a one-way layout of treatment:time factors and subjects as the batch. This will effectively treat subjects as a random effect, avoiding dependencies with the treatment:time factor while still making use of all samples in the data set:

combo <- paste0(treatment, time, ".")
design <- model.matrix(~0 + combo) # no intercept, for simplicity
# Some time later, after running voom to get 'v':
dc <- duplicateCorrelation(v, design, batch=subjects)
fit <- lmFit(v, design, batch=subjects, correlation=dc$consensus)

If you want to continue using edgeR, you will need to drop some coefficients to eliminate the linear dependencies. I would suggest dropping all coefficients with time1 in the name. This means that the remaining treatmentX:timeY coefficients will represent the log-fold change of time Y over time 1 in treatment X. Each of these can be dropped to detect DE between time points in the same treatment group. However, if you want to compare the expression of samples between treatment groups, you will need to subset the data set to only contain the relevant samples, i.e., only one sample from each level of subjects. This ensures that all samples are independent, as there will no longer be dependencies between samples from the same subject.