design for glm for edgeR
1
0
Entering edit mode
gthm ▴ 30
@gthm-8377
Last seen 5.6 years ago
spain

I would like to compare the DE genes between Treatment (T) vs Control (C). The Treatment and controls have subgroups ( which are minor variants of treatments/controls ), so I would expect some genes to show differential expression between different subgroups of treatment/control, but I am only interested in genes that show DE between T and C. I want to eliminate any genes that show DE only between subgroups. i.e If a gene is expressed in T1 and T2 but not in T3, I would like to skip those genes from DE analysis. 

 

Treatment

 

sub_group

T T1
T T1
T T1
T T2
T T2
T T2
T T3
T T3
T T3
C C1
C C1
C C1
C C2
C C2
C C2

What I am planning to do is the following and I would like to know if its the correct approach.

treat <- c(rep(c("T"),each=9), rep(c("C"),each=6))
sub <- c(rep(c("T1"),each=3),rep(c("T2"),each=3),rep(c("T3"),each=3),rep(c("C1"),each=3),rep(c("C2"),each=3))
design <- model.matrix(~sub+treat)
y <- DGEList(counts=counts_data, group=treat)
y <- calcNormFactors(y)
y <- estimateGLMCommonDisp(y, design, verbose=TRUE)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit)

edgeR • 1.4k views
ADD COMMENT
0
Entering edit mode

note: I edited away the DESeq tags because it looks like this is a question about how to do an edgeR analysis.

ADD REPLY
1
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 6 hours ago
The city by the bay

Your design matrix will not be of full column rank, as sub is nested within treat. It's impossible to estimate the last coefficient - for example, a treatT of 1 and a subT1, subT2 and subT3 of 0 would give the same fitted values as a treatT of 0 and the others equal to 1. You need to construct the design matrix using only one factor or the other. I'd probably construct it using sub if you're expecting some DE between subpopulations. Even if you're not interested in the DE between subpopulations, you should still account for it in your model, otherwise your dispersion estimates will be inflated.

Now, for the construction of the contrasts, you can do something like:

con <- makeContrasts((subT1 + subT2 + subT3)/3 - subC2/2, levels=design)

... and feed this as the contrast= argument in glmLRT, to test for differences between the average of the treated to the control subpopulations. The division by two for the control is intended, even though only one coefficient is present; this is because the C1 expression is implicitly present as the intercept, but cancels out so it doesn't show up in the above expression. Anyway, genes that are significant in the above contrast will be those that are DE between treatment and control. Note that this does not exclude genes that are DE between subpopulations, which is fine - for example, if you have a gene that was DE between T1, T2 and T3, but the expression in all treated subpopulations was upregulated relative to C1/2, then you'd want to call it as being DE between treatment and control.

That said, be careful in your interpretation, as you might get some genes where expression is being driven by a strong log-fold change in only one subpopulation. At worst, you could end up with a scenario where, e.g., T1 and T2 are downregulated relative to C1/2 but T3 is very strongly upregulated, such that the treated average is increased over the control. This would give a misleading impression of upregulation upon treatment. To protect against this, I would also do an ANOVA-like contrast by setting coef=2:5, and only look at the genes where the subT* coefficients were all positive and greater than subC2, or all negative and less than subC2. This, at least, ensures that all treated subpopulations are changing in the same direction relative to the control subpopulations.

ADD COMMENT
0
Entering edit mode

what would happen if I just use  treat <- c(rep(c("T"),each=9), rep(c("C"),each=6))

and leave out the subgrouping in the analysis ? Just ignoring the subgroups ?

ADD REPLY
1
Entering edit mode

Your dispersion estimates would be increased, because systematic differences between subgroups would be absorbed into the dispersion when subgroups are considered to be direct replicates of one another. You'll probably end up with a more conservative result (i.e., fewer DE genes) because of this effect. Also, edgeR assumes that replicates samples are independent; that won't be the case if you ignore the subgroups, because there will be dependencies between samples of the same subgroup (i.e., they will have similar counts due to a subgroup-specific mean). This can pose problems if the differences between subgroups are large enough. Your safest bet is to model the subgroups - this will complicate your DE comparisons, but it's better to have complications that you know about than complications that you don't know about.

ADD REPLY

Login before adding your answer.

Traffic: 771 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6