edger glm, classic, dispersions
1
0
Entering edit mode
tonja.r ▴ 80
@tonjar-7565
Last seen 8.2 years ago
United Kingdom

I got totally confused when looking at edgeR manual somehow:
The are different ways of estimating the dispersions (GLM and a classic approach). Also, one might estimate either CommonDisp or TrendedDisp.
I get totally different results when applying GLM and classic and also when excluding TrendedDisp.
My experiment is quite simple: I have 2 treated and 2 untreated samples and want to find differential expressed genes. 
I also found that I have a batch effect, so that I got batch-values from RUVseq. What would be an applicable method (GLM or classic) to use if I want to intergrate batch effect into the design matrix. Also, when should I use estimateGLMTrendedDisp over estimateGLMCommonDisp?


GLM:

y <- estimateGLMCommonDisp(y, design)

y <- estimateGLMTrendedDisp(y, design)

y <- estimateGLMTagwiseDisp(y, design)
then apply glmFit and glmLRT

 

classic:

y <- estimateCommonDisp(y)

y <- estimateTagwiseDisp(y)
then apply exact test



EDIT:
also I get different BCV plots when I run either y <- estimateDisp(y, design) or those three commands together

y <- estimateGLMCommonDisp(y, design)

y <- estimateGLMTrendedDisp(y, design)

y <- estimateGLMTagwiseDisp(y, design)
​

​what is quite weird because according to the edger manual they should give the same results.

edger • 2.1k views
ADD COMMENT
0
Entering edit mode

Define "totally different". Totally different dispersions? Totally different DE genes?

ADD REPLY
0
Entering edit mode

I get different DE genes and different BCV plots and also strange p.values. With the "trended dispersion" most of the p.values are 1.

ADD REPLY
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 10 hours ago
The city by the bay

It's hard to know what's going on without having some diagnostics, but here's some general points. Firstly, you haven't performed trended dispersion estimation in your code with classic edgeR. This will obviously yield different results to the GLM method where you get trended estimates, because the mean-dispersion trend provides a more accurate model of the empirical mean-variance relationship. Secondly, the GLM functions are implemented differently from estimateDisp, so the dispersions that you get out of them will not be exactly equal, even at the best of times. Finally, it's not surprising that exactTest and glmLRT give different results, as they operate off different statistical frameworks. This shouldn't matter for strongly DE genes, but I suspect that it will affect the genes that are borderline significant.

More educated advice would require examination of the BCV plots - if these are substantially different between methods, then the p-values will inevitably be different. The various dispersion estimation functions should (and do) give similar results for well-behaved data. Large differences should only manifest if there's something weird going on. From the information you've supplied, this is not out of the question for your data set, especially if there's a big batch effect in there somewhere.

As for the batch blocking factor, you'll need to add it as an additive term to your design matrix. You've only got 2 residual d.f. in your design, so you can only afford a single blocking term. You can probably add it in by doing something like this:

design <- model.matrix(~batch + grouping)

... and then make sure you drop the last coefficient to test for DE between groups. Obviously, this only works if your batch effect doesn't confound your groupings of interest. If both samples of one group are in one batch, and both samples of the other group are in another batch, then you won't be able to separate your batch effect from genuine DE.

Edit: Actually, Gordon reminded me that estimateTagwiseDisp fits a trend by default, so my first point shouldn't be a problem. The fitting algorithm is different from the GLM methods, though - the former uses a moving average compared to a bin spline for the latter. This will still cause some differences, though I wouldn't have expected anything too dramatic.

ADD COMMENT
0
Entering edit mode

The funny thing is that my treated.1 and control.1 were done together and then treated.2. and control.2. I run RUVseq to find the coefficients for the batch effect.

... and then make sure you drop the last coefficient to test for DE between groups.

So, to test for DE I need to modify my design matrix that will count only for the batch effect, right? And before 
design <- model.matrix(~batch)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=2) 


 

ADD REPLY
1
Entering edit mode

No. In a hypothetical case where your batch effect confounds your treatment conditions, there's little that can be done. Regardless of whether you only put batch in your design, or batch with grouping, when you do your DE testing you won't know whether your differences are due to genuine DE between groups or spurious differences caused by the batch.

Now, in your case, if any batch effect were to exist, then I would expect it would be between the first pair of treated/control and the second pair. In this case, DE within each pair can still be estimated. You can just specify the batch effect manually:

batch <- factor(c(1,1,2,2))
grouping <- factor(c("c", "t", "c", "t"))
design <- model.matrix(~0 + batch + grouping)

The last coefficient will represent the treatment/control log-fold change between groups, while the first two coefficients are blocking factors for the batch (these represent the log-expression of the control sample in each batch). Dropping the last coefficient will subsequently test for DE between treatment and control, while avoiding inflated dispersions due to the batch.

ADD REPLY
0
Entering edit mode

Just a small question, 

if I use design <- model.matrix(~batch + grouping), how would it change the interpretation?
 

 

 

ADD REPLY
0
Entering edit mode

You would still drop the last coefficient to test for DE between groupings; that won't change. The first coefficient (i.e., intercept) will still represent the log-expression of the control sample in the first batch, but now the second coefficient will represent the log-fold change of the second control sample over the first. I prefer the intercept-free model where the second coefficient represents the log-expression of the second control sample, but that's largely a matter of taste. It won't make any difference in your case because you're not interested in testing for DE between the batch effects; the batch is just a nuisance variable to you, so it doesn't really matter how it's parametrised.

ADD REPLY

Login before adding your answer.

Traffic: 739 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6