I am bit in doubts if I should use estimateGLMCommonDisp(y,design) or estimateGLMTrendedDisp(y,design). There is however a difference in DE genes, not a big one but still. How should the ideal plots look like?
My BCV plots are following:
Trended dispersion estimated:
and only common dispersion:
Why should I use estimateDisp? In manual it is stated that I can alternatively use another functions:
Alternatively, one can use the following calling sequence to estimate them one by one. To estimate common dispersion:
To estimate trended dispersions:
To estimate tagwise dispersions:
Because
estimateDisp
provides protection against zero fitted values, gives a more stable likelihood landscape (due to hot-starting at every step of the grid search), gives a more graduated trend from local fitting compared to binning, and can estimate the necessary prior degrees of freedom for EB shrinkage from the data. Oh, and it's under active development. I think that we're intending forestimateDisp
to subsume the functionality of the other estimation functions... eventually.If I use estimateDisp on another dataset instead of those three commands, I get following BCV plot:
and if I use those three commands instead, I get:
So, the first pic produced by estimateDisp seemed a bit strange to me, so I decided to avoid this function and use the three command instead. What do you think?
There's no problem. As Aaron explained, the newer pipeline uses the data to estimate the prior degrees of freedom for the tagwise dispersions. The older pipeline simply uses 10 prior df unconditionally unless you tell it to use some other value. If you look at your
DGEList
object after runningestimateDisp
, it probably has a very large or even infinite prior df, indicating that there is insufficient evidence of gene-specific dispersions in your dataset, so that each gene simply uses the trend instead of getting its own dispersion. This is likely to happen if you have very few samples and/or low sequencing depth. In either case, theestimateDisp
result is more honest about what information you do and do not have about each gene's dispersion.But what does it mean regarding the DE genes then if my BCV looks like the first picture?
I'm not sure what you're asking. In general, higher prior d.f. means that you are able to share more information across genes. This improves the precision of the dispersion estimates and increases power to detect DE genes. There's no cause for alarm based on this plot; the mean-dispersion trend has a nice decreasing shape and the values on the y-axis are fairly low. The fact that every dispersion is shrunk to the trend just means that the genes in your data have dispersions that don't vary much from the trend. The only extra thing I can think of is to set
robust=TRUE
, to protect against overzealous shrinkage of outlier dispersions.Just a small question regarding those functions. The RUVseq manual uses following:
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
Also, DiffBind does not use estimateDisp function in the edgeR analysis.
What is the motivation that they use the functions separately then?
The code for those packages was likely written before the
estimateDisp
function was introduced into edgeR.Would it be advisable to use glmQLFit over glmFit as I have only two replicates per condition?
That's fine. The whole point of information sharing via EB shrinkage (of the QL or NB dispersions) is to overcome problems due to low numbers of replicates. Obviously, though, the more replicates the better.
As far as I understood the main difference between QL and NB dispersions is that QL accounts for gene-specific variability, meaning the more replicates the better. If I have only two replicates, there is no point to use QL, right? Or when one should use QL and when NB?
EB shrinkage of both QL and NB dispersions can account for gene-specific variability. However, the QL approach accounts for the uncertainty of the estimates in a more statistically rigorous manner. This ensures that type I error control is maintained during testing. In contrast, shrinkage of the NB dispersions is a bit less rigorous; the NB distribution doesn't have a conjugate prior for the dispersion parameter, and the uncertainty of dispersion estimation is difficult to model.
Besides, you can still have gene-specific variability with two replicates. If one gene is variable between the two replicates, and another is not variable, then - behold - you have gene-specific variability.