I understand that, under edgeR, the dge$AveLogCPM after estimateGLMCommonDisp gives the same result as doing aveLogCPM(data, lib.size = dge$samples$lib.size*dge$samples$norm.factors). Hence my question would be: if the aveLogCPM function implements in this case the effective library size, as opposed to the default colSums(data), why then it is not implemented the dge$common.dispersion as well, instead of the 0.05 default value?
Thanks for your reply, but I am not sure thiat is the case cause if I run dge$AveLogCPM, it gives me exactly the same result as aveLogCPM(data, lib.size = dge$samples$lib.size*dge$samples$norm.factors), but different from AveLogCPM(data, lib.size = dge$samples$lib.size*dge$samples$norm.factors, dispersion = dge$common.dispersion).
Technicalities aside (you can't rundge$AveLogCPM, it's not a function), the DGEList method for estimateGLMCommonDisp first calculates the average abundances with the default dispersion of 0.05 because it doesn't yet have the common dispersion to use in the calculations. These average abundances are only used for subsetting to improve efficiency, and then they get re-calculated with the common dispersion for output. I'm not sure how you managed to get average abundances from estimateGLMCommonDisp that don't use the common dispersion estimate.
Edit: After some thought, it seems that you're probably using an old version of edgeR, in which the S3 method for estimateGLMCommonDisp hadn't been updated to re-calculate the average log-CPM with the common dispersion. So if you update to the latest version of edgeR, all these discrepancies should be resolved.
What you're missing with aveLogCPM() is that you're not passing the correct data object to it.
If you compute
A <- aveLogCPM(dge)
and 'dge' is a DGEList object, then the computation will use both the effective library sizes and the common dispersion. In other words, the function does exactly what you expect.
If however you compute
A <- aveLogCPM(data)
and 'data' is just a matrix of counts, how would you expect the function to know what the common dispersion is?
Your original question though relates to the behaviour of the edgeR dispersion estimation functions such as estimateGLMCommonDisp(). None of the edgeR dispersion estimation functions update the dge$aveLogCPM vector as the dispersions are updated. This is a deliberate decision. Instead we compute the dge$aveLogCPM vector right at the beginning of the pipelines, before the common dispersion has been estimated, and then keep it fixed. It would be just too confusing for this vector to keep changing.
The dge$aveLogCPM vector is used for computing trends, for MD plots and so on. By using the default dispersion=0.05, edgeR is already finessing this measure more than other packages do. In principle we could recompute the aveLogCPM component of the DGEList object every time new dispersions are estimated. We could do that for trended and tagwise dispersions as well as when a new common dispersion is estimated. But the potential gain that might be had from updating the aveLogCPM component to match each new estimated dispersion is too small to justify the extra complication and possible confusion that would arise from continually changing it.
Thanks for your reply. I understand what you say about not recalculating AveLogcpm with every subsequent dispersion, and I am happy with it.
However, as for my original question, I still do have my doubts. I agree that running AveLogcpm on the original data cannot use the common dispersion cause it has obviously not been calculated yet, and that is my whole point:
A <- dge.$AveLogCPM)
We agree that in order to have A, I should expect that effective library sizes and common dispersions values would be used.
However, it is seemingly not the case since the very same value A is obtained if I run the line below, in which no common dispersion value is given.
Moreover, and I did this just in order to verify that this was the case, the same value B is obtained if I use dge instead of data on the line above.
I thank you in advance for your comments on this. Apart form my daily work, I would like to provide some training on RNASeq differential analysis using edgeR and I would like to understand as much as possible about all possible details of it.
Actually, if a common dispersion is available in the DGEList object, then that value will be used by aveLogCPM.DGEList - check out the source code. Admittedly, the documentation could have described this better, but it seemed to be a level of technical detail that most users wouldn't care about.
I am sorry to come back to this. I have had a look to the source codes and I see what you meant, when I go through the .DEGlist source codes, it looks OK to me. Nevertheless, I still see an issue in how the $AveLogCPM values are implemented into the dge object. I've tried to be more specific with the lines here below. I'd be very grateful if you could provide me with any help.
Thanks for your reply. I run update.packages last Thursday, hence it should be fine.. I'll have a look when I get back to my desk and my R session and I'll let you know. Thanks again.
You were so right. I had a configuration problem with my new computer. Inherited packages from my old laptop, including edgeR, were in one place but updates were being installed somewhere else. Hence, even if I was updating packages, RStudio kept loading the old versions. I've sorted it out and now everything looks fine. I thank you and I apologize of rthe unnecessary hassle.
Hi Aaron,
Thanks for your reply, but I am not sure thiat is the case cause if I run dge$AveLogCPM, it gives me exactly the same result as aveLogCPM(data, lib.size = dge$samples$lib.size*dge$samples$norm.factors), but different from AveLogCPM(data, lib.size = dge$samples$lib.size*dge$samples$norm.factors, dispersion = dge$common.dispersion).
Am I missing something here?
Thanks again.
Technicalities aside (you can't run
dge$AveLogCPM
, it's not a function), theDGEList
method forestimateGLMCommonDisp
first calculates the average abundances with the default dispersion of 0.05 because it doesn't yet have the common dispersion to use in the calculations. These average abundances are only used for subsetting to improve efficiency, and then they get re-calculated with the common dispersion for output. I'm not sure how you managed to get average abundances fromestimateGLMCommonDisp
that don't use the common dispersion estimate.Edit: After some thought, it seems that you're probably using an old version of edgeR, in which the S3 method for
estimateGLMCommonDisp
hadn't been updated to re-calculate the average log-CPM with the common dispersion. So if you update to the latest version of edgeR, all these discrepancies should be resolved.