Hi,
I need an opinion on whether to use normalized counts from EDASeq, or raw counts together with the EDASeq offset values, in the edgeR GLM framework for differential expression. I read the edgeR manual, where it says:
"The correction factors may take the form of scaling factors for the library sizes, such as computed by calcNormFactors, which are then used to compute the effective library sizes. Alternatively, gene-specific correction factors can be entered into the glm functions of edgeR as offsets."
It is also mentioned that estimateCommonDisp and estimateTagwiseDisp require the library sizes to be equal for all samples for exactTest. Does this apply to GLM dispersion estimation as well? If so, it seems I would have to use calcNormFactors() whenever the library sizes are not equal, whether or not normalized offset values are provided. But it is also mentioned that an alternative to calcNormFactors is to use offset values from other software such as cqn or EDASeq. How should I run differential expression on values normalized by EDASeq? Should I give edgeR the raw counts with offset values, or the normalized counts from EDASeq?
Thanks,
Rahil
Thanks Gordon for clarifying. Maybe I misunderstood what is said in the User Guide. This is what it says on page 13 (Section 2.6.7, Pseudo-Counts):
"In general, edgeR functions work directly on the raw counts. For the most part, edgeR does not produce any quantity that could be called a “normalized count”. An exception is the internal use of pseudo-counts by the classic edgeR functions estimateCommonDisp and exactTest. The exact negative binomial test [20] computed by exactTest and the conditional likelihood [20] used by estimateCommonDisp and estimateTagwiseDisp require the library sizes to be equal for all samples."
Regarding your fourth point, I got the same impression after reading the User Guide, but then I saw your reply to someone who was trying to use normalized counts from EDASeq as input to edgeR, and you did not give that person a warning:
problem with aveLogCPM.default in edgeR
In EDASeq, when the data object contains both normalized and raw counts along with the offset values, the exprs() function returns the normalized counts instead of the raw counts, as far as I could check on my data. That is why I wanted to confirm this once and for all.
Yes, you have jumped to some incorrect conclusions from what is said in the edgeR documentation. The internal computation of conditional likelihood does use equal library sizes, but this is just for mathematical convenience and computational speed. The user-level edgeR functions themselves do not make this assumption (because they do the necessary equalization internally). I am the senior author of the estimateCommonDisp and estimateTagwiseDisp functions and I wrote the section of the User's Guide that you are quoting, so it might be constructive to assume that I am telling you the truth.
Regarding the earlier post that you link to, I advised the questioner to pass an offset matrix from EDASeq to edgeR, which is the same advice I am giving to you. I did not give any warning about "normalized counts" because I assumed that EDASeq was normalizing by offset and not changing the counts themselves.
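To illustrate what "normalizing by offset" means, here is a minimal sketch in Python rather than R. It assumes a Poisson log-linear model (a simplification; edgeR fits negative binomial GLMs), one gene, and made-up counts and library sizes. The helper name group_log_rate is hypothetical and is not an edgeR or EDASeq function. The raw counts are left untouched, and the offset enters the model as log(mu_i) = beta + offset_i.

```python
import math

def group_log_rate(counts, offsets):
    """Poisson MLE of beta in log(mu_i) = beta + offset_i for one gene in one group.

    Setting the score to zero gives exp(beta) = sum(y_i) / sum(exp(offset_i)).
    """
    return math.log(sum(counts) / sum(math.exp(o) for o in offsets))

# Hypothetical data for a single gene: two control and two treated samples.
lib_sizes = [1_000_000, 4_000_000, 2_000_000, 2_000_000]
counts = [10, 40, 40, 40]

# Here the offset is simply the log library size. A gene-specific offset
# matrix would supply a different value per gene and per sample instead.
offsets = [math.log(n) for n in lib_sizes]

beta_control = group_log_rate(counts[:2], offsets[:2])
beta_treated = group_log_rate(counts[2:], offsets[2:])

# Log2 fold change of treated over control, estimated from the raw counts.
log2_fold_change = (beta_treated - beta_control) / math.log(2)
print(round(log2_fold_change, 3))  # 1.0
```

The control rate is 50 / 5,000,000 and the treated rate is 80 / 4,000,000, so the estimated fold change is 2 even though the library sizes are unequal and no count was rescaled.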
Perhaps I should clarify that EDASeq returns *both* an offset, to be used in supervised problems such as differential expression (for instance, by passing it on to edgeR), *and* the normalized counts, to be used for visualization and unsupervised problems. The functions counts() and normCounts() can be used to access the original and normalized counts, respectively (see the help page for SeqExpressionSet-class). The use of "exprs()" is deprecated in EDASeq.
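As a rough illustration of why rescaled counts suit visualization better than count-based testing, here is a small Python sketch. It assumes Poisson-distributed raw counts (a simplification of the negative binomial model) and uses made-up numbers for one gene in two samples. It relies only on the fact that scaling a variable by c multiplies its variance by c squared.

```python
# Hypothetical expected raw counts for one gene in two samples that share
# the same underlying expression rate but differ in sequencing depth.
raw_means = [10.0, 40.0]
target_mean = 25.0  # value both samples are rescaled to

scaled_stats = []
for mean_raw in raw_means:
    scale = target_mean / mean_raw       # per-sample scaling factor
    mean_scaled = scale * mean_raw       # mean of the rescaled count
    var_scaled = scale ** 2 * mean_raw   # Poisson variance equals the mean
    scaled_stats.append((mean_scaled, var_scaled))

for mean_scaled, var_scaled in scaled_stats:
    print(mean_scaled, var_scaled)
# 25.0 62.5
# 25.0 15.625
```

Both rescaled values have mean 25, but their variances differ fourfold, so they no longer follow the mean-variance relationship that a count model assumes. Keeping the raw counts and supplying the normalization as an offset avoids this.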