hi,
the documentation of the aveLogCPM() function from edgeR says that this function is similar to:
log2(rowMeans(cpm(y, ...)))
Because CPM values may be more dispersed than logCPM values, and the arithmetic mean is sensitive to outliers, wouldn't it make more sense for aveLogCPM() to resemble rowMeans(cpm(y, log=TRUE, ...)) instead?
In the old microarray days one would take the average of the log-fluorescence units to look at average expression, so I had expected that aveLogCPM() would also take the mean on the logarithmic scale. I guess there's a good reason for this, but I could not figure it out from the documentation.
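To make the two summaries in the question concrete, here is a minimal base-R sketch on made-up CPM values. The matrix `cpm_mat` and the gene names are hypothetical. The code only mimics the two expressions quoted above and does not reproduce what aveLogCPM() does internally.

```r
# Toy CPM values (hypothetical): gene1 has one outlying sample, gene2 is stable.
cpm_mat <- rbind(
  gene1 = c(1, 1, 1, 1, 1000),
  gene2 = c(200, 200, 200, 200, 200)
)

# Log of the arithmetic mean: the form quoted from the aveLogCPM() documentation.
log_of_mean <- log2(rowMeans(cpm_mat))

# Mean of the logs: the form suggested in the question.
# No prior count is added here because the toy values contain no zeros.
mean_of_log <- rowMeans(log2(cpm_mat))

# Side-by-side comparison of the two summaries per gene.
print(round(cbind(log_of_mean, mean_of_log), 2))
```

For the stable gene the two summaries agree. For the gene with a single large value, the log of the mean is pulled towards the outlier, while the mean of the logs stays close to the bulk of the samples.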
thanks!!
robert.
thanks! does the use of the NB mean for filtering lowly-expressed genes still apply if, instead of edgeR, I use the limma-voom pipeline for differential expression?
If you're fitting a linear model to log-expression values, the independent filter statistic would technically be the mean log-expression across samples. However, I wouldn't worry about it; indeed, the edgeR user's guide describes different filtering strategies altogether. This is because the choice of filter statistic doesn't matter much for RNA-seq, as the density of genes at the filter boundary is low (i.e., different filtering strategies yield similar sets of retained genes). By comparison, filtering requires more care in applications such as ChIP-seq, where high-abundance regions are less frequent. There, the choice of filter has a much greater effect on the analysis, simply because there are more low-abundance regions that are affected by the behaviour of the filter statistic.
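One way to check how much the filter statistic matters for a given data set is to cross-tabulate the genes retained by each statistic. The sketch below does this in base R on simulated counts. The simulation settings, the `prior` offset and the 10 CPM `threshold` are all assumptions chosen for illustration, and the CPM calculation is done by hand rather than with edgeR's cpm(). What it prints for simulated data says nothing about real RNA-seq or ChIP-seq data; substitute your own count matrix to see where the two filters disagree.

```r
set.seed(1)  # arbitrary seed, only so the toy data are reproducible

n_genes <- 2000
n_samples <- 6

# Hypothetical simulation: gene-wise means spread over several orders of
# magnitude, negative binomial counts with dispersion 0.1 (size = 1 / 0.1).
mu <- 2^runif(n_genes, min = 0, max = 12)
counts <- matrix(rnbinom(n_genes * n_samples, mu = mu, size = 10),
                 nrow = n_genes, ncol = n_samples)

# CPM by hand: divide each column by its library size, scale to one million.
# This ignores normalization factors.
lib_size <- colSums(counts)
cpm_mat <- t(t(counts) / lib_size) * 1e6

prior <- 0.5  # assumed offset to avoid log2(0); not edgeR's prior-count scheme

# Filter statistic 1: log of the mean CPM.
stat_log_of_mean <- log2(rowMeans(cpm_mat) + prior)
# Filter statistic 2: mean of the log CPM.
stat_mean_of_log <- rowMeans(log2(cpm_mat + prior))

threshold <- log2(10)  # assumed cutoff of roughly 10 CPM
keep_a <- stat_log_of_mean > threshold
keep_b <- stat_mean_of_log > threshold

# Off-diagonal cells count the genes on which the two filters disagree.
print(table(log_of_mean = keep_a, mean_of_log = keep_b))
```

Because the logarithm is concave, the mean of the logs can never exceed the log of the mean when both use the same offset. Any gene kept by the second filter is therefore also kept by the first, and disagreements occur only in one direction.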