Hi,
This is a pretty well-discussed subject; nevertheless, every time I analyze some new form of RNA-seq data I hit the issue of how to filter lowly expressed genes in a differential expression analysis.
My data are read counts of microRNAs, which have a somewhat lower expression range than mRNAs.
I have 4 experimental conditions (4 genotypes) with 3 samples each, and I'm using limma for the differential expression analysis.
If I follow the limma guide and keep exons that have more than 1 cpm in at least 3 samples, I lose quite a lot of microRNAs. Some of them are real signal, since these 3 samples may all be from the same genotype that is down-regulated.
Perhaps a more sensible filtering approach is to set to zero all samples of a certain experimental condition for which 3 or more samples have cpm <= 1. The problem here is that the cutoff is arbitrary, so genes that fall slightly below the cutoff in one condition (and are set to 0) but slightly above it in another (and are left as they are) will show up as false positives.
So my question is: is there a happy medium?
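For concreteness, here is a minimal sketch of the "more than 1 cpm in at least 3 samples" rule described above. It is plain Python rather than R, so it does not call limma or edgeR; the function names `cpm` and `keep_features`, the feature labels, and the toy counts are all made up for illustration, and library sizes are taken as simple column sums with no normalization.

```python
def cpm(counts):
    """Counts per million, using each sample's column sum as its library size."""
    n_samples = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [[1e6 * row[j] / lib_sizes[j] for j in range(n_samples)]
            for row in counts]


def keep_features(counts, min_cpm=1.0, min_samples=3):
    """True for each feature with cpm > min_cpm in at least min_samples samples."""
    return [sum(value > min_cpm for value in row) >= min_samples
            for row in cpm(counts)]


# Toy data (hypothetical): 4 genotypes x 3 replicates = 12 samples.
names = ["abundant", "one_genotype_only", "two_samples_only", "low_everywhere"]
counts = [
    [2_000_000] * 12,       # dominates library size (about 2 million reads per sample)
    [5, 6, 4] + [0] * 9,    # above 1 cpm in all 3 replicates of one genotype
    [5, 6] + [0] * 10,      # above 1 cpm in only 2 samples
    [1] * 12,               # about 0.5 cpm in every sample
]

keep = keep_features(counts)
for name, flag in zip(names, keep):
    print(f"{name}: {'keep' if flag else 'drop'}")
```

In this toy case the rule keeps the feature that is expressed in all three replicates of a single genotype and drops the one that exceeds the cutoff in only two samples, because the sample threshold equals the size of one group.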
Interesting, could you explain in more detail why this would be a problem?
The decrease in the variance at low abundances is due to the discreteness of the counts near zero, which limits the possible variability of the log-transformed expression values and compromises the accuracy of the linear model. The decrease also messes up the loess fit by making the trend more complicated, as it is no longer monotonically decreasing with abundance; and it interferes with estimation of the prior degrees of freedom, as discreteness results in variance estimates that are much more precise than expected (simply because the estimates are constrained by small count sizes, so they can't vary much).
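The effect of discreteness on the variance can be illustrated with a small exact calculation. This is only a sketch under strong assumptions: counts are treated as pure Poisson (no biological variation), the log transform is log2(count + 0.5), and the function name `log_count_variance` is hypothetical. It is not what limma computes internally; it only shows why the variance of log-transformed values drops near zero.

```python
import math


def log_count_variance(mu, offset=0.5):
    """Variance of log2(Y + offset) for Y ~ Poisson(mu), summed over the pmf."""
    k_max = int(mu + 10 * math.sqrt(mu) + 20)  # truncation point, far into the tail
    p = math.exp(-mu)                          # P(Y = 0)
    mean = 0.0
    mean_sq = 0.0
    for k in range(k_max + 1):
        if k > 0:
            p *= mu / k                        # P(Y = k) from P(Y = k - 1)
        x = math.log2(k + offset)
        mean += p * x
        mean_sq += p * x * x
    return mean_sq - mean ** 2


variances = {mu: log_count_variance(mu) for mu in (0.05, 0.5, 2.0, 10.0, 100.0)}
for mu, v in variances.items():
    print(f"mean count {mu:>6}: variance of log2(count + 0.5) = {v:.4f}")
```

Under these assumptions the variance is small at a mean count of 0.05 (almost every count is 0), larger around a mean count of 2, and small again at 100, so the trend rises and then falls instead of decreasing monotonically with abundance.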