As you've realized, filtering on raw counts can be problematic. For example, if you require at least n libraries to be above some raw count threshold, then you'll select for genes that are (possibly spuriously) upregulated in the group with larger libraries. Similarly, if you filter on the sum of counts across all libraries for each gene, you will favour genes that have higher expression in the larger libraries:
> a <- rnbinom(10000, mu=1000, size=20) # large library
> b <- rnbinom(10000, mu=100, size=20) # small library
> keep <- a + b > 1700 # row sum filter (arbitrary)
> median(log(a/b/10)) # should be near zero, as no DE present
[1] 0.004938763
> median(log(a/b/10)[keep]) # systematic bias towards larger library
[1] 0.4951755
This is because of the mean-variance relationship, whereby you're more likely to get a large count sum if the counts are randomly greater in the larger libraries. Check out our 2014 paper for some more simulations, albeit in the context of peak calling for ChIP-seq data.
All the examples of filtering in the edgeR user's guide involve the use of CPMs, so differences in library sizes between samples or groups shouldn't pose a problem. In my analyses, I like to filter on the average abundance (as computed with the aveLogCPM function), which also accounts for differences in library sizes between samples. If you stick to either of these strategies, you should be okay.
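For illustration, here is a minimal sketch of both strategies, assuming a DGEList called y; the specific cutoffs (CPM above 1 in at least 2 samples, or average log2-CPM above 0) are arbitrary choices for the example and would need to be tuned to your sequencing depth:
> library(edgeR)
> keep <- rowSums(cpm(y) > 1) >= 2 # CPM-based filter: expressed in at least 2 samples
> y <- y[keep, ]
> keep <- aveLogCPM(y) > 0 # or an average-abundance filter on aveLogCPM
> y <- y[keep, ]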
How likely is that filtering bias to be a problem if there are more than two libraries?
The use of only two libraries was a simplified example to make the problem obvious. It's just as likely to happen with more libraries. In all cases, the systematic bias would be toward larger libraries. If library sizes are unequal between treatment groups, this bias translates to bias in log-fold change estimation between groups.
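To make that concrete, here is a hedged extension of the simulation above to three libraries per group with unequal sizes between groups (the specific numbers are arbitrary); the row-sum filter again shifts the retained genes towards the group with the larger libraries:
> big <- replicate(3, rnbinom(10000, mu=1000, size=20)) # group with large libraries
> small <- replicate(3, rnbinom(10000, mu=100, size=20)) # group with small libraries
> rs <- rowSums(cbind(big, small))
> keep <- rs > quantile(rs, 0.99) # arbitrary row sum filter keeping the top 1%
> logfc <- log(rowMeans(big)/rowMeans(small)/10) # should be near zero, as no DE present
> median(logfc) # close to zero before filtering
> median(logfc[keep]) # shifted upwards after filtering, favouring the larger libraries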
Is it common for the treatment to be correlated with the library size? If not, the bias should level off given decent replication.
Well, you never know. It depends on how you sequenced your samples. If the operator made an attempt to match the cDNA quantities from each sample before sequencing, then you'd expect it to even out between treatments. However, that matching isn't always accurate - in particular, the quantification you get from spectrophotometry, a TapeStation/Bioanalyzer, etc. doesn't always match up with the amount that actually gets sequenced. At the very least, different transcript compositions may make samples from one treatment easier to sequence than others. As a result, you can end up with correlations between the treatment condition and the library size, e.g., if some treatments result in smaller/fewer cells with less RNA, or if the treatment itself affects library preparation efficiency.
Why bother with a filtering method that is susceptible to such biases when you could use an equally simple method that is not? I've already said this to you in other contexts, but I'll reiterate it here: assuming equal or nearly equal library sizes in NGS (or using a method that relies on such an assumption) is a recipe for disaster.
Ideally, I'd like to identify a simple tool for filtering lowly expressed features. In fact, I thought that such a tool existed: I have seen the same RPKM cutoffs (0.5, 1, 5, 10) used in a few publications, so I thought that was it. Raw-count-based cutoffs also have the advantage of simplicity, because they are not tied to any normalization method.
Now it seems like things have become a lot more complicated. First, in the limma pipeline it is hard to establish commonly accepted cutoffs for CPM, because limma's CPM can have up to four flavors depending on how the effective library size is computed (raw, TMM, UQ, RLE). On top of that, this post (A: voom for spectral counts) essentially says that the filtering rule is also tied to the underlying statistical model, because another purpose of filtering is to make sure that the model is fed "nice" data. Is all that really necessary, or is there some hope of simplifying things a bit?
Unfortunately, the appropriate filtering threshold is indeed dependent on both the data (e.g. sequencing depth) and on what you want to do with it. An RPKM-based filter is definitely not suitable for methods based on counts or CPM. Any abundance-based filter must take the sequencing depth into account: a lower sequencing depth requires a higher CPM threshold. And as Aaron pointed out in the post you link to, filtering for voom is indeed tied to the need to avoid feeding voom and limma data that are a poor fit to their statistical models (spuriously low variances due to a lack of distinct values at low counts, and the discreteness/non-normality of low counts, respectively). Filtering based on raw counts doesn't avoid any of these issues.
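As a rough illustration of that depth dependence (the library sizes below are made up for the example), the CPM corresponding to a fixed count of about 10 reads varies several-fold with sequencing depth:
> depth <- c(5e6, 20e6, 50e6) # hypothetical library sizes
> 10 / depth * 1e6 # CPM equivalent of ~10 reads at each depth
[1] 2.0 0.5 0.2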
In any case, if you're more comfortable filtering on a raw count scale, you could choose your preferred raw count threshold and convert it to the equivalent CPM threshold by dividing by the geometric mean of the normalized library sizes (in millions).
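A minimal sketch of that conversion, assuming an edgeR DGEList called y and a hypothetical raw count threshold of 10:
> library(edgeR)
> y <- calcNormFactors(y)
> eff.lib <- y$samples$lib.size * y$samples$norm.factors # effective library sizes
> cpm.threshold <- 10 / exp(mean(log(eff.lib))) * 1e6 # divide by geometric mean, rescale to per-million
> keep <- aveLogCPM(y) > log2(cpm.threshold) # apply on the log2-CPM scale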
Lastly, I'll mention the independent filtering method used by DESeq2, which automatically selects a threshold on the average normalized count that maximizes the number of genes significant at a specified FDR level. (Edit: this has changed in recent versions; see Mike Love's reply below.) It wouldn't be too hard to implement that for any other method. If you're looking for an "auto-pilot" for filtering, that's probably as close as you're going to get.
But this method is still vulnerable to weird edge cases that are common in practice. For example, if no genes are significant, every threshold is equally valid by this criterion and the threshold chosen is essentially random. Also, implementing this method with voom could be tricky, since you would want to re-estimate the mean-variance trend after filtering, and a trend distorted by poorly-fitting low-count data could in turn throw off the threshold selection, so you would have to test how well this works in practice.
To go a bit further down the rabbit hole: aveLogCPM (in edgeR) is an independent filter statistic for NB-distributed counts, assuming you've used the right dispersion and offsets. So, depending on what statistical model you use, the definition of the correct filter will change. In practice, the density of genes at the filter boundary is usually quite low for RNA-seq data (i.e., most genes are either decently expressed or they aren't), so the analysis tends to be robust to reasonable choices of filter statistic. Still, if you can easily protect against differences in library size between your samples, why wouldn't you?
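As a rough sketch of what such an "auto-pilot" might look like outside of DESeq2 (purely illustrative; it assumes a DGEList y and a design matrix design, and uses the edgeR quasi-likelihood pipeline), you could scan a grid of aveLogCPM cutoffs and pick the one that maximizes the number of discoveries at 5% FDR:
> ave <- aveLogCPM(y)
> cutoffs <- quantile(ave, probs=seq(0, 0.5, by=0.05)) # candidate filter thresholds
> nsig <- sapply(cutoffs, function(cut) {
+     keep <- ave > cut
+     fit <- glmQLFit(estimateDisp(y[keep, ], design), design)
+     res <- glmQLFTest(fit) # tests the last coefficient by default
+     sum(p.adjust(res$table$PValue, method="BH") < 0.05)
+ })
> best <- cutoffs[which.max(nsig)] # cutoff giving the most discoveries at 5% FDR
Note that this naive version re-estimates the dispersions at every cutoff, which is slow, and it inherits the edge case described above when nothing is significant.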
just a few comments:
wrt "It wouldn't be too hard to implement that for any other method.", it is a separate package actually, with p-values and covariate as input: genefilter.
wrt "if no genes are significant, every threshold is equally valid by this criterion": after receiving feedback from users on too greedy independent filtering for datasets with no DEG, I fixed this in DESeq2 in version 1.10. The lowest value of the threshold is chosen which is within noise from the maximum. See one the of items for 1.10 in NEWS. In addition, a more sophisticated option is being worked on in the Huber group.
Good to know. I wasn't aware of either of those.