The first consideration is how you will normalize the samples. Library size normalization would be unwise if you're expecting DE in your rRNA genes, given that these make up the bulk of your reads. If you have enough coverage of the rest of the transcriptome, you could reasonably assume that most of those genes are not DE and use them to calculate normalization/size factors, e.g., with calcNormFactors. If not... you're in trouble.
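To make the composition problem concrete, here is a minimal sketch with simulated counts (all gene counts, group labels, and the single dominant "rRNA" gene are hypothetical): one high-abundance gene inflates the library sizes of one group, while TMM factors from calcNormFactors correct for it under the assumption that most genes are non-DE.

```r
library(edgeR)

# Simulated data: 1000 genes, 4 samples; gene 1 stands in for a dominant
# rRNA gene that shoots up in group B, inflating those library sizes.
set.seed(1)
counts <- matrix(rnbinom(1000 * 4, mu = 50, size = 10), ncol = 4)
counts[1, 3:4] <- counts[1, 3:4] + 1e5
group <- factor(c("A", "A", "B", "B"))
y <- DGEList(counts = counts, group = group)

# Library sizes differ between groups purely because of gene 1...
y$samples$lib.size

# ...whereas TMM downweights that composition bias, relying on the
# assumption that most of the remaining genes are not DE.
y <- calcNormFactors(y)
y$samples$norm.factors
```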
The next question is whether you have enough features for empirical Bayes shrinkage to work. I can't remember how many rRNA genes there are, but I didn't think there were that many - 5S, 5.8S, 18S and 28S for eukaryotes (excluding repeats)? This limits the advantage of sharing information between genes. It's unlikely that the rest of the transcriptome will be of much help here, as rRNA genes are probably so highly expressed that they will be in a different part of the mean-dispersion trend compared to all other genes. Nonetheless, if you can get a stable estimate of the trended dispersion (i.e., without overfitting at high abundances for the few rRNA genes), you're good to go.
If you've overcome the two challenges above, then the rest is easy. The fact that the rRNA genes have high abundance is not a problem for hypothesis testing; it just gives you more power to detect differential expression. This is usually a good thing - and in fact, sometimes too good, as one consequence of increased power is that you may get significant DE at (very) small log-fold changes. If you don't want that, consider using glmTreat to impose a minimum threshold on the log-fold change.
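A minimal sketch of thresholded testing with glmTreat, using simulated counts; the design and the log2(1.5) threshold are illustrative assumptions, not from the original post.

```r
library(edgeR)

# Simulated data: 200 genes, two groups of three samples.
set.seed(2)
counts <- matrix(rnbinom(200 * 6, mu = 100, size = 20), ncol = 6)
group <- factor(rep(c("A", "B"), each = 3))
design <- model.matrix(~group)

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)

# Instead of testing H0: logFC = 0, test against |logFC| <= log2(1.5),
# so only genes with fold changes convincingly above 1.5 are called DE.
res <- glmTreat(fit, coef = 2, lfc = log2(1.5))
topTags(res)
```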
First, respond to answers with "add comment", not "add answer", unless you're answering yourself.
Next, the non-DE assumption is exactly that - an assumption that we have to make because we don't have any data that tells us otherwise. As a general rule, if we had information that allowed us to evaluate our non-DE assumption, we wouldn't need to make the assumption in the first place, because we could use that information to get the best possible estimate of the normalization factor. On a related note, there's no reliable way to assess whether your genes are non-DE prior to normalization, as all of the DE statistics (including the log-fold changes) depend on having the correct normalization in the first place.
If you have a reasonably sized subset of genes in which you can make the non-DE assumption, then you can compute normalization factors from those. Just subset the DGEList before applying calcNormFactors (do not set keep.lib.sizes=FALSE during subsetting), and assign the resulting normalization factors back to the original DGEList (see <DGELIST>$samples$norm.factors). I would require at least 1000 genes in the subset to consider the normalization factors stable, though I don't know whether you can easily get to this number in prokaryotes. If you don't have any genes in which the non-DE assumption holds, then you're in trouble. In such cases, you're effectively limited to testing for differential proportions rather than differential expression, using normalization based on the library sizes alone. This is much less desirable/interpretable.

Finally, for the trended dispersion, you should use
plotBCV to see how the mean-dispersion trend is behaving for your high-abundance rRNA genes. A typical example of overfitting might be when the trend is decreasing with increasing abundance but then skews upward to fit the few rRNA genes. The loess smoother should use information from the other genes to stabilize the trend, but the effectiveness of the smoothing really depends on whether you have any coverage of the other genes.
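The subsetting approach described above can be sketched as follows, with simulated counts; the non-DE gene set (is.nonDE), the design, and all numbers are hypothetical placeholders.

```r
library(edgeR)

# Simulated data: 2000 genes, 4 samples.
set.seed(3)
counts <- matrix(rnbinom(2000 * 4, mu = 80, size = 15), ncol = 4)
group <- factor(c("A", "A", "B", "B"))
design <- model.matrix(~group)
y <- DGEList(counts = counts, group = group)

# Suppose the first 1500 genes are ones where the non-DE assumption holds.
is.nonDE <- seq_len(nrow(y)) <= 1500

# Subset WITHOUT keep.lib.sizes=FALSE, so the original library sizes are
# retained, then compute TMM factors from the non-DE subset only.
sub <- calcNormFactors(y[is.nonDE, ])

# Assign the resulting factors back to the full DGEList.
y$samples$norm.factors <- sub$samples$norm.factors

# Check the mean-dispersion trend for overfitting at high abundances.
y <- estimateDisp(y, design)
plotBCV(y)
```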