From what you've described, you're in a spot of bother. The problem is not so much the difference in library sizes - this can be handled by the model - but rather the fact that you're unwilling to assume that most genes are not DE. This limits your options for normalization, as neither TMM nor DESeq's size factor method can be used. I guess you don't have spike-in RNA either, which is a standard strategy (in single-cell RNA-seq, at least) for removing technical biases between cells that have very different transcriptomic profiles. So, you have two options:
- Normalize by library size, and hope for the best. Technically speaking, library size normalization assumes that the only difference between samples is due to sequencing depth - almost any DE will introduce composition biases, excepting some rather artificial scenarios where the upregulation for some genes perfectly cancels out the downregulation for other genes. This strategy errs on the side of caution and will under-normalize your data, but it shouldn't do anything horribly wrong.
- Try to define some housekeeping genes a priori. For example, constitutively expressed genes like those for ribosomal proteins should be okay, as should genes involved in core metabolic processes (glycolysis, perhaps - I can't remember my biochemistry). Applying TMM to only these genes might make the assumption of a non-DE majority more reasonable. However, it depends on having good annotation for your organism, as well as enough knowledge about which processes are "constant enough".
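For the second option, one way to do this in edgeR is to compute TMM factors from the pre-defined housekeeping genes only, while keeping the full library sizes so the factors apply to the whole dataset. A minimal sketch, where `counts` and `hk` are hypothetical names for your microbe count matrix and your vector of housekeeping gene identifiers:

```r
library(edgeR)

# 'counts' is the microbe count matrix; 'hk' contains the row names of
# the putative housekeeping genes (both names are placeholders).
y <- DGEList(counts)

# Compute TMM factors from the housekeeping genes only, but relative to
# the full library sizes, so they can be used for the entire dataset.
y$samples$norm.factors <- calcNormFactors(counts[hk,],
    lib.size=y$samples$lib.size)
```

This keeps the effective library size consistent with the full dataset while only trusting the housekeeping genes for the composition correction.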
I would probably go with the first option, simply because I don't know enough biology; and then take the DE results with a tablespoon of salt, at least with respect to looking for changes in expression. (A DE analysis following library size normalization will instead find changes in the proportion of reads assigned to each gene between conditions. This is not the same as a standard DE analysis, nor as interpretable, due to the interaction with all other genes.) There's no point throwing the plant data in for normalization, because that just introduces another factor, i.e., the ratio of plant RNA to microbial RNA, which doesn't help in correcting for biases in the microbe counts.
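To see why a DE analysis after library size normalization picks up changes in proportions rather than in absolute expression, consider a toy example with made-up counts:

```r
# Two samples; genes 2-5 have identical absolute expression, while
# gene 1 is 10-fold upregulated in sample B (numbers are made up).
A <- c(100, 100, 100, 100, 100)
B <- c(1000, 100, 100, 100, 100)
cpm.A <- A/sum(A) * 1e6  # 200000 for every gene
cpm.B <- B/sum(B) * 1e6  # ~714286 for gene 1, ~71429 for the rest

# After library size normalization, genes 2-5 look "downregulated",
# even though their absolute expression has not changed, because
# gene 1 now soaks up a larger share of the reads.
```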
Another problem is with your filtering, which is not blind to the condition that each library comes from. Because your libraries are so imbalanced, the genes you retain after filtering are more likely to be upregulated at your later time point; it's a lot easier for a gene to get 3 or more reads in the 8M libraries than in the 100K libraries. This is problematic, as you'll bias the downstream results when you filter in a manner that's not independent of the test statistics. It'd be safer to filter on the CPMs, or to use the aveLogCPM function to compute an average abundance for filtering - see the edgeR user's guide for details.
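For example, either filter might look like this (the cutoff values here are arbitrary and should be tuned to your data):

```r
library(edgeR)
y <- DGEList(counts)  # 'counts' is your count matrix (placeholder name)

# Filter on CPMs: keep genes with CPM above 1 in at least two libraries.
# A relative cutoff applies equally to deep and shallow libraries.
keep <- rowSums(cpm(y) > 1) >= 2

# Alternatively, filter on the average abundance across all libraries,
# which ignores the condition each library comes from:
keep <- aveLogCPM(y) > 1

y <- y[keep, , keep.lib.sizes=FALSE]
```

Both approaches use the same relative threshold in every library, so the filter no longer favours genes that are upregulated in the deeper libraries.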
Hi Aaron,
Thanks for your helpful reply. You're absolutely right about filtering on CPMs rather than on counts, I'll do that. To give you an idea of my issues with normalization and DEGs: out of the 5.7K genes kept, 3.2K were declared DE. Far too many, I assume. On the other hand, you rightly mention that the pipeline accounts for library size differences. However, I would like to point out that, in this particular case, the size differences may not come, at least not primarily, from library synthesis or sequencing bias, but from differences in the spread of the pathogen in planta between time points. In other words, even if sequencing had provided exactly the same number of reads per library, I would still have got big count differences for the pathogen, because the plant/pathogen proportion is much bigger at the early time point. That is why I was thinking of getting the plant data, in order to have a better understanding of what is really going on. I've also managed to get data on five putative HKGs, so I may explore that as well.
Thanks again,
David
Well, 3.2K out of 5.7K is quite drastic, but not cause for alarm in and of itself. We get similar proportions when comparing between unrelated cell types or cancer/normal cell lines. As long as the (true) number of DE genes changing in each direction is similar, TMM normalization should still be functional - calcNormFactors can tolerate 60% DEGs as long as they are evenly split in each direction. (Note that I'm referring to the true, not the observed, number of DE genes in each direction, because getting the latter numbers already assumes that your normalization was correct.)
As for the plant data: this will not help with normalizing the microbe counts. I don't know what you're planning, but there are three possible outcomes from including the plant counts in the TMM procedure (or any related method, really):