How to deal with low-depth RNA-Seq within DESeq2 or edgeR
2
0
Entering edit mode
David R ▴ 90
@david-rengel-6321
Last seen 7 weeks ago
European Union

Hi,

I am dealing with RNA-Seq data for which the PI of the project would like to have DEGs, but I am having second thoughts about the nature of the data and its suitability for differential analysis.

The data come from infected plants, in which only reads from the infecting microorganism have been kept. There are two time points in the infection, one of them a very early one. The number of reads for each sample is very variable, ranging between 100K and 8M reads. Only 5K genes present at least 3 reads in at least one condition (all replicates for one condition considered). This is due to the fact that, especially at early stages, the proportion plant/microorganism is very low in the sequenced library.

Overall, I think that the data are hardly suitable for differential analysis because I assume that the distribution of the genes will not be the same in all samples and neither can we assume the hypothesis that most genes are not DE.

One alternative I would have tried is to treat plant and microorganism data altogether. Should this help stabilizing the data?

It should be noted that no House Keeping Genes are available for the species I am talking about. I mention this cause some authors use HKG under similar circumstances as ours In order to normalize the data.

Any help or advice will be most greatly appreciated,

David

rnaseq normalization differential gene expression edger deseq2 • 4.6k views
ADD COMMENT
3
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 16 hours ago
The city by the bay

From what you've described, you're in a spot of bother. The problem is not so much the difference in the library sizes - this can be handled by the model - but rather, the fact that you're unwilling to assume that most genes are not DE. This limits your options for normalization, as both TMM and DESeq's size factor method cannot be used. I guess you don't have spike-in RNA either, which is a standard strategy (in single-cell RNA-seq, at least) for getting rid of technical biases in cells that have very different transcriptomic profiles. So, you have two options:

  • Normalize by library size, and hope for the best. Technically speaking, library size normalization assumes that the only difference between samples is due to sequencing depth - almost any DE will introduce composition biases, excepting some rather artificial scenarios where the upregulation for some genes perfectly cancels out the downregulation for other genes. This strategy errs on the side of caution and will under-normalize your data, but it shouldn't do anything horribly wrong.
  • Try to define some house-keeping genes a priori. For example, constitutively expressed genes like ribosomal proteins should be okay, as well as genes involves core metabolic processes (glycolysis, perhaps - I can't remember my biochemistry). This might make the assumption of a non-DE majority more reasonable if you only apply TMM to these genes. However, it depends on having some good annotation for your organism, as well as enough knowledge about what processes are "constant enough".

I would probably go with the first option, simply because I don't know enough biology; and then take the DE results with a tablespoon of salt, at least with respect to looking for changes in expression. (A DE analysis following library size normalization will instead find changes in the proportion of reads assigned to each gene between conditions. This is not the same as a standard DE analysis, or as interpretable due to the interaction with all other genes.) There's no point throwing the plant data in for normalization, because then you introduce another factor, i.e., the ratio of plant RNA to microbial RNA, which doesn't help in correcting for biases in the microbe counts.

Another problem is with your filtering, which is not blind to the condition that each library comes from. Because your libraries are so imbalanced, the genes you retain after filtering are more likely to be upregulated in your later time point, i.e., it's a lot easier to get a gene with 3 or more reads in the 8M libraries compared to a gene with 3 or more reads in the 100K libraries. This is problematic as you'll bias the downstream results when you filter in a manner that's not independent of the test statistics. It'd be safer to filter on the CPMs, or to use the aveLogCPM function to compute an average abundance for filtering - see the edgeR user's guide for details.

ADD COMMENT
0
Entering edit mode

Hi Aaron,

Thanks for your helpful reply. You're absolutely right about filtering on cpms and not on counts, I'll do that. In order to give you an idea of the idea concerning my issues with normalization and DEGs, out of the 5.7K genes kept; 3.2 were declared DE. Far too many, I assume. On the other hand, you rightly mention that the pipeline accounts for lib size differences. However, I would like to point out that, in this particular case, size differences may not come, at least not primarily, from library synthesis or sequencing bias, but from differences of the spread of the pathogen in planta between time points. In other words, even if sequencing had provided exactly the same amount of reads per library, I would still have got big count differences on the pathogen because the proportion plant/pathogen is much bigger at the early time point. That is why I was thinking of getting the plant data, in order to have a better understanding of what is really going on. I’ve also managed to get data on five putative HKGs, so I may explore that as well.

Thanks again,

David

ADD REPLY
0
Entering edit mode

Well, 3.2K out of 5.7K is quite drastic but not cause for alarm, in and of itself. We get similar proportions when comparing between unrelated cell types or cancer/normal cell lines. As long as the (true) number of DE genes changing in each direction is similar, TMM normalization should still be functional - calcNormFactors can tolerate 60% of DEGs as long as they are evenly split in each direction. (Note that I'm referring to the true, not the observed number of DE genes in each direction, because getting the latter numbers already assumes that your normalization was correct.)

As for the plant data; this will not help with normalizing the microbe counts. I don't know what you're planning, but there are three possible outcomes from including the plant counts in the TMM procedure (or any related method, really):

  1. There are many more microbe genes than plant genes, and these dominate the normalization factor calculation; you end up with the same result as if you had just normalized with the microbe genes alone.
  2. The plant genes dominate the calculation. This means that the normalization factors will focus on removing biases in the plant counts. Thus, they not be able to account for changes in the total amount of microbial RNA between time points, leading to lots of "upregulation" in microbial genes at the later time point. I presume this is not interesting as you would then detect DE even if the per-cell expression of each microbial gene does not change.
  3. Both sets of genes contribute to the calculation, in which case you get an amaglam of the first two scenarios. This is not interpretable - what biases are being corrected? And to what extent? Who knows?
ADD REPLY
0
Entering edit mode
@ryan-c-thompson-5618
Last seen 3 months ago
Icahn School of Medicine at Mount Sinai…

It is not required for every sample to have the same sequencing depth in order to assess differential expression. When you say that only 5k genes from the microorganism have sufficient reads, what fraction of the microorganism's genome is that?

In any case, it is important to precisely define what you mean by differential expression in this context, so you can choose a normalization method that supports your desired analysis. If the total fraction of reads coming from the microorganism increases from one condition to another, would you regard that as upregulation of all the microorganism's genes? My guess is probably not, in which case you need to normalize out those changes. This means that you don't want to normalize to the abundance of the host genes. So as a start, subset to only the microorganism genes and normalize as usual, using TMM. Then filter based on the output of aveLogCPM on the full data set. I recommend looking at a histogram of the aveLogCPM values and choosing a threshold between the signal mode and noise mode. Then perform your differential expression analysis, and look at your MA plot. If the low-expression genes' fold changes are biased in the direction of the more deeply-sequenced group, then you likely need to filter more stringently.

Ultimately, you will probably filter out a lot of genes due to low counts. So when you deliver the list of DEGs, you need to be very clear that this is an incomplete list due to the very limited sequencing depth.

ADD COMMENT
0
Entering edit mode

Hi Ryan

Thanks for your message and your detail on filtering on Avelogcpm. I’ll explore that. I’ve given a reply to Aaron’s comment here above. But I would like to add that the total genome has 37K annotated genes. However, annotation is far from optimal, and most probably, “real” genes should be fewer. Indeed, poor annotation of this spp is another issue, and it won’t be improved in the short term

Thanks again,

David.

 

ADD REPLY

Login before adding your answer.

Traffic: 565 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6