Question

edgeR/DESeq2 normalization for differential expression in RNA-seq blood samples

Dear all,

Both DESeq2 and edgeR normalization methods take into account different library sizes and RNA composition between samples, but are they able to account for large differences in hemoglobin content between human blood samples? In my experiment, a single globin gene consumes between 2% and 50% of the sequencing effort, depending on the sample.

Thanks,

hemoglobin edgeR deseq2 normalization rnaseq

updated 6.2 years ago by Ryan C. Thompson • written 6.2 years ago by aec

score 3 · Answer 1 · 2019-02-15

Methods like TMM normalization (in calcNormFactors()) or median-based normalization (in DESeq) were designed for exactly the scenario you describe. If one sample has more hemoglobin mRNAs, the coverage of all other genes will be suppressed when the total amount of sequencing resource is fixed. This is what is known as "composition bias", and is the whole motivation for computing scaling factors (normalization factors in edgeR, size factors in DESeq - note that these are not the same thing!) that are not simply derived from the library size of each sample.
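To make the composition-bias argument concrete, here is a toy Python sketch. Everything in it is made up for illustration: the gene names, the 1,000,000-read depth and the `simulate` helper are hypothetical, and the median-of-ratios step is only a simplified stand-in for a composition-aware scaling factor, not the actual TMM code in calcNormFactors().

```python
import statistics

# Toy illustration of composition bias (made-up numbers, not real data).
depth = 1_000_000  # assumed: both samples are sequenced to the same depth

# Relative abundance of the non-globin genes, identical in both samples.
non_globin = {"geneA": 4, "geneB": 3, "geneC": 2, "geneD": 1}

def simulate(globin_fraction):
    """Split a fixed read budget between one globin gene and the rest."""
    remaining = depth * (1 - globin_fraction)
    total = sum(non_globin.values())
    counts = {g: remaining * w / total for g, w in non_globin.items()}
    counts["HBB"] = depth * globin_fraction
    return counts

sample_a = simulate(0.02)  # globin takes 2% of reads
sample_b = simulate(0.50)  # globin takes 50% of reads

# Library-size scaling: the totals are equal, so no correction is applied
# and every non-globin gene looks roughly 2-fold lower in sample B.
naive_ratio = {g: sample_b[g] / sample_a[g] for g in non_globin}

# Scaling by the median B/A ratio over all genes (globin included) removes
# the apparent shift, because the single globin gene cannot move the median.
median_ratio = statistics.median(sample_b[g] / sample_a[g] for g in sample_a)
corrected_ratio = {g: r / median_ratio for g, r in naive_ratio.items()}

print(naive_ratio)
print(corrected_ratio)
```

In this toy case the naive ratios for the non-globin genes all sit near 0.51, while the corrected ratios sit at 1, which is the behaviour the scaling factors are meant to deliver.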

In practice, the success of normalization depends on the presence of sufficiently large counts for all the non-hemoglobin genes. All of these methods operate on ratios, and once the counts get too small, the ratios become unstable or undefined, requiring some ad hoc workarounds to avoid nonsensical scaling factor estimates. You can check that this is not the case in your data by creating MA plots for each sample (e.g., with plotSmear); if you see lots of discrete lines or patterns on the left, your counts are probably too low.
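As a rough illustration of why low counts produce discrete patterns on the left of an MA plot, the following Python sketch computes M (log2 ratio) and A (average log2 abundance) for one gene. The `ma_values` helper and the library sizes are hypothetical, and this is not how plotSmear computes its values internally.

```python
import math

def ma_values(count, ref_count, lib_size, ref_lib_size):
    """Return (M, A) for one gene: log2 ratio and mean log2 abundance.

    Both counts must be positive; a zero count makes the ratio undefined.
    """
    p = count / lib_size          # proportion in the sample of interest
    q = ref_count / ref_lib_size  # proportion in the reference sample
    m = math.log2(p / q)
    a = 0.5 * (math.log2(p) + math.log2(q))
    return m, a

# With counts of 1 to 3 in both samples, M can take only a handful of values,
# which is what shows up as discrete lines at low abundance.
low_m = sorted(
    {round(ma_values(c, r, 1e6, 1e6)[0], 6) for c in (1, 2, 3) for r in (1, 2, 3)}
)
print(low_m)
```

Nine count pairs collapse onto seven distinct M values here, and a single extra read moves a gene from one value to the next, so ratios in this range carry little information for estimating scaling factors.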

The other consideration is that, if the proportion of hemoglobin is highly variable, you will want to set robust=TRUE in downstream edgeR functions for dispersion estimation. This ensures that the increased variance of the hemoglobin genes will not inflate the apparent variability of the variances during empirical Bayes shrinkage. One could also filter out the hemoglobins entirely from the analysis, though this may not be sufficient; such high variability in hemoglobins can be a symptom of an underlying source of variability that affects other genes.

score 3 · Answer 2 · 2019-02-15

Aaron's answer adequately covers the theoretical reasons that the normalizations used in edgeR and DESeq2 are appropriate for data with variations in globin content, so I will just add that empirically, I have actually used edgeR on such a data set. Specifically, we were testing a custom globin blocking protocol, so by design there were large differences in globin content between the globin-blocked samples and the non-globin-blocked control samples. The normalization performed exactly as desired. You can see the resulting MA plot here, showing all the globin genes with large negative fold changes and all other genes centered around zero, indicating proper normalization:

https://darwinawardwinner.github.io/resume/examples/Salomon/globin/figure4%20-%20maplot-colored.pdf

score 0 · Answer 3 · 2019-02-15

Yes, I think so: normalization in DESeq2 and edgeR does take library composition into account (meaning differences in composition between samples). In DESeq2, normalization is done via a per-sample scaling factor (the size factor). Each gene's count is compared with the geometric mean of that gene across samples, and the geometric mean is less sensitive to outliers than the arithmetic mean. The scaling factor is then the median of these ratios over all genes in the sample, which puts more emphasis on housekeeping / moderately expressed genes than on a few extreme ones.
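A minimal Python sketch of the median-of-ratios idea described above, on made-up counts. The `size_factors` function and the toy data are hypothetical, and this is a simplified illustration rather than DESeq2's own code; in particular, skipping genes with a zero count in any sample is a choice made for this sketch.

```python
import math
import statistics

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of per-gene counts,
            with genes in the same order for every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log geometric mean of each gene across samples; genes with a zero
    # in any sample are skipped (marked None) in this sketch.
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - lgm
            for g, lgm in enumerate(log_geo_means)
            if lgm is not None
        ]
        # The median ignores a few extreme genes such as globin.
        factors[s] = math.exp(statistics.median(log_ratios))
    return factors

# Five ordinary genes plus one globin-like gene in the last position.
toy = {
    "low_globin":  [100, 200, 300, 400, 50, 1000],
    "high_globin": [50, 100, 150, 200, 25, 50000],
}
sf = size_factors(toy)
print(sf)
```

The two size factors differ by a factor of two, matching the two-fold difference in the ordinary genes, even though the globin-like gene differs fifty-fold in the opposite direction.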

I am not sure what your biological background is, but couldn't one just deplete the hemoglobin transcripts beforehand, as is done with ribodepletion? I do not know if there are any papers on this issue, but if you already have some data, you could check the effect yourself.