Question

limma voom's (and edgeR's) use of scaling factors

biominer ▴ 10

I am using limma-voom for an RNA-Seq dataset with a global down-regulation of gene expression (experimentally confirmed). On top of that, a small set of genes in this dataset is massively(!) up-regulated. I can't use RLE or TMM normalization because they would normalize out the down-regulation.

I created a DGEList object and then set normalization factors to 1 (using edgeR's calcNormFactors function with method "none").

My question is whether normalization by total library size still happens if I then apply the voom function and proceed as usual (or, alternatively, analyze the data with edgeR). Or would I have to provide custom scaling factors to correct for library size?

Originally I tried setting the scaling factors in the DGEList object to the ratio (library size) / (median library size over all samples), but that didn't work as expected, so I have my doubts. As far as I understand, the scaling factors are used in the model. Is the library size used separately, and if so, how do the two interact?

I'd also be thankful for any general input on the best normalization option for the situation described at the beginning.

limma limma voom normalization edger • 2.7k views

updated 7.4 years ago by Aaron Lun ★ 28k • written 7.4 years ago by biominer ▴ 10

score 1 · Answer 1 · 2017-11-30

In edgeR and voom, there exists the concept of the effective library size, i.e., the product of the library size and the normalization factor for each sample. Expression values are normalized by (effectively) dividing the count for each sample by its effective library size. Thus, if you set the normalization factors to 1, the effective library size is just equal to the library size, which means that you'll be normalizing by the library size.
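To make the arithmetic concrete, here is a minimal sketch in plain Python (not edgeR or limma code) of how the effective library size behaves. The library sizes and the function names are made up for illustration, and the counts-per-million helper ignores the small offsets that voom applies before taking logs. It also shows what happens with the ratio the question describes: because the normalization factor is multiplied by the library size, setting it to (library size) / (median library size) brings the library size into the product twice.

```python
from statistics import median


def effective_library_sizes(lib_sizes, norm_factors):
    """Effective library size = library size * normalization factor."""
    return [size * factor for size, factor in zip(lib_sizes, norm_factors)]


def counts_per_million(counts, eff_lib_size):
    """Scale one sample's counts by its effective library size."""
    return [1e6 * c / eff_lib_size for c in counts]


# Hypothetical library sizes for three samples.
lib_sizes = [10_000_000, 20_000_000, 5_000_000]

# Normalization factors of 1: the effective library size is
# simply the library size.
unit_factors = [1.0] * len(lib_sizes)
eff_unit = effective_library_sizes(lib_sizes, unit_factors)

# Factors set to library size / median library size: the library
# size now enters the product twice.
med = median(lib_sizes)
ratio_factors = [size / med for size in lib_sizes]
eff_ratio = effective_library_sizes(lib_sizes, ratio_factors)

print("factors = 1:     ", eff_unit)
print("factors = ratio: ", eff_ratio)
```

With factors of 1 the three samples are scaled by 10, 20 and 5 million; with the ratio factors they are scaled by 10, 40 and 2.5 million, so the deepest sample is over-corrected and the shallowest under-corrected.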

If many of your genes are expected to be DE in one direction, then you're in a difficult situation regarding normalization. As you can imagine, the library size will not provide an accurate representation of the bias between samples, because it is affected by the large-scale biological changes between your conditions (with or without cDNA quantification). If you want to use library size normalization, you must either:

- assume that the downregulation is fully cancelled out by the upregulation of the minority of genes, such that the library size should not change between conditions (which is pretty unlikely), or
- accept that you are testing for differential proportions rather than differential abundances of genes.
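The difference between proportions and abundances can be illustrated with a toy example. The sketch below is plain Python with invented numbers: four hypothetical genes, three of which are halved in the treated condition while the fourth is unchanged. It assumes that library-size-normalized values behave like each gene's share of the total, which is a simplification.

```python
import math


def proportions(abundances):
    """Each gene's share of the total, i.e. what library size scaling sees."""
    total = sum(abundances)
    return [a / total for a in abundances]


# Hypothetical true abundances: genes A-C are halved, gene D is unchanged.
control = {"geneA": 100, "geneB": 100, "geneC": 100, "geneD": 100}
treated = {"geneA": 50, "geneB": 50, "geneC": 50, "geneD": 100}

genes = list(control)
p_control = proportions([control[g] for g in genes])
p_treated = proportions([treated[g] for g in genes])

# log2 fold changes in true abundance versus in proportion of the library.
abundance_lfc = {g: math.log2(treated[g] / control[g]) for g in genes}
proportion_lfc = {
    g: math.log2(pt / pc) for g, pt, pc in zip(genes, p_treated, p_control)
}

for g in genes:
    print(
        f"{g}: abundance log2FC = {abundance_lfc[g]:+.2f}, "
        f"proportion log2FC = {proportion_lfc[g]:+.2f}"
    )
```

In this toy case the unchanged gene appears up-regulated in proportion terms, and the halved genes appear only mildly down-regulated, which is the sense in which library size normalization tests proportions rather than abundances.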

In single-cell RNA-seq data analyses, we usually handle situations like these by adding a constant amount of spike-in RNA to each cell, and using the coverage of (known non-DE) spike-ins for normalization. However, this seems harder to do for bulk experiments.
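As a rough sketch of the spike-in idea, the plain Python below derives one scaling factor per sample from the total spike-in count and applies it to an endogenous gene. The counts, the function names, and the choice to centre the factors on the mean spike-in total are all assumptions for illustration; this is not the implementation of any particular package.

```python
def spike_in_size_factors(spike_totals):
    """One factor per sample: spike-in total relative to the mean total."""
    mean_total = sum(spike_totals) / len(spike_totals)
    return [total / mean_total for total in spike_totals]


def normalize(counts, size_factor):
    """Divide a sample's endogenous counts by its spike-in size factor."""
    return [c / size_factor for c in counts]


# Hypothetical data: the same amount of spike-in RNA was added to both
# samples, but the treated sample yields twice as many spike-in reads.
spike_totals = [1000, 2000]      # control, treated
endogenous = [[500], [500]]      # one gene with identical raw counts

factors = spike_in_size_factors(spike_totals)
normalized = [normalize(c, f) for c, f in zip(endogenous, factors)]

print("size factors:", factors)
print("normalized:  ", normalized)
```

Here the gene has the same raw count in both samples, but after spike-in scaling the treated value is half the control value, so a global decrease is preserved rather than normalized away. This only holds if the spike-ins really were added in equal amounts and are not themselves affected by the treatment.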