For starters, there's a problem with your code, in that calcNormFactors returns normalization factors. These are subtly different from the size factors of DESeq. In edgeR, the equivalent of the size factor is actually the effective library size, i.e., the product of the normalization factor and the library size for each sample. That is, the normalization factor captures sample-specific bias beyond differences in library size. (Yes, this is confusing, even for people who should know better, e.g., Davis and me.) So, if you want size factors, you should do:
library(edgeR)
nf_tmm <- calcNormFactors(raw_counts, method = "TMM") # normalization factors only
sf_tmm <- nf_tmm * colSums(raw_counts) # effective library sizes, i.e., edgeR's analogue of the size factor
Anyway, in general, it doesn't seem appropriate to normalize spike-in counts with size/normalization factors computed from endogenous genes. The latter account for total RNA content and will lead to over-normalization if applied to the former (which are not affected by RNA content, assuming you added the same amount of spike-in RNA to each cell). So, if you have a cell with lots of RNA, you'll have relatively higher coverage of endogenous genes, leading to a larger size factor that scales down the counts; but relatively lower coverage of the spike-in transcripts, whereby normalization scales down the spike-in counts even further. It's usually better to compute a separate set of size factors for the spike-in transcripts, as was done in the Brennecke et al. paper.
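For example, a minimal sketch of what I mean (this assumes your spike-in rows are named with an "ERCC" prefix; adjust the pattern for your annotation):

is_spike <- grepl("^ERCC", rownames(raw_counts)) # identify spike-in rows
spike_counts <- raw_counts[is_spike, ]
sf_spike <- colSums(spike_counts) # total spike-in coverage per cell
sf_spike <- sf_spike / mean(sf_spike) # centre the factors at unity
norm_spike <- t(t(spike_counts) / sf_spike) # spike-ins normalized by their own factors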
If you persevere with applying gene-based factors to all of the features, you'll scale up the large spike-in counts and scale down the small spike-in counts, thus inflating the variance of the spike-in transcripts. This is what you observe after RLE normalization, though I'm puzzled as to why it doesn't also occur after TMM normalization. I would suggest taking a closer look at the normalized spike-in expression values across cells, to see how the profile changes with RLE and TMM normalization (especially for extreme factors). Both methods start behaving strangely when there are a lot of zeros around, but they needn't be similar in their strangeness.
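One quick way to eyeball this, continuing with the objects from the code above (the plotting choices here are just a suggestion):

sf_rle <- calcNormFactors(raw_counts, method = "RLE") * colSums(raw_counts)
sf_rle <- sf_rle / mean(sf_rle) # centred RLE-based size factors
sf_tmm_c <- sf_tmm / mean(sf_tmm) # centred TMM-based size factors
norm_tmm <- t(t(raw_counts[is_spike, ]) / sf_tmm_c)
norm_rle <- t(t(raw_counts[is_spike, ]) / sf_rle)
boxplot(log2(norm_tmm + 1), main = "Spike-ins, TMM-derived factors") # one box per cell
boxplot(log2(norm_rle + 1), main = "Spike-ins, RLE-derived factors")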
P.S. If you ever want to do the trend fitting and significance calculations of Brennecke et al. without digging up their code from the supplementaries, you can use the technicalCV2 function in scran instead.
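Something along these lines should work, though check ?technicalCV2 for your scran version, as the exact arguments may differ:

library(scran)
out <- technicalCV2(raw_counts, is.spike = is_spike) # fit the trend to the spike-in CV2 values
head(out) # per-gene statistics, including p-values for excess variability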
P.P.S. This just came out, for anyone who's interested: http://f1000research.com/articles/5-2122/v1.
I don't know why the two methods disagree, but you might want to extract the normalization factors and examine them directly. Perhaps plot the log-ratio of the normalization factors against other covariates like the total count, to see if there is any apparent pattern.
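For instance, a rough sketch reusing the objects from the code above:

nf_rle <- calcNormFactors(raw_counts, method = "RLE")
plot(colSums(raw_counts), log2(nf_tmm / nf_rle),
    xlab = "Total count per cell", ylab = "log2(TMM factor / RLE factor)")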
To my knowledge, the difference is mainly due to the fact that TMM involves a pre-normalization by library size before the actual normalization by the calculated factor. Personally, I don't recommend normalizing single-cell data by library size directly, i.e., using CPM, RPKM/FPKM or TPM values for downstream quantitative analysis. This will definitely lead to misinterpretation of your data.
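To illustrate what I mean about the library-size step (a sketch only): edgeR's cpm() divides by the effective library size, i.e., the library size times the TMM factor, so the library-size scaling happens implicitly underneath the TMM adjustment.

library(edgeR)
y <- DGEList(counts = raw_counts)
y <- calcNormFactors(y, method = "TMM") # adds TMM factors on top of the library sizes
logcpm <- cpm(y, log = TRUE) # divides by effective library size = lib.size * norm.factors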
See doi:10.1016/j.cell.2012.10.012 and https://haroldpimentel.wordpress.com/2014/12/08/in-rna-seq-2-2-between-sample-normalization/ for more information.
Personally, I will only normalize my single-cell data by size factors derived from non-differentially-expressed genes, i.e., the ERCC spike-ins, which obviously contradicts the point raised in the Nature Methods paper you cited above. But this makes more sense to me, as a biologist.
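Concretely, something like this (a sketch using DESeq2's median-of-ratios method on the spike-in rows only; it assumes an "ERCC" prefix in the row names):

library(DESeq2)
is_ercc <- grepl("^ERCC", rownames(raw_counts))
sf <- estimateSizeFactorsForMatrix(raw_counts[is_ercc, ]) # size factors from spike-ins only
norm_counts <- t(t(raw_counts) / sf) # apply them to all genes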
Gary