For starters, there's a problem with your code, in that calcNormFactors returns normalization factors. These are subtly different from the size factors of DESeq. In edgeR, the equivalent of the size factor is actually the effective library size, i.e., the product of the normalization factor and the library size for each sample. That is, the normalization factor captures sample-specific bias beyond differences in library size. (Yes, this is confusing, even for people who should know better, e.g., Davis and me.) So, if you want size factors, you should do:
library(edgeR)
nf_tmm <- calcNormFactors(raw_counts, method = "TMM") # normalization factors only
sf_tmm <- nf_tmm * colSums(raw_counts) # effective library sizes, i.e., edgeR's analogue of the size factor
Anyway, in general, it doesn't seem appropriate to normalize spike-in counts with size/normalization factors computed from endogenous genes. The latter account for total RNA content and will lead to over-normalization if applied to the former (which are not affected by RNA content, assuming you added the same amount of spike-in RNA to each cell). So, if you have a cell with lots of RNA, you'll have relatively higher coverage of endogenous genes, leading to a larger size factor that scales down the counts; but relatively lower coverage of the spike-in transcripts, whereby normalization scales down the spike-in counts even further. It's usually better to compute a separate set of size factors for the spike-in transcripts, as was done in the Brennecke et al. paper.
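For example, a minimal sketch of what I mean (this assumes your spike-in rows are named with an "ERCC" prefix; adjust the pattern for your annotation):

is_spike <- grepl("^ERCC", rownames(raw_counts)) # identify spike-in rows
spike_counts <- raw_counts[is_spike, ]
sf_spike <- colSums(spike_counts) # total spike-in coverage per cell
sf_spike <- sf_spike / mean(sf_spike) # centre the factors at unity
norm_spike <- t(t(spike_counts) / sf_spike) # spike-ins normalized by their own factors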
If you persevere with applying gene-based factors to all of the features, you'll scale up the large spike-in counts and scale down the small spike-in counts, thus inflating the variance of the spike-in transcripts. This is what you observe after RLE normalization, though I'm puzzled as to why it doesn't also occur after TMM normalization. I would suggest taking a closer look at the normalized spike-in expression values across cells, to see how the profile changes with RLE and TMM normalization (especially for extreme factors). Both methods start behaving strangely when there are a lot of zeros around, but they needn't be similar in their strangeness.
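One quick way to eyeball this, continuing with the objects from the code above (the plotting choices here are just a suggestion):

sf_rle <- calcNormFactors(raw_counts, method = "RLE") * colSums(raw_counts)
sf_rle <- sf_rle / mean(sf_rle) # centred RLE-based size factors
sf_tmm_c <- sf_tmm / mean(sf_tmm) # centred TMM-based size factors
norm_tmm <- t(t(raw_counts[is_spike, ]) / sf_tmm_c)
norm_rle <- t(t(raw_counts[is_spike, ]) / sf_rle)
boxplot(log2(norm_tmm + 1), main = "Spike-ins, TMM-derived factors") # one box per cell
boxplot(log2(norm_rle + 1), main = "Spike-ins, RLE-derived factors")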
P.S. If you ever want to do the trend fitting and significance calculations of Brennecke et al. without digging up their code from the supplementaries, you can use the technicalCV2 function in scran instead.
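Something along these lines should work, though check ?technicalCV2 for your scran version, as the exact arguments may differ:

library(scran)
out <- technicalCV2(raw_counts, is.spike = is_spike) # fit the trend to the spike-in CV2 values
head(out) # per-gene statistics, including p-values for excess variability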
P.P.S. This just came out, for anyone who's interested: http://f1000research.com/articles/5-2122/v1.
I don't know why the two methods disagree, but you might want to extract the normalization factors and examine them directly. Perhaps plot the log-ratio of the normalization factors against other covariates like the total count, to see if there is any apparent pattern.
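For instance, a rough sketch reusing the objects from the code above:

nf_rle <- calcNormFactors(raw_counts, method = "RLE")
plot(colSums(raw_counts), log2(nf_tmm / nf_rle),
    xlab = "Total count per cell", ylab = "log2(TMM factor / RLE factor)")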
To my knowledge, the difference is mainly due to the fact that TMM involves a pre-normalization by library size before the actual normalization by the calculated factor. Personally, I don't recommend normalizing single-cell data by library size directly, i.e., using CPM, RPKM/FPKM or TPM values for downstream quantitative analysis. This will definitely lead to misinterpretation of your data.
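To illustrate what I mean about the library-size step (a sketch only): edgeR's cpm() divides by the effective library size, i.e., the library size times the TMM factor, so the library-size scaling happens implicitly underneath the TMM adjustment.

library(edgeR)
y <- DGEList(counts = raw_counts)
y <- calcNormFactors(y, method = "TMM") # adds TMM factors on top of the library sizes
logcpm <- cpm(y, log = TRUE) # divides by effective library size = lib.size * norm.factors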
See doi:10.1016/j.cell.2012.10.012 and https://haroldpimentel.wordpress.com/2014/12/08/in-rna-seq-2-2-between-sample-normalization/ for more information.
Personally, I will only normalize my single-cell data by size factors derived from non-differentially-expressed genes, i.e., the ERCC spike-ins, which obviously contradicts the point raised in the Nature Methods paper you cited above. But this makes more sense to me, as a biologist.
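Concretely, something like this (a sketch using DESeq2's median-of-ratios method on the spike-in rows only; it assumes an "ERCC" prefix in the row names):

library(DESeq2)
is_ercc <- grepl("^ERCC", rownames(raw_counts))
sf <- estimateSizeFactorsForMatrix(raw_counts[is_ercc, ]) # size factors from spike-ins only
norm_counts <- t(t(raw_counts) / sf) # apply them to all genes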
Gary