Hi,
For RNA-seq data I have used DESeq2's rlog-transformed counts for exploratory plots and quality assessment of my dataset. But I didn't realize that it operates on normalized counts. I am afraid that this will "force" the samples to look more similar, e.g. in a boxplot of the counts for each sample. Am I right about this? And what is the reason for using normalized counts? Is it better to use log-transformed raw counts if one wants to check whether some samples deviate strongly from the others, e.g. in boxplots?
Jon
Thanks for your answer.
Now, this might not be an exploratory analysis, but to give a concrete example: I have single-cell transcriptome data, and I know that all the samples were sequenced to approximately equal depth. However, I have no idea how PCR bias has influenced the gene expression or what the distribution of reads across genes is. If I make a boxplot of log2-transformed counts I get this, whereas with rlog the samples look much more uniform. I know that single-cell data is perhaps not the best example, but I am afraid that using rlog conceals the variation? Or am I mistaken about the use of this transformation/normalization?
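For reference, this is roughly how I make the two plots (a minimal sketch in R; dds stands for my DESeqDataSet, and the object names are just placeholders):

library(DESeq2)

raw_counts <- counts(dds, normalized = FALSE)
rld <- rlog(dds, blind = TRUE)

par(mfrow = c(1, 2))
boxplot(log2(raw_counts + 1), las = 2, main = "log2(raw counts + 1)")
boxplot(assay(rld), las = 2, main = "rlog")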
The fact that you are explicitly talking about single-cell RNA-seq data is probably worth adding to your question :-)
Unfortunately I haven't had the opportunity to really dig into the universe of single-cell RNA-seq. The first concern I'd have with using DESeq2's rlog or vst transforms is that I'm not sure they are well suited to deal with the dropout effects one observes in single-cell RNA-seq data ... perhaps Mike or Wolfgang can chime in here with their thoughts.
If it were me, I'd first approach the analysis of single-cell RNA-seq data using Aaron Lun's F1000 workflow, and only deviate from that once I feel more comfortable with how the scRNA-seq data I get behaves.
hi,
The rlog won't perform well with data that deviates strongly from the negative binomial, which single-cell RNA-seq certainly does, because you often have genes that are mostly zeros but highly expressed in a minority of cells. The rlog will likely over-shrink the differences for these genes; it has to do with how the rlog is constructed.
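For illustration only, here is a rough way to see this on toy data (not real single-cell data; makeExampleDESeqDataSet just simulates negative binomial counts, and the spiked-in gene mimics the mostly-zero pattern):

library(DESeq2)
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 20)
# spike in a "dropout-like" gene: zeros in most cells, high counts in a few
counts(dds)[1, ] <- c(rep(0L, 16), 500L, 600L, 700L, 800L)
rld <- rlog(dds, blind = TRUE)
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)
# compare the spread of the transformed values for that gene
rbind(rlog = assay(rld)[1, ], vst = assay(vsd)[1, ])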
The vst() is just a monotonic function applied to normalized counts, so this is safer.
Or you could use normTransform(), which is log2(normalized counts + 1), perhaps with a higher pseudocount to help stabilize the variance (see meanSdPlot as in the DESeq2 vignette).
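Something along these lines (a sketch, assuming dds is your DESeqDataSet; the pseudocount of 5 is just an example value):

library(DESeq2)
library(vsn)  # provides meanSdPlot, as used in the DESeq2 vignette

vsd <- vst(dds, blind = TRUE)      # VST of the normalized counts
ntd <- normTransform(dds, pc = 5)  # log2(normalized counts + 5)

meanSdPlot(assay(vsd))  # the flatter the trend, the more stable the variance
meanSdPlot(assay(ntd))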
Re peculiarities of single cell data:
Re normalization in VST, rlog: conceptually, both of these only make sense after correction for technical biases such as sequencing depth; it is unclear how one would even define them otherwise. For raw-data QC, just use plain old log(x+1) or asinh(x).
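E.g., a minimal sketch in base R (raw_counts here is a placeholder for the unnormalized count matrix):

boxplot(log1p(raw_counts), las = 2, main = "log(counts + 1)")
boxplot(asinh(raw_counts), las = 2, main = "asinh(counts)")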