Greetings,
I have been following the DESeq2 vignette to analyze a large number of RNA-seq samples. The size factors are comparable across samples. While investigating the various data transformations for visualization purposes, I found the VST (with blind = FALSE) more effective at stabilizing the variance over the mean than the log2-transformed normalized counts (with a pseudocount of +1). rlog seems to just keep running, which I assume is due to the large number of samples. Since I plan to incorporate even more samples, I was going to stick with the VST.
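For reference, here is a minimal sketch of the comparison I am describing, assuming a DESeqDataSet `dds` that has already been run through `DESeq()` (the object name and the design variable are placeholders of my own):

```r
library(DESeq2)
library(vsn)  # for meanSdPlot

# Variance-stabilizing transformation; blind = FALSE reuses the
# dispersions fitted with the design instead of re-estimating them blind.
vsd <- vst(dds, blind = FALSE)

# log2 of the size-factor-normalized counts with a pseudocount of 1;
# normTransform(dds) is equivalent to log2(counts(dds, normalized = TRUE) + 1).
ntd <- normTransform(dds)

# rlog is the third option, but it is slow with many samples:
# rld <- rlog(dds, blind = FALSE)

# Mean-SD plots to judge how well each transform stabilizes the variance.
meanSdPlot(assay(vsd))
meanSdPlot(assay(ntd))
```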
However, I noticed that with these particular data the transformation appears to eliminate all the zero values from my counts matrix. Comparison with the normalized counts suggests that these zeros are simply being scaled up to a common value (~3.5 in every case). Samples with higher counts are also scaled up, as would be expected, and biologically the results appear consistent between the log2-transformed normalized counts and the VST counts.
Heatmaps of specific genes of interest look highly similar between the two, the inter-sample distances make sense for both, and PCA shows that the samples group in a nearly identical and meaningful way regardless of the input used.
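The checks above were along these lines; a sketch assuming the same `dds` object, with "condition" standing in for a column of `colData(dds)`:

```r
library(DESeq2)
library(pheatmap)

vsd <- vst(dds, blind = FALSE)

# Sample-to-sample distances on the transformed values
sampleDists <- dist(t(assay(vsd)))
pheatmap(as.matrix(sampleDists))

# PCA of the samples, colored by a design variable
# ("condition" is a placeholder for an actual colData column)
plotPCA(vsd, intgroup = "condition")
```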
I found a previous post with a similar issue, though it wasn't definitively answered whether this is acceptable. I've gone through the vignette and the DESeq2 paper looking for insight, but I'm still not sure I fully understand what's happening here.
Thanks for your time.
Poor choice of words: I meant that samples with higher counts are still higher after the transformation. I was not correctly visualizing what the VST does to the lower counts. This helps a lot. Thanks very much, Michael.