Hi,
Recently we have a list of RNAseq samples with different batch, age, gender and cell type. We want to get the normalized values of reads count using the variable stabilize transformation in DESeq2. Ideally we want to keep the difference between cell type, but remove the effects from the other covariances (e.g. batch, age, gender).
Two questions regarding this:
1. Should we use blind=TRUE for transformation?
2. If not, how do we design the formula?
I’ve read your vignettes and some relevant posts, such as
https://stat.ethz.ch/pipermail/bioconductor/attachments/20140124/67969d30/attachment.pl
https://stat.ethz.ch/pipermail/bioconductor/2014-February/057965.html
For question #1, I think I am pretty sure that I should use blind=FALSE (in order to keep the difference between cell types).
But then, which formula should I use for the design?
~ cellType
or
~ batch + age + sex + cellType
What’s the difference between them for VST result?
If I use the full formula "~ batch + age + sex + cellType”, will the VST internally remove effects of batch + age + sex, but keep difference for cell type?
Please advise.
Thanks,
Xianjun
----
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] DESeq2_1.2.10 RcppArmadillo_0.4.450.1.0
[3] Rcpp_0.11.1 GenomicRanges_1.14.4
[5] XVector_0.2.0 IRanges_1.20.7
[7] BiocGenerics_0.8.0
loaded via a namespace (and not attached):
[1] annotate_1.40.1 AnnotationDbi_1.24.0 Biobase_2.22.0
[4] DBI_0.3.1 genefilter_1.44.0 grid_3.0.2
[7] lattice_0.20-23 locfit_1.5-9.1 RColorBrewer_1.0-5
[10] RSQLite_0.11.4 splines_3.0.2 stats4_3.0.2
[13] survival_2.37-7 XML_3.98-1.1 xtable_1.7-4
hi Xianjun,
"We want to get the normalized values of reads count using the variable stabilize transformation in DESeq2"
For what purpose do you want to normalize the counts: for example, is it for visualization of the results, or for performing quality assessment, or for performing some downstream analysis like clustering?
Also a note: your version of DESeq2 is from 1 year ago, and we recommend to use the latest versions (for DESeq2 that is 1.6).
Hi Mike,
Thanks for quick response (and note on the version).
We want to use the normalized counts for both visualizing/QC and downstream analysis. (I know to run DEseq() we don't need to input the normalized count). The downstream analysis we want to do is similar to DEseq, but simpler -- (1) define 'on' and 'off' genes by setting a cutoff according to the distribution of VST result; (2) define the 'part catalog' by doing this for each cell type. To take a mean for samples from the same cell types, we need to do cross-sample normalization. We also want to adjust for the batch, sex, age etc. That's why I need the normalized count. Does this make sense to you?
Of course, we will also run DEseq to get the differentially expressed genes after that.
hi Xianjun,
Note that the variance stabilizing transformation (and rlog tranformation) does not internally remove any effects from the log normalized counts. (See the DESeq manuscript for the mathematical description of the VST or the description in the DESeq2 subdirectory vignettes/vst.pdf.) The design of the DESeqDataSet, when blind=FALSE, is used only to estimate the gene-wise dispersions, which are used only to fit the experiment-wide dispersion-mean trend. The transformations use the information of the experiment-wide trend to transform the counts in such a way that the within group variance will be approximately stabilized across the range of counts.
We suggest blind=TRUE, when the transformation will be used to potentially identify outliers or when doing QC. However, especially when there are many differences in counts which can be accounted for by terms in the design, blind=FALSE is a better choice for visualization and downstream analysis. And it's not so much of a concern to use blind=FALSE, as the only information used from the steps involving the design is the experiment-wide trend of dispersion over mean, not the individual gene-wise estimates. So, use the full design for the VST.
In order to subtract effects of covariates, you can apply limma's removeBatchEffect() to the transformed count matrix.
Note that there are a number of factors influencing the count of reads for a gene, including the gene's length. So for something like an expression cutoff, I wouldn't assume this cutoff should be the same across genes. I don't have any more concrete suggestions for this part of the analysis though.
Also, I'd recommend binning 'age', instead of using it as a numeric predictor.
Thanks a lot for the comments!
So for something like an expression cutoff, I wouldn't assume this cutoff should be the same across genes.
I assumed the VST result is already sort of normalized count and comparable between genes. Am I wrong?
Yes, this is wrong. The VST transforms to a log2-like count which stabilizes the variance, and additionally accounts for library size differences of the samples. The VST does nothing to correct for different genes having different length.
I got it. Thanks.