Hello Everyone,
I'm currently analyzing a series of RNA-Seq experiments that are splitted into 4 runs. In order to account for the batch effect of each experiments, I'm trying to correct my gene counts using sva, as recommended inside the Bioconductor RNASeq workflow (http://www.bioconductor.org/help/workflows/rnaseqGene/#removing-hidden-batch-effects)
As I proceed in this phase, I got very confused regarding the methods and the procedures, so I wonder if you could answer me to these questions:
1) Do I need to account for the batch effect when I specify the design in the dds creation? e.g.: assuming that I'm screening for the differences between two different cell types, should I say design = ~type + batch or just design = type? Why?
2)in the RNASeq bioconductor workflow, when exporting the counts matrix to sva, it assumed that the data has already been normalized. Specifically, counts are exported in this fashion:
dat <- counts(dds, normalized=TRUE
This seems pretty confusing to me, as my understanding is that it is needed to remove the batch effect before performing the normalization. Is this true for DeSeq2? Do I need only to Estimate the size Factor before performing sva normalization? Do you have code examples that you can share?
3) Do you have examples other than the one reported in the tutorial for the batch effect removal? Are there any other bioConductor libraries that integrates the batch effect normalization into the DeSeq2 Workflow?
Apologies if the questions sounds naive, I'm a self learner and I think I still have a long road to walk!
Thanks in advance,
Daniele
I mostly agree but you should use only scaled counts -- for example,
counts(dds, normalized=TRUE)
-- with svaseq, and not anything on the log scale. Note this line of code:https://github.com/Bioconductor-mirror/sva/blob/master/R/svaseq.R#L36
The reason to use scaled counts is because DESeq() takes care of size factor estimation itself, so it doesn't make sense for svaseq to use up the surrogate variables estimating library size correction.
Oh, just to clarify, I meant that my preference was to use rlog and then pass the resulting normalized, regularized logCPM values to sva instead of svaseq. (As far as I can tell, the only difference between the sva and svaseq functions is the log transform in svaseq.)
Oh, I get it, I didn't read carefully enough.