Dear Bioconductor Team,
We sequenced samples A with 4 replicates and B with 4 replicates last year, and C with 3 replicates this year. Since there are no overlapped samples, I can't include batch effect in the DESeq model. So I tried to run SVAseq, and got 9 SVs.
However, the online workflow https://www.bioconductor.org/packages/devel/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html, does not calculate the number of SVs, only used 2.
My question is: should I go with 2 or 9 for the subsequent DESeq2 analysis here?
Thanks for any inputs!
C.
HI, Michael,
Thank you very much for the response! Sounds like you don't feel it is too high? I saw your example with 8 samples from 4 cell lines and 2 dex conditions, but only specified 2 SVc. I have 3 conditions, A, B and C, but SVAseq calculated 9 SVs. So I am very concerned. Also I read some discussion that suggested that most variance could be caught in the first two SVs... What do you think?
C.
I think it is too high, but I’ll leave it to the SVA developers to comment.
Hi, Michael,
I actually have a question about your workflow:
you plotted SV values versus cell line (stripchart(svseq$sv[, i] ~ dds$cell, vertical = TRUE, main = paste0("SV", i))).
So what do you expect from these figurse? How do you evaluate if the cell line effects are correctly adjusted?
Some of my SVs look like below (image link: https://imgur.com/8OIR76I). Here, "1" and "2" correspond to the batch of last year and batch of this year, respectively. Batch 2 are generally more spreading. Batch 1 pattern changes more as I plotted along.
Do you think they are normal?
I wouldn’t typically estimate batch if I had the batches annotated. I think we have some sentence about this in the workflow. We “pretend” as if we didn’t have batch annotation and see if we could estimate it. So that plot is particular in that you normally wouldn’t estimate batch if you already had it annotated.
My issue here is that, unlike your example, the two batches contains completely different samples. I can't use the annotated batch in the DESeq2 design matrix. So I went to follow your section 8, to take the "sequencing year" as some hidden and unwanted effects. Since there are many SVs, I just wonder if I need to select some of them based on the plots before final DE call is performed...
Oh sorry I missed that. You’re stuck a bit. If you try to control for batch, even by estimating hidden variables, you will potentially remove the biological signal. Meanwhile you won’t be able to know what you see if biological or technical.
I wonder what you would go with such type of data?
Rather than try to estimate the technical effects using factor analysis (which doesn’t work here because of confounding) you can instead try to remove technical artifacts with per sample bias correction. Two methods that work on count matrices are RUVSeq and cqn, both Bioconductor packages. To use these you need to do some work compiling the gene features: GC and average transcript length. I don’t have code for this in the workflow.
If you were starting from reads, I would recommend Salmon with GC bias correction followed by tximport. This pipeline accounts for per sample biases automatically by passing along offsets.
Thank you very much, Michael! I will try these packages and see how the results look like...