Question

SVAseq and DESeq2 workflow

0

Entering edit mode

capricygcapricyg ▴ 10

@capricygcapricyg-17892

Last seen 2.4 years ago

United States

Dear Bioconductor Team,

We sequenced samples A with 4 replicates and B with 4 replicates last year, and C with 3 replicates this year. Since there are no overlapped samples, I can't include batch effect in the DESeq model. So I tried to run SVAseq, and got 9 SVs.

However, the online workflow https://www.bioconductor.org/packages/devel/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html, does not calculate the number of SVs, only used 2.

My question is: should I go with 2 or 9 for the subsequent DESeq2 analysis here?

Thanks for any inputs!

C.

sva deseq2 • 7.2k views

ADD COMMENT • link updated 6.1 years ago by Michael Love 42k • written 6.1 years ago by capricygcapricyg ▴ 10

score 1 · Answer 1 · 2018-10-19

1

Entering edit mode

Michael Love 42k

@mikelove

Last seen 14 hours ago

United States

You can use the example code in the workflow with any number of surrogate variables (or FUVs etc).

The number of SVs seems too high here. 9 SVs for 11 samples?

ADD COMMENT • link 6.1 years ago Michael Love 42k

0

Entering edit mode

HI, Michael,

Thank you very much for the response! Sounds like you don't feel it is too high? I saw your example with 8 samples from 4 cell lines and 2 dex conditions, but only specified 2 SVc. I have 3 conditions, A, B and C, but SVAseq calculated 9 SVs. So I am very concerned. Also I read some discussion that suggested that most variance could be caught in the first two SVs... What do you think?

C.

ADD REPLY • link 6.1 years ago capricygcapricyg ▴ 10

0

Entering edit mode

I think it is too high, but I’ll leave it to the SVA developers to comment.

ADD REPLY • link 6.1 years ago Michael Love 42k

0

Entering edit mode

Hi, Michael,

I actually have a question about your workflow:

you plotted SV values versus cell line (stripchart(svseq$sv[, i] ~ dds$cell, vertical = TRUE, main = paste0("SV", i))).

So what do you expect from these figurse? How do you evaluate if the cell line effects are correctly adjusted?

Some of my SVs look like below (image link: https://imgur.com/8OIR76I). Here, "1" and "2" correspond to the batch of last year and batch of this year, respectively. Batch 2 are generally more spreading. Batch 1 pattern changes more as I plotted along.

Do you think they are normal?

ADD REPLY • link 6.1 years ago capricygcapricyg ▴ 10

0

Entering edit mode

I wouldn’t typically estimate batch if I had the batches annotated. I think we have some sentence about this in the workflow. We “pretend” as if we didn’t have batch annotation and see if we could estimate it. So that plot is particular in that you normally wouldn’t estimate batch if you already had it annotated.

ADD REPLY • link 6.1 years ago Michael Love 42k

0

Entering edit mode

My issue here is that, unlike your example, the two batches contains completely different samples. I can't use the annotated batch in the DESeq2 design matrix. So I went to follow your section 8, to take the "sequencing year" as some hidden and unwanted effects. Since there are many SVs, I just wonder if I need to select some of them based on the plots before final DE call is performed...

ADD REPLY • link 6.1 years ago capricygcapricyg ▴ 10

0

Entering edit mode

Oh sorry I missed that. You’re stuck a bit. If you try to control for batch, even by estimating hidden variables, you will potentially remove the biological signal. Meanwhile you won’t be able to know what you see if biological or technical.

ADD REPLY • link 6.1 years ago Michael Love 42k

0

Entering edit mode

I wonder what you would go with such type of data?

ADD REPLY • link 6.1 years ago capricygcapricyg ▴ 10

0

Entering edit mode

Rather than try to estimate the technical effects using factor analysis (which doesn’t work here because of confounding) you can instead try to remove technical artifacts with per sample bias correction. Two methods that work on count matrices are RUVSeq and cqn, both Bioconductor packages. To use these you need to do some work compiling the gene features: GC and average transcript length. I don’t have code for this in the workflow.

If you were starting from reads, I would recommend Salmon with GC bias correction followed by tximport. This pipeline accounts for per sample biases automatically by passing along offsets.

ADD REPLY • link 6.1 years ago Michael Love 42k

0

Entering edit mode

Thank you very much, Michael! I will try these packages and see how the results look like...

ADD REPLY • link 6.1 years ago capricygcapricyg ▴ 10