Question

DEseq2: DESeq function not correcting for batch variable

0

Entering edit mode

jmudd1 • 0

@jmudd1-24289

Last seen 4.3 years ago

Hello,

I have a timecourse experiment where I am comparing gene expression profiles at pre-exposure and multiple timepoints post-infection. I also know that some of these samples needed an RNA clean-up while others did not. Thus, my colData contains two columns, condition (the timepoint column) and clean_up:

                  condition clean_up
RH_FR04_PRE   pre_infection       no
RH_GH99_PRE   pre_infection       no
RH_HB37_PRE   pre_infection       no
RH_HD09_PRE   pre_infection       no
RH_FR04_1dpi           1dpi       no
RH_GH99_1dpi           1dpi      yes
RH_HB37_1dpi           1dpi       no
RH_HD09_1dpi           1dpi       no
RH_FR04_3dpi           3dpi       no
RH_GH99_3dpi           3dpi       no
RH_HB37_3dpi           3dpi       no
RH_HD09_3dpi           3dpi       no
RH_FR04_7dpi           7dpi       no
RH_GH99_7dpi           7dpi       no
RH_HB37_7dpi           7dpi       no
RH_HD09_7dpi           7dpi       no
RH_FR04_14dpi         14dpi      yes
RH_GH99_14dpi         14dpi      yes
RH_HB37_14dpi         14dpi      yes
RH_HD09_14dpi         14dpi      yes
RH_FR04_21dpi         21dpi      yes
RH_GH99_21dpi         21dpi      yes
RH_HB37_21dpi         21dpi      yes
RH_HD09_21dpi         21dpi      yes

If I run DESeqDataSetFromMatrix and only focus on the condition column:

ddsMat <- DESeqDataSetFromMatrix(countData = countData.rounded.nonZero.nonLowVar, colData = coldata, design= ~ condition)

I then examine the rlog transformed dds object made from the DESeqDataSet my PCA looks like this

https://ibb.co/bNgZMft

Distinctions are very hard to make out. Moreover, I have a guess that the clean up may indeed be a variable that is affecting the results. For example, a plotMA of timepoints that needed clean-up seem to have many signficiant DEGs with very low read counts:

plotMA for timepoint comparison needing clean-up:

https://ibb.co/NpWJkT3

plotMA for timepoint where samples did not need clean up:

https://ibb.co/kgwMKMR

So....I go back and run DESeqDataSetFromMatrix but now trying to control for clean_up:

ddsMat <- DESeqDataSetFromMatrix(countData = countData.rounded.nonZero.nonLowVar, colData = coldata, design= ~ condition + clean_up)

However this gives me the exact SAME PCA plot:

https://ibb.co/bNgZMft

the inclusion of "clean up in the design option is making no difference.

In the DESeq2 manual, I also tried the R package sva which seems to infer batch effect in an unsupervised fashion:

dat  <- counts(dds, normalized = TRUE)
idx  <- rowMeans(dat) > 1
dat  <- dat[idx, ]
mod  <- model.matrix(~ condition, colData(dds))
mod0 <- model.matrix(~   1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv = 2)

ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + condition + clean_up
dds.sva=DESeq(ddssva)

So the design compares condition while controlling for SV1 SV2 and clean-up.

Again, this make no difference in any of the downstream functions for my dds object, including PCA, results, plotMA, or summary of the data.

My question is, is it possible in the data given to tell if I am doing something wrong in the pipeline? The inclusion of the "clean_up" variable I want to control for does not appear to make any difference, nor do columns caclulated by sva. Or alternatively, maybe they are being incorporated correclty but their effect is too small to make a difference in the data?

thanks so much.

deseq2 • 590 views

ADD COMMENT • link updated 4.3 years ago by Michael Love 43k • written 4.3 years ago by jmudd1 • 0

score 1 · Accepted Answer · 2020-10-05

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 23 hours ago

United States

Batch / “clean up” is nearly entirely confounded with the condition and therefore post hoc efforts at batch correction via inclusion of design factors will be ineffective.

Re: PCA and batch => see the FAQ in the vignette.

ADD COMMENT • link 4.3 years ago Michael Love 43k

0

Entering edit mode

Thank you Michael for getting back to me so quick.

Actually, when supplying the batch effect in the design it does indeed alter the outputs of the results function for my comparisons, and DEG numbers are more in line biologically with how I knew the study to proceed

One thing that I noticed is that the filter threshold specified in the results function varies widely between each timepoint. For example:

###day 1 post-infection
metadata(res.sva_1dpi)$filterThreshold
32.95918% 
 18.79688 

###day 3 post-infection
metadata(res.sva_3dpi)$filterThreshold
46.53061% 
 54.90648 

###day 7 post-infection
metadata(res.sva_7dpi)$filterThreshold
15.5102% 
 4.71706 

###day 14 post-infection
metadata(res.sva_14dpi)$filterThreshold
50.40816% 
 69.24121

###day 21 post-infection
metadata(res.sva_21dpi)$filterThreshold```
27.14286% 
 11.18829

My question is, is it more appropriate to manually set a threshold in some way so that it is consistent across timepoints? Reason I ask is that (from my understanding in the manual), by the automatic threshold, some timepoints will be more powered than others given the portion of genes being considered for differential expression varies across timepoint, and will thus affect the FDR of each comparison.

thanks again.

ADD REPLY • link 4.3 years ago jmudd1 • 0

1

Entering edit mode

You can filter at the beginning of the analysis eg by requiring X number of samples with a count of 10 or higher. And then set independentFiltering=FALSE.

ADD REPLY • link 4.3 years ago Michael Love 43k