Hello,
I have a timecourse experiment where I am comparing gene expression profiles at pre-exposure and multiple timepoints post-infection. I also know that some of these samples needed an RNA clean-up while others did not. Thus, my colData contains two columns, condition (the timepoint column) and clean_up:
condition clean_up
RH_FR04_PRE pre_infection no
RH_GH99_PRE pre_infection no
RH_HB37_PRE pre_infection no
RH_HD09_PRE pre_infection no
RH_FR04_1dpi 1dpi no
RH_GH99_1dpi 1dpi yes
RH_HB37_1dpi 1dpi no
RH_HD09_1dpi 1dpi no
RH_FR04_3dpi 3dpi no
RH_GH99_3dpi 3dpi no
RH_HB37_3dpi 3dpi no
RH_HD09_3dpi 3dpi no
RH_FR04_7dpi 7dpi no
RH_GH99_7dpi 7dpi no
RH_HB37_7dpi 7dpi no
RH_HD09_7dpi 7dpi no
RH_FR04_14dpi 14dpi yes
RH_GH99_14dpi 14dpi yes
RH_HB37_14dpi 14dpi yes
RH_HD09_14dpi 14dpi yes
RH_FR04_21dpi 21dpi yes
RH_GH99_21dpi 21dpi yes
RH_HB37_21dpi 21dpi yes
RH_HD09_21dpi 21dpi yes
If I run DESeqDataSetFromMatrix and only focus on the condition column:
ddsMat <- DESeqDataSetFromMatrix(countData = countData.rounded.nonZero.nonLowVar, colData = coldata, design= ~ condition)
I then examine the rlog transformed dds object made from the DESeqDataSet my PCA looks like this
Distinctions are very hard to make out. Moreover, I have a guess that the clean up may indeed be a variable that is affecting the results. For example, a plotMA of timepoints that needed clean-up seem to have many signficiant DEGs with very low read counts:
plotMA for timepoint comparison needing clean-up:
plotMA for timepoint where samples did not need clean up:
So....I go back and run DESeqDataSetFromMatrix but now trying to control for clean_up:
ddsMat <- DESeqDataSetFromMatrix(countData = countData.rounded.nonZero.nonLowVar, colData = coldata, design= ~ condition + clean_up)
However this gives me the exact SAME PCA plot:
the inclusion of "clean up in the design option is making no difference.
In the DESeq2 manual, I also tried the R package sva which seems to infer batch effect in an unsupervised fashion:
dat <- counts(dds, normalized = TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx, ]
mod <- model.matrix(~ condition, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv = 2)
ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + condition + clean_up
dds.sva=DESeq(ddssva)
So the design compares condition while controlling for SV1 SV2 and clean-up.
Again, this make no difference in any of the downstream functions for my dds object, including PCA, results, plotMA, or summary of the data.
My question is, is it possible in the data given to tell if I am doing something wrong in the pipeline? The inclusion of the "clean_up" variable I want to control for does not appear to make any difference, nor do columns caclulated by sva. Or alternatively, maybe they are being incorporated correclty but their effect is too small to make a difference in the data?
thanks so much.
Thank you Michael for getting back to me so quick.
Actually, when supplying the batch effect in the design it does indeed alter the outputs of the results function for my comparisons, and DEG numbers are more in line biologically with how I knew the study to proceed
One thing that I noticed is that the filter threshold specified in the results function varies widely between each timepoint. For example:
My question is, is it more appropriate to manually set a threshold in some way so that it is consistent across timepoints? Reason I ask is that (from my understanding in the manual), by the automatic threshold, some timepoints will be more powered than others given the portion of genes being considered for differential expression varies across timepoint, and will thus affect the FDR of each comparison.
thanks again.
You can filter at the beginning of the analysis eg by requiring X number of samples with a count of 10 or higher. And then set independentFiltering=FALSE.