Hello fellow DESeq2 users,
I am confused about the design of an experiment and how I should implement it when using DESeq2. The experiment contains a total of 27 samples, with 3 conditions (1 control and 2 experimental) at 3 different time points. Each condition at a specific time point has 3 replicates. Initially, I loaded all of the samples into a DESeqDataSet object and ran DESeq() on this. For specific comparisons (between control and experimentals at same times), I used the 'contrast' argument with the 'results' function. This is basically the same as explained in the ?results and is as follows:
dds$group <- factor(paste0(dds$stage, dds$condition))
design(dds) <- ~ group
dds <- DESeq(dds)
resultsNames(dds)
res <- results(dds, contrast=c("group", "IIIB", "IIIA"))
summary(res) #condition effect for stage III between A and B
out of 40843 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 45, 0.11%
LFC < 0 (down) : 200, 0.49%
outliers [1] : 98, 0.24%
low counts [2] : 11798, 29%
(mean count < 6)
Reading through the vignette, I notice that using an interaction term may be appropriate for these comparisons as well. This will be to test if the 3 different condition effects differ across 3 stages.
dds <- DESeqDataSetFromHTSeqCount(sampleTable = sTable, directory = "", design = ~stage + condition + stage:condition)
dds$condition <- relevel(dds$condition, ref=cond[1])
res <- results(dds, alpha = 0.05, contrast = c("condition", "B", "A"))
summary(res) #condition effect for stage III between B and A
out of 40843 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 44, 0.11%
LFC < 0 (down) : 197, 0.48%
outliers [1] : 97, 0.24%
low counts [2] : 11014, 27%
(mean count < 4)
Even though I am doing the same comparison, I am getting different numbers using the 'summary()' command. So my question is - what are the key differences between using an interaction term in this case or just using the grouping variable? Specifically explaining what the second to last paragraph of the vignette means. I apologize for my limited statistical background.
Also, I know using all samples from an experiment is recommended given the information sharing properties of DESeq2 for dispersion estimates, etc., although I have seen cases in the examples when the DESeqDataSet has been subsampled to keep conditions at a specific time before running DESeq(). When I subsample for a specific time point and compare the control and experimental, I get less DE genes then running DESeq() without subsampling for a specific time point. Why would one do this? Don't we want to keep all the sampled in the DESeqDataSet and then use 'contrast' to get specific comparisons?
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] DESeq2_1.16.1 SummarizedExperiment_1.6.3
[3] DelayedArray_0.2.7 matrixStats_0.52.2
[5] Biobase_2.36.2 GenomicRanges_1.28.4
[7] GenomeInfoDb_1.12.2 IRanges_2.10.2
[9] S4Vectors_0.14.3 BiocGenerics_0.22.0
loaded via a namespace (and not attached):
[1] genefilter_1.58.1 locfit_1.5-9.1 splines_3.4.0
[4] lattice_0.20-35 colorspace_1.3-2 htmltools_0.3.6
[7] base64enc_0.1-3 blob_1.1.0 survival_2.41-3
[10] XML_3.98-1.9 rlang_0.1.1 DBI_0.7
[13] foreign_0.8-69 BiocParallel_1.10.1 bit64_0.9-7
[16] RColorBrewer_1.1-2 GenomeInfoDbData_0.99.0 plyr_1.8.4
[19] stringr_1.2.0 zlibbioc_1.22.0 munsell_0.4.3
[22] gtable_0.2.0 htmlwidgets_0.9 memoise_1.1.0
[25] latticeExtra_0.6-28 knitr_1.16 geneplotter_1.54.0
[28] AnnotationDbi_1.38.1 htmlTable_1.9 Rcpp_0.12.12
[31] acepack_1.4.1 xtable_1.8-2 scales_0.4.1
[34] backports_1.1.0 checkmate_1.8.3 Hmisc_4.0-3
[37] annotate_1.54.0 XVector_0.16.0 bit_1.1-12
[40] gridExtra_2.2.1 ggplot2_2.2.1 digest_0.6.12
[43] stringi_1.1.5 grid_3.4.0 tools_3.4.0
[46] bitops_1.0-6 magrittr_1.5 RSQLite_2.0
[49] lazyeval_0.2.0 RCurl_1.95-4.8 tibble_1.3.3
[52] Formula_1.2-2 cluster_2.0.6 Matrix_1.2-10
[55] data.table_1.10.4 rpart_4.1-11 nnet_7.3-12
[58] compiler_3.4.0
Hi Micheal, thanks for the quick response. I mistakenly got the output of the summary(res) mixed up, please note the change. The FAQ is very helpful for my latter question. Looking at the dispersion plots I can see the difference between using a subsample and the full dataset. It is also appropriate to use the full set of samples since there appears to be no huge difference in variability between time points.
When comparing the resulting gene sets of each test I there are two unique genes in the gene set resulting from using the interaction term (IT). These were excluded from the gene set resulting from using the grouping variable (GV) because one was marked as an outlier and the other was marked as filtered out due to independent filtering. Genes unique to the GV gene set occurred at the bottom of this gene set (ranked by padj) and simply did not make the .05 threshold in the IT gene set.