Hello,
I have a question about setting up interaction terms in DESeq2 following the recent update. My experiment consists of 42 total samples, from 14 donors, across three different cell populations, and two different disease states (healthy control vs RA). My sample block is as follows:
patient population disease x1 CM HC x1 EM HC x1 Na HC x2 CM HC x2 EM HC x2 Na HC
.....
I want to understand the differences between disease and healthy samples. My design is:
~disease + disease:population + disease:patient
Previously, I used the following contrast but received the following error:
> res<-results(dds, contrast=list(c("diseaseRA","diseaseRA.populationCM"),c("diseaseHC","diseaseHC.populationCM"))) Error in checkContrast(contrast, resNames) : all elements of the contrast as a list of length 2 should be elements of 'resultsNames(object)'
I saw in a previous thread that I can set up a numerical contrast as well. So I tried this approach:
n=model.matrix(~disease+disease:population+disease:patient,samples) ctrst = colMeans(n[samples$disease == "RA" & samples$population == "CM",]) - colMeans(n[samples$disease == "HC" & samples$population == "CM",]) res<-results(dds, contrast=ctrst)
This new model worked, but the results were slightly off from my run in a previous DESeq2 version:
Current run:
baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> chr1_713980_714315 7.711915 0.3961093 0.4481465 0.8838834 3.767592e-01 0.585865655 chr1_840122_840415 17.376991 0.4691477 0.2750717 1.7055470 8.809246e-02 0.241106534 chr1_856580_856838 8.218786 1.0523656 0.4238437 2.4829094 1.303142e-02 0.072174032
Previous run:
baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> chr1_713980_714315 7.711915 0.3549742 0.4386707 0.8092043 4.183976e-01 0.62903224 chr1_840122_840415 17.376991 0.4412814 0.2698125 1.6355112 1.019419e-01 0.27137040 chr1_856580_856838 8.218786 1.0165040 0.4156541 2.4455524 1.446304e-02 0.08100218
Am I doing the contrast correctly? If so, why are the values slightly different from my previous run? Is it because DESeq2 now turns off log2FC shrinkage for all terms if an interaction term is present? Thank you in advanced for the help. My session info is attached below.
Best,
Dave
> sessionInfo() R version 3.2.4 Revised (2016-03-16 r70336) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.4 LTS locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] grid parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] cluster_2.0.3 BiocParallel_1.4.3 VennDiagram_1.6.16 futile.logger_1.4.1 ggplot2_2.1.0 plyr_1.8.3 [7] RColorBrewer_1.1-2 gplots_2.17.0 DESeq2_1.10.1 RcppArmadillo_0.6.600.4.0 Rcpp_0.12.3 SummarizedExperiment_1.0.2 [13] Biobase_2.30.0 GenomicRanges_1.22.4 GenomeInfoDb_1.6.3 IRanges_2.4.8 S4Vectors_0.8.11 BiocGenerics_0.16.1 loaded via a namespace (and not attached): [1] XVector_0.10.0 bitops_1.0-6 futile.options_1.0.0 tools_3.2.4 zlibbioc_1.16.0 rpart_4.1-10 RSQLite_1.0.0 annotate_1.48.0 [9] gtable_0.2.0 lattice_0.20-33 DBI_0.3.1 gridExtra_2.2.1 genefilter_1.52.1 caTools_1.17.1 gtools_3.5.0 locfit_1.5-9.1 [17] nnet_7.3-12 AnnotationDbi_1.32.3 XML_3.98-1.4 survival_2.38-3 foreign_0.8-66 gdata_2.17.0 latticeExtra_0.6-28 Formula_1.2-1 [25] geneplotter_1.48.0 lambda.r_1.1.7 Hmisc_3.17-2 scales_0.4.0 splines_3.2.4 xtable_1.8-2 colorspace_1.2-6 labeling_0.3 [33] KernSmooth_2.23-15 acepack_1.3-3.3 munsell_0.4.3
There is a statistical problem with trying to control for patient effects here using fixed effects while comparing directly across different groups of patients, which is that these terms are collinear in the linear model.
Here is a more simplified case of trying to compare control and disease:
patient 1 (Ctrl), patient 2 (Ctrl), patient 3 (Dis), patient 4 (Dis).
You cannot control for patient effects (1,2,3,4) using fixed effects in a linear model and simultaneously compare across disease and control. The contrast (3+4) - (1+2) is the same as (Dis) - (Ctrl), so you can see that there cannot be a unique solution to the linear model.
There are two options for going forward, which I recommend in cases like this.
You can use limma-voom, which has a function duplicateCorrelation() that allows you to include the patient information as a known correlation in the model.
Or with DESeq2 you can use a design which doesn't include the patient effect, but will let you compare RA vs HC within each population. And here you are just not informing the model of the correlation between the 3 population samples from the same patient. The approach you should take for DESeq2 is to combine disease and population into a new factor (say, "group") using factor() and paste0() as can be seen in the vignette, and then use a design ~group. Then you can contrast RACM vs HCCM easily using character-style contrasts.
It's up to you which approach you go for. You may want to consult with a local statistician who can describe the approaches in more detail.
I see, thanks for the clarification Michael. I will rethink how I design this specific model. I appreciate the help.