I have a multifactorial design containing an integer variable (‘age’) and a boolean factor (‘qr’) which represents treatment (TRUE) vs control (FALSE).
str(data.phenotypes)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 71 obs. of 3 variables:
$ Novogene_ID: chr "W01RDYL" "W03BLGR" "W03GR" "W03SL" ...
$ qr : Factor w/ 2 levels "FALSE","TRUE": 1 2 2 2 2 2 2 2 1 1 ...
$ age : int 20 11 23 25 24 24 24 23 9 23 ...
head(data.phenotypes)
# A tibble: 6 x 3
Novogene_ID qr age
<chr> <fct> <int>
1 W01RDYL FALSE 20
2 W03BLGR TRUE 11
3 W03GR TRUE 23
4 W03SL TRUE 25
5 W04BLOR TRUE 24
6 W04GRSL TRUE 24
str(data.gene.counts)
int [1:11313, 1:71] 198 185 557 496 118 435 760 4662 1159 68 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:11313] "gene4494" "gene4495" "gene4496" "gene4497" ...
..$ : chr [1:71] "W01RDYL" "W03BLGR" "W03GR" "W03SL" ...
>head(data.gene.counts[,1:5])
W01RDYL W03BLGR W03GR W03SL W04BLOR
gene4494 198 145 188 207 183
gene4495 185 85 114 161 145
gene4496 557 484 478 644 582
gene4497 496 412 402 544 449
gene4490 118 166 119 159 98
gene4491 435 390 424 641 478
Running DESeq2 using a simple additive design (~ age + qr) shows a large number of genes that are differentially expressed with age and a smaller number that are differentially expressed between the two conditions for qr, both of which results are expected:
dds = DESeqDataSetFromMatrix(countData = data.gene.counts,
colData = data.phenotypes,
design = as.formula("~ age + qr"))
dds.deg = DESeq(dds)
results = resultsNames(dds.deg)
resultstable = matrix(nrow = length(results), ncol = 1, dimnames = list(results,"DEGs"))
#loop through results and extract significant DEGs for each model term
for(i in 1:length(results)){
res = results(dds.deg,
name = results[i],
alpha = 0.05)
DEGs = (length(na.omit(which(res$padj<0.05))))
resultstable[results[i],"DEGs"] = DEGs
}
resultstable
DEGs
Intercept 10281
age 1683
qr_TRUE_vs_FALSE 283
What I'm finding confusing are the results when I add an interactions term: (~ age + gr + age:qr). We have strong a priori reason to think that there should be a large number of genes that respond differently to age in the treatment condition vs the control condition. If I understand correctly, this would be evidenced by a large number of genes responding to the age:qr interaction term. However, this isn't the result we see is very different from these expectations:
dds = DESeqDataSetFromMatrix(countData = data.gene.counts,
colData = data.phenotypes,
design = as.formula("~ age + qr + age:qr"))
dds.deg = DESeq(dds)
results = resultsNames(dds.deg)
resultstable2 = matrix(nrow = length(results), ncol = 1, dimnames = list(results,"DEGs"))
#loop through results and extract significant DEGs for each model term
for(i in 1:length(results)){
res = results(dds.deg,
name = results[i],
alpha = 0.05)
DEGs = (length(na.omit(which(res$padj<0.05))))
resultstable2[results[i],"DEGs"] = DEGs
}
resultstable2
DEGs
Intercept 10122
age 396
qr_TRUE_vs_FALSE 1
age.qrTRUE 0
Two things confuse me about these results:
No genes are identified as responding significantly to the interaction term. While it's possible that our hypothesis is simply wrong and there's genuinely no interaction between qr and age, this seems really surprising given our other analyses. For example, we observe a strongly age-contingent phenotypic response in the treatment condition that doesn't exist in the control condition, and it would be surprising if this weren't reflected at all in the gene expression data. I'd expect some genes to respond to the interaction, even if it were only a small number, which leads me to think that maybe I've somehow constructed or interpreted the model wrongly.
The number of genes identified as responding to age and to qr individually is greatly reduced after the addition of the interaction term. I appreciate that the interaction term is soaking up some of the DFs available, which is going to cost us some power to detect differential expression. However, it's strange to me that the addition of a single interaction term seems to completely remove our ability to detect genes that are differentially expressed between treatment and control conditions. Again, this makes me wonder whether something is going on with the model that I've failed to grasp.
Can anyone help me here? I can't for the life of me understand how I'm getting such radically different results simply from the addition of a single interaction term.
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3 purrr_0.3.3 readr_1.3.1 tidyr_1.0.0
[7] tibble_2.1.3 ggplot2_3.2.1 tidyverse_1.2.1 DESeq2_1.24.0 SummarizedExperiment_1.14.1 DelayedArray_0.10.0
[13] BiocParallel_1.18.1 matrixStats_0.55.0 Biobase_2.44.0 GenomicRanges_1.36.1 GenomeInfoDb_1.20.0 IRanges_2.18.3
[19] S4Vectors_0.22.1 BiocGenerics_0.30.0 edgeR_3.26.8 limma_3.40.6 lmerTest_3.1-0 lme4_1.1-21
[25] Matrix_1.2-17