Hi all,
I am getting acquainted with the analysis of interaction terms using DESeq2 and I am having some difficulty interpreting the results.
I am simulating data with 3 genotypes and 2 conditions as in the vignette:
http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#interactions
I generated a simulated dataset comprising two sets of simulated genes, which I named (G1.1, G1.2,…, G1.10) and (G2.1, G2.2,…, G2.10).
Plots of the simulated data are presented in the Dropbox link:
https://www.dropbox.com/s/udoafy783uz2wve/Simulated_Data.pdf?dl=0
As described in the vignette, in the first set of simulated genes (G1.1, G1.2,…, G1.10), the condition effect is consistent across genotypes. Although condition A has a different baseline for I,II, and III, the condition effect is a log2 fold change of about 2 for each genotype. Using a model with an interaction term genotype:condition, the interaction terms for genotype II and genotype III should be nearly 0.
In the second set of simulated genes (G2.1, G2.2,…, G2.10), the condition effect is not consistent across genotype. Here the main condition effect (the effect for the reference genotype I) is again 2. However, this time the interaction terms should be around 1 for genotype II and -4 for genotype III.
I run DESeq2 using the code that I report below and I then applied the “results” function as described in Example 3 in its help page (Example 3: two conditions, three genotypes).
I obtained the following results.
# the interaction term for condition effect in genotype III vs genotype I.
# this tests if the condition effect is different in III compared to I
results(dds, name="genotypeIII.conditionB")
log2 fold change (MLE): genotypeIII.conditionB
Wald test p-value: genotypeIII.conditionB
DataFrame with 20 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
G1.1 1228.788 1.914718 0.07182542 26.65794 1.447982e-156 2.227664e-156
G1.2 1242.826 1.903526 0.07316239 26.01782 3.113072e-149 3.891339e-149
G1.3 1225.219 1.799119 0.07726639 23.28462 6.347051e-120 6.347051e-120
G1.4 1253.272 2.020963 0.07648760 26.42210 7.637409e-154 1.091058e-153
G1.5 1225.138 1.941133 0.07591755 25.56896 3.379010e-144 3.754455e-144
... ... ... ... ... ... ...
G2.6 820.8398 -1.929546 0.07612531 -25.34697 9.705092e-142 1.021589e-141
G2.7 816.8648 -2.106402 0.08039228 -26.20154 2.552091e-151 3.402788e-151
G2.8 814.0408 -2.006385 0.06825126 -29.39704 5.991026e-190 5.991026e-189
G2.9 801.4634 -2.033767 0.07095968 -28.66088 1.173338e-180 7.822253e-180
G2.10 814.8111 -2.143965 0.07620245 -28.13512 3.644536e-174 1.822268e-173
# the interaction term for condition effect in genotype III vs genotype II.
# this tests if the condition effect is different in III compared to II
results(dds, contrast=list("genotypeIII.conditionB", "genotypeII.conditionB"))
log2 fold change (MLE): genotypeIII.conditionB vs genotypeII.conditionB
Wald test p-value: genotypeIII.conditionB vs genotypeII.conditionB
DataFrame with 20 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
G1.1 1228.788 2.488612 0.07002106 35.54091 1.148245e-276 3.827482e-276
G1.2 1242.826 2.478157 0.07138223 34.71671 4.408288e-264 8.816576e-264
G1.3 1225.219 2.301317 0.07557406 30.45115 1.156639e-203 1.156639e-203
G1.4 1253.272 2.607632 0.07472876 34.89463 8.969406e-267 2.242351e-266
G1.5 1225.138 2.501794 0.07406920 33.77644 4.374322e-250 5.467902e-250
... ... ... ... ... ... ...
G2.6 820.8398 -2.424637 0.07427403 -32.64447 9.599401e-234 1.010463e-233
G2.7 816.8648 -2.585588 0.07869325 -32.85655 9.184451e-237 1.020495e-236
G2.8 814.0408 -2.547367 0.06625633 -38.44716 0.000000e+00 0.000000e+00
G2.9 801.4634 -2.462004 0.06894913 -35.70754 3.019791e-279 1.509896e-278
G2.10 814.8111 -2.555934 0.07431678 -34.39242 3.272752e-259 5.950458e-259
It is not clear to me what are the log2FoldChange and pvalue columns referring to; am I doing anything wrong? How can I find information about the interaction terms for genotype II and genotype III being nearly 0 for the first set of genes and around 1 for genotype II and -4 for genotype III for the second set of genes?
Thank you for your attention.
Best,
Daniele Muraro
Code:
library("DESeq2")
library("ggplot2")
npg <- 20
mu <- 2^c(8,10,9,11,10,12)
condition <- rep(rep(c("A","B"),each=npg),3)
genotype <- rep(c("I","II","III"),each=2*npg)
countData <- data.frame()
N_G1 = 10
for(i in 1:N_G1){
counts <- rnbinom(6*npg, mu=rep(mu,each=npg), size=1/.01)
countData <- rbind(countData, as.integer(counts))
}
colnames(countData) <- paste0("S",seq(1,dim(countData)[2]))
mu[4] <- 2^12
mu[6] <- 2^8
N_G2 = 10
for(i in 1:N_G2){
counts <- rnbinom(6*npg, mu=rep(mu,each=npg), size=1/.01)
countData <- rbind(countData, as.integer(counts))
}
Names_G1 <- paste0(rep("G1.", N_G1), seq(1,N_G1))
Names_G2 <- paste0(rep("G2.", N_G2), seq(1,N_G2))
rownames(countData) <- c(Names_G1, Names_G2)
colData <- as.data.frame(cbind(condition, genotype))
rownames(colData) <- colnames(countData)
dds <- DESeqDataSetFromMatrix(countData = countData,
colData = colData,
design = ~ genotype + condition + genotype:condition)
dds <- DESeq(dds, fitType='local')
plot_log2fc <- function(d, title) {
ggplot(d, aes(x=condition, y=log2c, group=genotype)) +
geom_jitter(size=1.5, position = position_jitter(width=.15)) +
facet_wrap(~ genotype) +
stat_summary(fun.y=mean, geom="line", colour="red", size=0.8) +
xlab("condition") + ylab("log2(counts+1)") + ggtitle(title)
}
log2c_all <- t(log2(counts(dds) + 1))
g_names = rownames(dds)
all_plots <- list()
pdf("Simulated_Data.pdf")
for(g_name in g_names){
log2c = log2c_all[,g_name]
d <- data.frame(log2c = log2c, condition = colData(dds)$condition, genotype = colData(dds)$genotype)
all_plots[[g_name]] <- plot_log2fc(d, g_name) + ylim(7,13)
print(all_plots[[g_name]])
}
graphics.off()
resultsNames(dds)
# the interaction term for condition effect in genotype III vs genotype I.
# this tests if the condition effect is different in III compared to I
results(dds, name="genotypeIII.conditionB")
# the interaction term for condition effect in genotype III vs genotype II.
# this tests if the condition effect is different in III compared to II
results(dds, contrast=list("genotypeIII.conditionB", "genotypeII.conditionB"))
Hi Michael,
Thank you very much for your helpful reply.
I have tried to include simulated null genes (G3.1,...,G3.100) and the p-values and log2 fold changes associated with the interaction terms are now what I would expect: the interaction terms for genotype II and genotype III are nearly 0 when calculated for the gene set G1.1,...G1.10 and around 1 for genotype II and -4 for genotype III when calculated for the gene set G2.1,...G2.10. Because of the character limit, I will report the updated code and plots in the next post.
Am I understanding well that defining the design
dds$group <- factor(paste0(dds$genotype, dds$condition))
design(dds) <- ~ group
and applying
results(dds, contrast=c("group", "IIIB", "IIB"))
is equivalent to define the design
design(dds) <- ~ genotype + condition + genotype:condition
and apply
results(dds, contrast=list("genotypeIII.conditionB", "genotypeII.conditionB")) ?
Thanks again.
Best,
Daniele
Below the updated code and plots.
Best,
Daniele
Updated Plots:
https://www.dropbox.com/s/dtu6h4fru5xl5mk/Simulated_Data_With_Null_Genes.pdf?dl=0
Updated Code:
library("DESeq2")
library("ggplot2")
npg <- 20
mu <- 2^c(8,10,9,11,10,12)
condition <- rep(rep(c("A","B"),each=npg),3)
genotype <- rep(c("I","II","III"),each=2*npg)
countData <- data.frame()
N_G1 = 10
for(i in 1:N_G1){
counts <- rnbinom(6*npg, mu=rep(mu,each=npg), size=1/.01)
countData <- rbind(countData, as.integer(counts))
}
colnames(countData) <- paste0("S",seq(1,dim(countData)[2]))
mu[4] <- 2^12
mu[6] <- 2^8
N_G2 = 10
for(i in 1:N_G2){
counts <- rnbinom(6*npg, mu=rep(mu,each=npg), size=1/.01)
countData <- rbind(countData, as.integer(counts))
}
mu <- 2^c(8,8,9,9,10,10)
N_G3 = 100
for(i in 1:N_G3){
counts <- rnbinom(6*npg, mu=rep(mu,each=npg), size=1/.01)
countData <- rbind(countData, as.integer(counts))
}
Names_G1 <- paste0(rep("G1.", N_G1), seq(1,N_G1))
Names_G2 <- paste0(rep("G2.", N_G2), seq(1,N_G2))
Names_G3 <- paste0(rep("G3.", N_G3), seq(1,N_G3))
rownames(countData) <- c(Names_G1, Names_G2, Names_G3)
colData <- as.data.frame(cbind(condition, genotype))
rownames(colData) <- colnames(countData)
dds <- DESeqDataSetFromMatrix(countData = countData,
colData = colData,
design = ~ genotype + condition + genotype:condition)
dds <- DESeq(dds, fitType='local')
# Plots of one sample for each simulated gene set
plot_log2fc <- function(d, title) {
ggplot(d, aes(x=condition, y=log2c, group=genotype)) +
geom_jitter(size=1.5, position = position_jitter(width=.15)) +
facet_wrap(~ genotype) +
stat_summary(fun.y=mean, geom="line", colour="red", size=0.8) +
xlab("condition") + ylab("log2(counts+1)") + ggtitle(title)
}
log2c_all <- t(log2(counts(dds) + 1))
g_names = c('G1.1','G2.1','G3.1')
all_plots <- list()
pdf("Simulated_Data_With_Null_Genes.pdf")
for(g_name in g_names){
log2c = log2c_all[,g_name]
d <- data.frame(log2c = log2c, condition = colData(dds)$condition, genotype = colData(dds)$genotype)
all_plots[[g_name]] <- plot_log2fc(d, g_name) + ylim(7,13)
print(all_plots[[g_name]])
}
graphics.off()
resultsNames(dds)
# the interaction term for condition effect in genotype II vs genotype I.
# this tests if the condition effect is different in II compared to I
res21 <- results(dds, name="genotypeII.conditionB")
res21[grepl('G1.', rownames(res21)),]
res21[grepl('G2.', rownames(res21)),]
# the interaction term for condition effect in genotype III vs genotype I.
# this tests if the condition effect is different in III compared to I
res31 <- results(dds, name="genotypeIII.conditionB")
res31[grepl('G1.', rownames(res31)),]
res31[grepl('G2.', rownames(res31)),]
# the interaction term for condition effect in genotype III vs genotype II.
# this tests if the condition effect is different in III compared to II
res32 <- results(dds, contrast=list("genotypeIII.conditionB", "genotypeII.conditionB"))
res32[grepl('G1.', rownames(res32)),]
res32[grepl('G2.', rownames(res32)),]
Thank you very much for your availability; it was very helpful.
Best,
Daniele