I am working on a more complex design, but I think a basic 2-factor (each with two levels) captures the spirit of my question. Namely, I am interested in creating "reasonable" mock data to appropriately test my understanding of extracting the appropriate contrasts in DESeq2's results
method.
In the script below, I create some mock data using DESeq2's makeExampleDESeqDataSet
method. I then perturb the several genes by quite a bit, "spiking" in the effects I hope to find upon running it through a differential expression analysis.
My hypothetical setup takes 2 factors (genetic background and treatment) at two levels each:
background
factor has levels: sgCtrl, sgFootreatment
factor has levels: DMSO, T1
There are 3 replicates in each combination of levels for a total of 12 total samples.
I model using the main + interaction: ~ background + treatment + background:treatment
resultsNames(dds)
produces the following:
[1] "Intercept" "background_sgFoo_vs_sgCtrl"
[3] "treatment_T1_vs_DMSO" "backgroundsgFoo.treatmentT1"
Now, running results(dds, name='background_sgFoo_vs_sgCtrl')
, I recover the main effect on background
that I introduced, as expected. That is, the genes I perturbed are all near the top of the sorted list.
However, I cannot seem to recover any interaction term using res <- results(dds, name="backgroundsgFoo.treatmentT1")
. For reference, here are the raw counts for the gene where I introduced the interaction effect:
sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8
116 171 157 130 24 125 433 567
sample9 sample10 sample11 sample12
497 844 838 864
Even just visually, you can see the main background
effect on samples 7-12 and the additional effect on samles 10-12, which I thought should yield a very obvious interaction effect. Yet that particular gene gets raw p-value 0.017 and FDR of 0.81, despite me introducing a rather large perturbation. Is this just an artifact of my spiking procedure? Or am I missing something?
Question/clarification 2
Given the resultsNames
above, I can identify the following coefficients:
- B0 = intercept (representing sgCtrl + DMSO here)
- B1 = main effect of background (e.g. comparing sgFoo+DMSO vs. sgCtrl+DMSO)
- B2 = main effect of treatment (e.g. comparing sgCtrl+T1 vs. sgCtrl+DMSO)
- B3 = interaction effect (of T1 with sgFoo)
Then, if I was interested in the effect of treatment T1 in sgFoo versus treatment T1 in sgCtrl, I would have:
(B0+B1+B2+B3) - (B0 + B2) = B1+B3 and would use results(dds, contrast=c(0,1,0,1))
, correct? I recognize this has the main background
effect B1, so the interpretation is a little more nuanced than comparing genetic backgrounds both under DMSO condition.
Thanks for any help/advice/etc.!
Script (DESeq2_1.26.0 but can provide full sessionInfo if necessary):
library(DESeq2)
set.seed(0)
mock_dds <- makeExampleDESeqDataSet(n=10000, m=12)
# First extract the raw counts
raw_counts = counts(mock_dds)
print('Original raw counts:')
print(head(raw_counts))
# setup annotations
background = rep(c("sgCtrl","sgFoo"),each=6)
treatment = rep(rep(c("DMSO","T1"),each=3),2)
sample_annotations = cbind(background, treatment)
rownames(sample_annotations)<-paste0('sample', 1:12)
# By setup, there should be no true effects so far.
# For gene1, gene3, and gene5, add an effect due to the genetic background.
# To do this get the row mean for each of those genes and simply add that onto
# the sgFoo samples (for both DMSO and T1 treated conditions)
selected_genes = c('gene1', 'gene3', 'gene5')
subset_means = floor(rowMeans(raw_counts[selected_genes,]))
# create arrays for the samples we will eventually select:
sgFoo_samples = paste0('sample', 7:12)
sgFoo_plus_treated_samples = paste0('sample', 10:12)
# Add the perturbation to the sgFoo samples (sample7-sample12). Here, we add a 3-fold increase:
raw_counts[selected_genes, sgFoo_samples] = raw_counts[selected_genes, sgFoo_samples] + 3*subset_means
# Add an additional perturbation for only gene3 on the sgFoo+treated samples (sample10-sample12)
raw_counts['gene3', sgFoo_plus_treated_samples] = raw_counts['gene3', sgFoo_plus_treated_samples] + 3*subset_means['gene3']
print('Perturbed raw counts:')
print(head(raw_counts))
# Run DESEq2:
dds <- DESeqDataSetFromMatrix(countData = raw_counts,
colData = sample_annotations,
design = ~background+treatment+background:treatment)
dds <- DESeq(dds)
print(resultsNames(dds))
# To get the background effect imposed on sgFoo--
res <- results(dds, name='background_sgFoo_vs_sgCtrl')
resOrdered <- res[order(res$pvalue),]
print(head(resOrdered))
print('*****************')
# Check the interaction effect to see if gene3 pops up:
res <- results(dds, name="backgroundsgFoo.treatmentT1")
resOrdered <- res[order(res$pvalue),]
print(head(resOrdered))
print(resOrdered['gene3',])
I appreciate the perspective-- I was quite surprised at how large the effect size had to be. I do get a lot of questions regarding treatment effects that are very similar to this query and this makes it seem like it is almost impossible to detect them in the setting of typical 3 vs. 3 cell-line experiments.