I have been testing how different design matrices, specified via R's formula interface, affect the results of a differential expression analysis with DESeq2. I specified two models that, as far as I understand, should decompose the variance in the data in equivalent ways, and then compared the coefficients from the two models that should have the same interpretation. The numerical values of the coefficients (log2FoldChange) and the P values show minor differences. Since I could be wrong in my statistical claims (I'm not a statistician), I'd like to double-check here whether this is indeed just a numerical issue or whether I have missed something.
As a toy example, I'm using the same "pasilla" dataset used in the DESeq2 tutorial. The sample metadata are shown below. The group variable was created manually by combining condition and type:
           condition   type           group
treated1     treated single   treatedsingle
treated2     treated paired   treatedpaired
treated3     treated paired   treatedpaired
untreated1 untreated single untreatedsingle
untreated2 untreated single untreatedsingle
untreated3 untreated paired untreatedpaired
untreated4 untreated paired untreatedpaired
In the code snippet below I fit two models, one with interacting condition and type terms, and the other simply with group, which I think should be equivalent. The base levels of the factors were also specified manually, which allowed me to check, from each model, the coefficient that should correspond to the difference between the "treated-single" group and the "untreated-single" group (at least as far as I understand). However, the resulting log2FoldChange values and P values show minor differences, although the correlation across all genes is essentially 1. My question is: is this simply a numerical issue caused by loss of precision during computation, is there another technical cause, or was I wrong in my understanding of the models and their interpretations?
Thanks a lot in advance!
library(DESeq2)
library(pasilla)
# count matrix
cts <- data.matrix(read.csv(
    system.file("extdata", "pasilla_gene_counts.tsv", package="pasilla", mustWork=TRUE),
    sep="\t", row.names="gene_id"))
cts <- cts[, c(5:7, 1:4)]  # reorder columns to match the sample annotation below
# meta data
coldata <- read.csv(
    system.file("extdata", "pasilla_sample_annotation.csv", package="pasilla", mustWork=TRUE),
    row.names=1)
coldata <- coldata[, c("condition", "type")]
rownames(coldata) <- sub("fb", "", rownames(coldata))  # match count-matrix column names
# set the base levels explicitly
coldata$condition <- factor(coldata$condition, levels=c("untreated", "treated"))
coldata$type <- factor(substr(coldata$type, 1, 6), levels=c("single", "paired"))
# combined factor, with base level "untreatedsingle"
coldata$group <- with(coldata, factor(paste0(condition, type),
    levels=c("untreatedsingle", "treatedsingle", "untreatedpaired", "treatedpaired")))
# model 1: interaction design
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~ condition * type)
dds <- DESeq(dds)
# main effect of condition, i.e. treated vs untreated at the reference type ("single")
res1 <- results(dds, contrast=c("condition", "treated", "untreated"))
# model 2: one combined group factor
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~ group)
dds <- DESeq(dds)
# treatedsingle vs untreatedsingle, which should match the coefficient above
res2 <- results(dds, contrast=c("group", "treatedsingle", "untreatedsingle"))
# compare the results from the two models:
all.equal(res1$log2FoldChange, res2$log2FoldChange)
# [1] "Mean relative difference: 0.008926539"
all.equal(res1$pvalue, res2$pvalue)
# [1] "Mean relative difference: 0.001391931"
If you restrict yourself to, say, genes with a baseMean > 50, are the numerical differences between the two designs significant?
Well, if I look at genes with baseMean > 50, the relative absolute difference in log2FoldChange (relative to the absolute log2FoldChange from the first model) is on average about 0.0001. In the most extreme cases, however, the first model gives a log2FoldChange of about -0.2 while the second gives log2FoldChange = 0 (though these cases have non-significant P values).
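Roughly, I computed this as follows (a sketch, assuming res1 and res2 from the code in the question; the exact numbers may vary slightly between DESeq2 versions):

# restrict to reasonably expressed genes
keep <- res1$baseMean > 50
lfc1 <- res1$log2FoldChange[keep]
lfc2 <- res2$log2FoldChange[keep]
# relative absolute difference, taking model 1 as the reference
rel_diff <- abs(lfc1 - lfc2) / abs(lfc1)
mean(rel_diff[is.finite(rel_diff)])  # ~1e-4
# inspect the most discrepant genes together with their P values
ord <- order(abs(lfc1 - lfc2), decreasing=TRUE)
head(data.frame(lfc1=lfc1[ord], lfc2=lfc2[ord], pvalue=res1$pvalue[keep][ord]))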