I read the code of DESeq2 roughly and realized that for each gene, one dispersion estimate is used across all samples. My work is to compare the gene expression profiles of mice from 2 different genetic backgrounds (WT and KO). Certain genes are expected to be highly expressed in one genotype but not expressed in the other. I'm curious about how DESeq2 would estimate 1 dispersion value that would fit both genotypes.
I read from the DESeq2 vignette that my design formula will affect the way dispersion is calculated, but no details were given. I couldn't find details regarding how DESeq2 handles this from the reference manual either. I'm not very familiar with bayesian models so I decided to reach out here instead of digging through the code.
The type of sample I receive looks like this:
df <- data.frame(sample = paste0("s", 1:6),
genotype = c("WT", "KO", "WT", "KO","WT", "WT" ),
batch = paste0("batch", c(1,1,2,2,3,3)))
> df
sample genotype batch
1 s1 WT batch1
2 s2 KO batch1
3 s3 WT batch2
4 s4 KO batch2
5 s5 WT batch3
6 s6 WT batch3
The current "flaw" is that in the batch3
both are WT, because the experiment is not done yet and I'll have more batches coming.
To construct a generalized linear model, it seems reasonable to have the design formula like this: ~ batch + genotype
. From my understanding of glm, the order of batch
and genotype
in the design formula shouldn't matter. However, DESeq2 vignette suggests putting the most important variable in the last place. I assume this is for the purpose of dispersion estimation?
Also I think with the design formula ~batch + genotype
, batch3
would be useless in the estimation of differential expression between genotypes since the differences are regressed out in the "batch" part, is that correct?
I would highly appreciate any input on this! Thanks~
“the vignette suggests putting the most important variable in the last place”
See the section, “Note on factor levels”. It only matters in that, when you call results(dds) with no other argument, the software has to guess which coefficient you want a results table for. It uses by default (only when no other information is provided) the last variable in the design, which is typical for R and Bioc packages.