Hi,
I'm currently exploring the functions in DESeq2 and try to replicate some findings of DOI: 10.1126/science.aan3456. In this paper, they test several models to identify clusters of differentially expressed genes between 3 species (H[uman]/C[himp]/M[acaque]). Some of their models they test are as followed: Model 0: count ~ Batch Model H: count ~ species + batch (2 levels of species: human, chimpanzee/macaque) Model A: count ~ species + batch (3 levels of species: human, chimpanzee, macaque)
To identify clusters of genes where H > C > M, they first test Model H vs reduced model 0. This also works fine in my code, as I can easily drop the complete factor species. However, I can't figure out how to combine chimpanzee/macaque, ultimately having something like:
dds <- DESeqDataSetFromMatrix(countData=counts, colData=metaData, design=~Batch + ModelA)
dds <- DESeq(dds, test="LRT", reduced = ~Batch + ModelH)
As this gives the error:
Error in checkLRT(full, reduced) :
the following variables in the reduced formula not in the full formula: ModelH
Does anyone have an idea how to reduce the factor specie from 3 to 2 for the reduced model, so I can compare these 2 models?
Thanks in advance!
Thank you for your fast comment, and excuse me for my belated reply!
I have been working on this for the last few weeks and came across some errors in my own code. I am still having doubts on my final comparison, as I want to include batch effects in my models as well.
My first design is: ~Batch+Species (3 levels, human/chimp/mac), vs reduced ~Batch. - This is to identify any gene showing differential expression between the 3 species.
On these identified genes I do post-hoc testing. To do so I added the 2 factors in my metadata as you suggested [species 1, species2], such that human = [1 1], chimp = [1 2], macaque = [2 2].
My first test is to differentially expressed genes in humans versus the combined group. Full model = ~Batch + species1; reduced = ~Batch.
My second post hoc comparison however is to identify the non significant genes between the 3 species comparison controlled for batch effects, and the human vs 2 species. model.matrix(~Batch + Species1 + species2) however returns that the model is not full rank,
Therefore I used the following approach: - Full model: ~Batch + species1 + species2 - Reduced: ~Batch + species1 and extract genes > FDR padj value. However, I am in doubt if I am using this comparison as you have illustrated it.
I'm looking forward to your reply and want to thank you again for your help!
If you want to assess if a particular LFC is close to zero, we have a framework for a minimum effect size using lfcThreshold and altHypothesis (see documentation). Here you have to define “close” — you have to come up with an LFC that is close enough to zero for your application.
Hi,
Again thanks for the quick reply. It's not the LFC per se, but the models used in the second post hoc comparison:
using an LRT test i do: Full model: ~Batch + species1 + species2 vs Reduced: ~Batch + species1
and results are res <- results(dds,contrast = c("species2","1","2"))
From these results I extract > padj. It questions me if this is the correct way for a comparison of the 3 species (as represented in the full model by the combination of species 1 & species2) vs a 2 species comparison in the reduced model.
This is just an LFC (coefficient) though. Adjusted p-value for this coefficient being large doesn't really represent anything. However, we do have a framework for evidence that this coefficient is between -theta and theta, with FDR bound etc.
Interesting! Another approach thus could also be to:
Or do I misunderstand you?
I don't recommend this -- it's best to avoid interpreting p > alpha or adjusted p > alpha as evidence for something.
Thank you for your answer!
As I am trying to replicate the original findings of the paper, they state that their thresholds are the FDR and the 'mean'. In further analyses/illustrations they highlight 2-fold changes. So the approach as described seems to be replicating their approach I believe.
I do understand your argument on the p-values. Will definitely take that into account on further analyses!
Glad to hear that the 3-group approach with species1 + species2 + batch effects seems the right one for the 3 species comparison. Thank you for your help!