Hi,
This question is inspired by several earlier posts asking how to think about multifactorial designs ("Multifactorial experimental design in DESeq2" and "DESeq2: with multiple factors and interaction terms won't show all effects").
A few features have been added to DESeq2 recently that are not yet well documented, such as the grouping approach, and I sense some confusion about how to use them.
I'm writing this to ask for more general advice on how to think about these problems and how to get biologically relevant results.
Here is an example design:
Genotype | Treatment3 | Treatment5 | Replicate |
LL | 3 | 5 | 1 |
LL | 3 | 5 | 2 |
LL | 3 | ctrl | 1 |
LL | 3 | ctrl | 2 |
LL | control | 5 | 1 |
LL | control | 5 | 2 |
LL | control | ctrl | 1 |
LL | control | ctrl | 2 |
M | 3 | 5 | 1 |
M | 3 | 5 | 2 |
M | control | 5 | 1 |
M | control | 5 | 2 |
M | 3 | ctrl | 1 |
M | 3 | ctrl | 2 |
M | control | ctrl | 1 |
M | control | ctrl | 2 |
OP | 3 | 5 | 1 |
OP | 3 | 5 | 2 |
OP | 3 | ctrl | 1 |
OP | 3 | ctrl | 2 |
OP | control | 5 | 1 |
OP | control | 5 | 2 |
OP | control | ctrl | 1 |
OP | control | ctrl | 2 |
Level orders:
Genotype: LL, M, OP
Treatment3: control, 3
Treatment5: ctrl, 5
Replicate: 1, 2
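To make the layout concrete, here is a minimal sketch that rebuilds the sample table with the level orders listed above, using base R only. The object name coldata is my own choice, and the row order will not match the table exactly.

```r
# Rebuild the sample table with base R only (no DESeq2 needed).
# expand.grid varies its first argument fastest.
coldata <- expand.grid(
  Replicate  = c("1", "2"),
  Treatment5 = c("ctrl", "5"),
  Treatment3 = c("control", "3"),
  Genotype   = c("LL", "M", "OP"),
  stringsAsFactors = FALSE
)

# Set the level order explicitly instead of relying on alphabetical order.
coldata$Genotype   <- factor(coldata$Genotype,   levels = c("LL", "M", "OP"))
coldata$Treatment3 <- factor(coldata$Treatment3, levels = c("control", "3"))
coldata$Treatment5 <- factor(coldata$Treatment5, levels = c("ctrl", "5"))
coldata$Replicate  <- factor(coldata$Replicate,  levels = c("1", "2"))

# Each Genotype x Treatment3 x Treatment5 cell should hold two replicates.
cell_counts <- table(coldata$Genotype, coldata$Treatment3, coldata$Treatment5)
print(nrow(coldata))
print(all(cell_counts == 2))
```

This gives 24 samples, two per combination of Genotype, Treatment3 and Treatment5.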
It would be great to have some general advice on how to work with a dataset like this. For instance: should I use droplevels or not? Should size factors be re-estimated between queries? Should I set up a different grouping and design for each query, or fit one maximal model that contains all the resultsNames() needed for every desired comparison? I appreciate that these are more philosophical than practical questions, but the answers would be really helpful, especially for biologists who may not be very familiar with GLMs and don't intuitively know what is meant by terms like "main effect", "effect for Celltype" or "Intercept".
The questions I'm interested in are listed below. I know some of them are quite "simple", but I think answering them would help people understand what to do with their data. I've tried to formulate them in biological rather than mathematical terms: the type of questions my favourite biologists would ask me.
For each one, I'd like to know:
1. What is the best design to choose, and why? For example, ~Genotype+Treatment3+Genotype:Treatment3 or ~Genotype+Genotype:Treatment3?
2. How do I extract the result I want using the results() function?
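To illustrate the first point, here is a small sketch comparing the two candidate designs using base R's model.matrix(). It assumes the standard treatment-contrast coding; the coefficient names that DESeq2 reports through resultsNames() may differ depending on the version, so treat the column names as illustrative only. The object names are my own.

```r
# Genotype x Treatment3 layout with the level orders given above.
coldata <- expand.grid(
  Treatment3 = factor(c("control", "3"), levels = c("control", "3")),
  Genotype   = factor(c("LL", "M", "OP"), levels = c("LL", "M", "OP"))
)

# Design A: main effects plus interaction.
mm_full <- model.matrix(~ Genotype + Treatment3 + Genotype:Treatment3, coldata)

# Design B: Genotype main effect plus a Treatment3 effect within each Genotype.
mm_nested <- model.matrix(~ Genotype + Genotype:Treatment3, coldata)

print(colnames(mm_full))
print(colnames(mm_nested))

# Both matrices have one column per Genotype x Treatment3 cell and span the
# same space, so they fit the same means; only the meaning of each
# coefficient changes.
same_space <- qr(cbind(mm_full, mm_nested))$rank == ncol(mm_full)
print(same_space)
```

If that is right, the choice between the two designs is about which coefficients are convenient to ask for, not about which model fits better.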
What genes are DE between LL and M?
What genes are DE between LL and (M and OP)?
What genes are DE between M and (LL and OP)?
What genes are DE between M and the main effect (LL,M,OP)?
What genes are DE between M and OP? (When I do this comparison should I drop LL samples and recalculate or not?)
What genes are DE between M(control,ctrl) and OP(control,ctrl)? (Should I use groups for this?)
What genes are DE between LL(control,5) and OP(control,5)?
What genes are DE between control and treatment 3, controlling for the effect of Genotype and Treatment 5?
What genes are DE between control and treatment 3, not controlling for the effect of Genotype and Treatment 5?
What genes are DE between control and treatment 3 in M?
What genes are DE between control and treatment 3 in M, but not in OP?
What genes are DE between control and treatment 3 in M, and in OP, but have the opposite effect (i.e. up in M, and down in OP)?
What genes are DE between control and treatment 5 in M, but not in LL or OP?
Are there genes showing a synergistic effect of combining treatment 3 and treatment 5?
Are there genes showing a synergistic effect of combining treatment 3 and treatment 5 that differs between LL and OP (seen in one but not the other)?
Lastly, it is possible to do comparisons like:
results(dds, contrast=list("GenotypeM.Treatment3cntl","GenotypeO.Treatment55"))
...but I very much doubt that this gives a biologically meaningful result. What kinds of contrasts should be avoided?
And what do you actually get from a request like this (weird):
results(dds, contrast=list(
c("Treatment3cntl","GenotypeM.Treatment3cntl"),
c("Treatment3cntl","GenotypeO.Treatment3cntl")))
or this (possibly relevant)?
results(dds, contrast=list(
c("Treatment3cntl","GenotypeO.Treatment3cntl"),
c("Treatment33","GenotypeO.Treatment33")))
I know this is a lot of questions (feel free to answer just a subset), but I feel that many people are working with this type of dataset (I've got three going at the moment). The current manual doesn't cover it very well, and the help page ?results covers it in mathematical but not really in biological terms.
For example:
# the set Z effect compared to the average of set X and Y
# here we use 'listValues' to multiply the effect sizes for
# set X and set Y by -1/2
results(dds, contrast=list("setZ",c("setX","setY")), listValues=c(1,-1/2))
- Okay, so if I do that, what do I actually get as results? Genes that are differentially expressed between Z and the average of X and Y? Or "the set Z effect compared to the average of set X and Y", and what does that mean?
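My current reading of the arithmetic, which I'd like someone to confirm or correct, is sketched below in base R. The coefficient values are made up for a single hypothetical gene; only the weighting is meant to mirror the listValues example.

```r
# Hypothetical log2 coefficients for one gene; these numbers are made up.
beta <- c(setX = 0.5, setY = 1.5, setZ = 3.0)

# contrast = list("setZ", c("setX", "setY")), listValues = c(1, -1/2):
# terms in the first element are weighted by 1,
# terms in the second element are weighted by -1/2.
weights <- c(setX = -1/2, setY = -1/2, setZ = 1)

# The reported log2 fold change would then be the weighted sum.
lfc <- sum(weights * beta[names(weights)])
print(lfc)  # 3.0 - (0.5 + 1.5) / 2 = 2
```

So numerically it looks like "Z minus the mean of X and Y" on the log2 scale, but I'm unsure how to phrase that biologically.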
Not that ?results is necessarily the best place to elaborate further, but perhaps this is a good place?
Cheers!!!