Formula design
We are struggling in understanding the correct design formula for my rnaseq analysis. We have 16 samples, 4 biological replicates with 2 different variables: if they are treated with a drug or not, and if they are stressed or not.
We want to know: a) Which is the effect of the treatment under basal conditions. b) Which is the effect of the treatment under stressed conditions.
During the exploratory analysis, we realized that the individual differences between biological replicates were bigger than than the ones due to treatment, or infection (i.e. the samples cluster together by biological replicate and not by treatment). We are not interested in quantifying these differences, but on normalizing for them (kinda like a paired analysis?). My question is the following:
Given that the 'genotype' (or 'family') effect will be inherent in all the samples (see colData below), should I take it into account when doing the design matrix?
i.e., my colData would look like this:
genotype treatment stress
1 A untreated stressed
2 A treated stressed
3 A untreated No
4 A treated No
5 B untreated stressed
6 B treated stressed
7 B untreated No
8 B treated No
9 C untreated stressed
10 C treated stressed
11 C untreated No
12 C treated No
13 D untreated stressed
14 D treated stressed
15 D untreated No
16 D treated No
If I'm understanding previous answers right, I should normalize by 'genotype', and the technical process would be similar of dealing with having a 'batch' effect, being every genotype a 'batch'. In this case, should our design look like this?
design = ~ treatment + treatment:genotype + treatment:stress
(This would be similar to this case: https://support.bioconductor.org/p/92108/)
Or should I collapse 'treatment' and 'stress' into a grouped 'condition' variable as suggested here (https://support.bioconductor.org/p/67600/#67612), and input the design:
design = ~ genotype + condition
I would then take out the results that we are interested with
## Which is the effect of the treatment under basal conditions.
results(dds, contrast=c("condition","treatedNo", "untreatedNo"))
## Which is the effect of the treatment under stressed conditions.
results(dds, contrast=c("condition","treatedstressed", "untreatedstressed"))
But I am not able to see how I would see the "extra" effect of the treatment under stressed conditions 'relative' to being stressed in this case.
Thank you so much for your help
Thanks for your response Michael.
However, the design
~genotype + stress:treatment
is not possible as the model matrix is not full rank . I guess that I am not fully understanding the linearity here because with my colData (in the original post) I cannot relate to any of the examples on the DESeq2 vignette. There is one copy of each 'genotype' in each group, and they are not nested within groups.The designs
~genotype + stress + treatment + stress:treatment
, or~genotype + treatment + stress:treatment
work, though. Maybe one of those was the one that you meant?Thank you so much for your help, and have a nice weekend,
PS: I am running
and my sessionInfo() is:
Apologies, I had a typo above and left out the base level.
Thanks for your help!