Hi everyone, I have theoretical questions about the workflow that would better fit my RNAseq data. I am working on a dataset with a total of 15 samples. The experiment consists of a single strain of a fungus exposed (treatment) or not (control) to a fungicide. Two sets of "control" and "treatment" samples were picked, at 1 day and 4 days, respectively. Below is the coldata object.
condition day
T1A13 Control 1
T1B13 Control 1
T1C13 Control 1
T1D13 Control 1
T1VA13 Treatment 1
T1VB13 Treatment 1
T1VC13 Treatment 1
T1VD13 Treatment 1
T4A13 Control 4
T4B13 Control 4
T4C13 Control 4
T4VA13 Treatment 4
T4VB13 Treatment 4
T4VC13 Treatment 4
T4VD13 Treatment 4
I went through the DESeq2 vignette, as well as this guide, but I'm almost sure I'm getting something wrong. I would like to specify that the samples picked at day 4 are not the same as those sampled at day 1: in other words, all the samples in the dataset are different individuals. My main biological question is: are some genes impacted by the treatment initially, but then up-regulated as a consequence of prolonged exposure? The part of the vignette explaining the "overall" (I have difficulties in getting the meaning of "overall" here) effect of condition over genotype goes with:
~genotype + condition
while the "main" effect of condition on the reference genotype is summarized by:
~genotype + condition + genotype:condition
Therefore, in my case I would use
~day + condition + day:condition
to answer my question, right? My doubt is: with this formula, am I assuming a priori that the effect of "condition" depends on "day"?
After this, should I just run:
results(dds, contrast=c("condition", "treatment", "control"))
since I am already considering "day" to have an influence on the data? Or should I still contrast, separately, the different groups:
Levels: Control1 Control4 Treatment1 Treatment4
?
I saw that most guides on time series report >=3 time points, so another question is: can I use a "time series" workflow with just two time points?
Thank you all in advance!
Many thanks for the reply.
I still have a doubt: in the
?results
examples, I see thatresults(dds, contrast=c("condition","B","A"))
gives the effect of condition for genotype I, which would be day 1 in my case. You suggest something similar toresults(dds, name="genotypeII.conditionB")
, which would beresults(dds, name="day4.conditionTreatment")
in my case. This gives me 678 DEGs. Instead if, with the same design, I use:I get 667 DEGs. Should I call DESeq with the LRT test, and then get the padj values with
results(dds, name="day4.conditionTreatment")
? Or should I use "Wald", and then call the results in the same way? I understand that the 678 vs 667 is not that different, but I would like to understand which is the best way to proceed. Many thanks again for your help.