Hello everyone,
I have some questions about the design formula in DESeq2. I am still quite a beginner in R and have a large dataset to analyze so I have been trying different formulas to find which best describes my design layout, and I have some problems understanding which one is best.
So my experimental set-up is that I have 4 biological replicates and a time course. The baseline is time v0, and then 4 additional timepoints v1, v2, v3, v4 where we would expect something to happen at timepoints v2 and v3 and probably almost everything goes back to baseline at timepoint v4.
I have tried the design formula: ~ time + patient and the reduced formula ~ patient to see the overall effect of time, would that be the correct way to go?
Here I got quite confused, because when I change the patient labels in colData from the numbers 1, 2, 3, 4 to pt1, pt2, pt3, pt4, I seem to get different results.
With numerical patients 1, 2, 3, 4, resultsNames(dds) gives:
[1] "Intercept" "time_v1_vs_v0" "time_v2_vs_v0" "time_v3_vs_v0" "time_v4_vs_v0" "patient"
With non-numerical patients (pt1, pt2, ...), there are additional comparisons such as pt1_vs_pt2 and so on, and also different results (a different number of significant genes).
Why is that and what would be the correct way to do it? I would assume that the character one would be right, since I see that being used everywhere.
Additionally, to test for timepoint-specific differences, would it be correct to use the non-reduced formula ~ time + patient and then contrast timepoint v1 vs timepoint v0 and do so for each time point, as long as afterward I correct for the additional testing?
I actually have 10 different subsets, and I have run this whole analysis on each of them separately. Can I do one combined comparison with something like
~ subset + time + patient and then again the reduced formula ~ subset+patient and see the overall effects? Or how would I see the overall effects across subsets over time?
Thank you so much for any answers,
Best,
Elena
When you give DESeq2 the patient IDs as numbers, I think it treats patient as a continuous variable, which is why you see a single "patient" coefficient in resultsNames(dds). This is probably not what you want. A continuous covariate would work better for something like weight or height or body fat % or whatever.
When you add the "pt" in front, patient is treated as a categorical variable, so you get one coefficient per patient relative to the reference patient. This is probably closer to what you want if patient is just there to account for the pairing while you ask questions like "What genes are up at time 1 but not time 4?"
If you want to just put the numbers in, but treat it as a categorical variable, you should change that column in the sample matrix to a factor. Using dplyr syntax: samps_matx <- dplyr::mutate(samps_matx, patient = factor(patient))
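To make the difference concrete, here is a small base R sketch of what the design matrix looks like in both cases. The `samples` table is a made-up stand-in for your colData (4 patients, time points v0 to v4), not your real data, and `patient_f` is just an illustrative column name.

```r
# Hypothetical sample table: 4 patients x 5 time points, patient stored as a number
samples <- expand.grid(time = paste0("v", 0:4), patient = 1:4)
samples$time <- factor(samples$time, levels = paste0("v", 0:4))

# Numeric patient: the model fits ONE slope for "patient"
numeric_cols <- colnames(model.matrix(~ time + patient, data = samples))
print(numeric_cols)

# Factor patient: one indicator column per non-reference patient
samples$patient_f <- factor(paste0("pt", samples$patient))
factor_cols <- colnames(model.matrix(~ time + patient_f, data = samples))
print(factor_cols)
```

The numeric version has 6 columns and the factor version has 8, which matches the extra patient comparisons you saw in resultsNames(dds).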
I'm not sure what you mean about the different subsets.
Ok yes, thank you so much for the quick response, I was thinking it would be something like that. Then I will just use the character labelling for the colData.
To the second part about the subsets
So I have
subset1 pt1 time0
subset1 pt1 time1
subset1 pt1 time2
....
subset2 pt1 time0 ....
and so on for subsets 1-10, pt 1-4 and time 0-4.
So 200 different samples total. Mostly I am interested in the differences within a subset, which is why I am looking at each subset just individually using ~ time + patient and then reduced ~ patient.
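For reference, this is roughly what the per-subset analysis looks like as code. It is only a sketch: the counts come from DESeq2's `makeExampleDESeqDataSet()` simulator rather than my data, `sample_info` is a made-up stand-in for the colData of one subset, and the DESeq2 part only runs if the package is installed.

```r
# Hypothetical colData for ONE subset: 4 patients x 5 time points
sample_info <- expand.grid(time = paste0("v", 0:4), patient = paste0("pt", 1:4))
sample_info$time <- relevel(factor(sample_info$time), ref = "v0")  # v0 is the baseline
sample_info$patient <- factor(sample_info$patient)

if (requireNamespace("DESeq2", quietly = TRUE)) {
  library(DESeq2)
  set.seed(1)
  # Simulated counts, only so that the sketch can run end to end
  dds <- makeExampleDESeqDataSet(n = 500, m = nrow(sample_info))
  dds$time <- sample_info$time
  dds$patient <- sample_info$patient
  design(dds) <- ~ time + patient

  # Likelihood ratio test: does dropping time make the fit worse for a gene?
  dds_lrt <- DESeq(dds, test = "LRT", reduced = ~ patient)
  res_time <- results(dds_lrt)

  # Default Wald test, then one time point against baseline
  dds_wald <- DESeq(dds)
  res_v2 <- results(dds_wald, contrast = c("time", "v2", "v0"))
  summary(res_v2)
}
```

As far as I understand, the LRT p-value answers "is there any time effect at all", while the contrast gives one specific comparison. The padj column from `results()` is adjusted (Benjamini-Hochberg by default) within that one table only, so whether to correct further across the four time-point contrasts is a separate decision.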
But I would also like to see the overall differences, so put all subsets together with ~ subset + time + patient and then the reduced model ~ subset+patient. Would that give me the time difference overall?
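To check what the pooled model would actually test, here is a base R sketch of the design matrices. It assumes that pt1 in subset1 is the same individual as pt1 in subset2 (if the patients differ between subsets, the patient labels would need to be made unique per subset), and `all_samples` is a made-up table with the 10 x 4 x 5 layout described above.

```r
# Hypothetical layout: 10 subsets x 4 patients x 5 time points = 200 samples
all_samples <- expand.grid(
  time    = paste0("v", 0:4),
  patient = paste0("pt", 1:4),
  subset  = paste0("subset", 1:10)
)

full    <- model.matrix(~ subset + time + patient, data = all_samples)
reduced <- model.matrix(~ subset + patient, data = all_samples)

# The design must be full rank, otherwise DESeq2 refuses to fit it
full_rank <- qr(full)$rank == ncol(full)

# Coefficients removed in the reduced model = what the LRT tests
tested <- setdiff(colnames(full), colnames(reduced))
print(tested)

# Alternative: let the time effect differ between subsets
interaction <- model.matrix(~ patient + subset * time, data = all_samples)
print(ncol(interaction))
```

So the additive model tests the four time coefficients, meaning a time effect that is assumed to be the same in every subset. If the question is instead whether the time course differs between subsets, one option would be a full design of ~ patient + subset * time with the reduced design ~ patient + subset + time, so that the LRT tests only the subset:time interaction terms.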