Question

DESeq2 design formula

Elena • 0

Germany

Hello everyone,

I have some questions about the design formula in DESeq2. I am still quite a beginner in R and have a large dataset to analyze so I have been trying different formulas to find which best describes my design layout, and I have some problems understanding which one is best.

So my experimental set-up is that I have 4 biological replicates and a time course. The baseline is time v0, and then 4 additional timepoints v1, v2, v3, v4 where we would expect something to happen at timepoints v2 and v3 and probably almost everything goes back to baseline at timepoint v4.

I have tried the design formula ~ time + patient together with the reduced formula ~ patient to see the overall effect of time. Would that be the correct way to go?

Here I was quite confused, because when I change the colData description of the patients from the numbers 1, 2, 3, 4 to pt1, pt2, pt3, pt4, the results seem to differ.

With numerical patients 1, 2, 3, 4, resultsNames(dds) gives:

[1] "Intercept" "time_v1_vs_v0" "time_v2_vs_v0" "time_v3_vs_v0" "time_v4_vs_v0" "patient"

With non-numerical patients (pt1, ...), there are additional comparisons between patients (pt1_vs_pt2 and so on), and the results also differ: I get a different number of significant genes.

Why is that and what would be the correct way to do it? I would assume that the character one would be right, since I see that being used everywhere.

Additionally, to test for timepoint-specific differences, would it be correct to use the non-reduced formula ~ time + patient and then contrast timepoint v1 vs timepoint v0 and do so for each time point, as long as afterward I correct for the additional testing?
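The two approaches described above (a likelihood ratio test for the overall time effect, and one contrast per timepoint against v0) could be sketched roughly as follows. This is only an illustration under stated assumptions: the sample table and the counts are simulated and hypothetical, and the object names (coldata, counts, dds_lrt, dds_wald) are not from the original post. Whether this is the right model for the real data is a separate statistical question.

```r
library(DESeq2)

set.seed(1)

# Hypothetical sample table: 4 patients x 5 timepoints, both stored as factors
coldata <- expand.grid(
  time    = factor(paste0("v", 0:4)),
  patient = factor(paste0("pt", 1:4))
)
rownames(coldata) <- paste(coldata$patient, coldata$time, sep = "_")

# Simulated counts, for illustration only (500 genes x 20 samples);
# each gene gets its own mean, recycled across the columns
mu_gene <- 2^runif(500, 4, 10)
counts <- matrix(
  rnbinom(500 * nrow(coldata), mu = mu_gene, size = 5),
  nrow = 500,
  dimnames = list(paste0("gene", 1:500), rownames(coldata))
)

dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
                              design = ~ time + patient)

# Overall effect of time: likelihood ratio test of full vs. reduced design
dds_lrt <- DESeq(dds, test = "LRT", reduced = ~ patient)
res_lrt <- results(dds_lrt)

# Timepoint-specific comparison: default Wald test, then one contrast per timepoint
dds_wald <- DESeq(dds)
res_v2 <- results(dds_wald, contrast = c("time", "v2", "v0"))
```

Each results() table already carries a padj column adjusted within that comparison; any further correction across the several timepoint contrasts would be an additional step on top of that.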

And I actually have 10 different subsets from which I have done this whole thing, can I do a whole comparison with something like

~ subset + time + patient and then again the reduced formula ~ subset+patient and see the overall effects? Or how would I see the overall effects across subsets over time?

Thank you so much for any answers,

Best,

Elena

DESeq2 • 1.1k views

ADD COMMENT • link updated 3.0 years ago by Michael Love 42k • written 3.0 years ago by Elena • 0

When you give DESeq2 the patient labels as numbers (1, 2, 3, 4), I think it treats patient as a continuous variable, which is why you see a single "patient" coefficient. This is probably not what you want. A numeric covariate works better for something like weight, height, or body fat %.

When you add the "pt" in front, patient is treated as a categorical variable, so each patient gets its own level. This is probably closer to what you want. The same applies to time if you're hoping to ask questions like "What genes are up at time 1 but not time 4?"

If you want to keep the numbers but have the column treated as categorical, you should change that column in the sample matrix to a factor. Using dplyr syntax: samps_matx <- dplyr::mutate(samps_matx, patient = factor(patient))
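The difference between a numeric and a factor column can be seen without running DESeq2 at all, by looking at the design matrix that base R builds from the formula. A minimal sketch, assuming a small hypothetical sample table (the samples data frame below is made up for illustration; DESeq2 reports the resulting coefficients under its own names, such as those printed by resultsNames(dds)):

```r
# Hypothetical sample table: 4 patients, 2 timepoints each
samples <- data.frame(
  patient = rep(1:4, each = 2),
  time    = factor(rep(c("v0", "v1"), times = 4))
)

# Numeric patient: a single slope column named "patient"
cols_numeric <- colnames(model.matrix(~ time + patient, data = samples))

# Factor patient: one indicator column per non-reference patient
samples$patient <- factor(samples$patient)
cols_factor <- colnames(model.matrix(~ time + patient, data = samples))

print(cols_numeric)
print(cols_factor)
```

With the numeric column the model fits one linear trend across patient numbers, while the factor version estimates a separate offset for each patient relative to the first.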

I'm not sure what you mean about the different subsets.

ADD REPLY • link 3.0 years ago max.ferretti • 0

Ok yes, thank you so much for the quick response, I was thinking it would be something like that. Then I will just use the character labelling for the colData.

To the second part about the subsets

So I have

subset1 pt1 time0

subset1 pt1 time1

subset1 pt1 time2

....

subset2 pt1 time0 ....

and so on for subsets 1-10, pt 1-4 and time 0-4.

So 200 different samples total. Mostly I am interested in the differences within a subset, which is why I am looking at each subset just individually using ~ time + patient and then reduced ~ patient.

But I would also like to see the overall differences, so put all subsets together with ~ subset + time + patient and then the reduced model ~ subset+patient. Would that give me the time difference overall?
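One way to check what the full-versus-reduced comparison would actually test is to compare the columns of the two design matrices in base R. A minimal sketch, assuming the layout described above (10 subsets, 4 patients, 5 timepoints, with the same patients in every subset); the samples table and the labels in it are hypothetical:

```r
# Hypothetical layout: 10 subsets x 4 patients x 5 timepoints
samples <- expand.grid(
  time    = factor(paste0("v", 0:4)),
  patient = factor(paste0("pt", 1:4)),
  subset  = factor(paste0("subset", 1:10))
)

full    <- model.matrix(~ subset + time + patient, data = samples)
reduced <- model.matrix(~ subset + patient, data = samples)

# Columns in the full design that are missing from the reduced one:
# these are the terms a likelihood ratio test would evaluate
tested_terms <- setdiff(colnames(full), colnames(reduced))

print(nrow(samples))
print(tested_terms)
```

The terms that drop out are the four time coefficients, so the comparison tests a time effect that is assumed to be shared by all subsets. Time effects that differ between subsets would need something like a subset:time interaction, which goes beyond the design written here.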

ADD REPLY • link 3.0 years ago Elena • 0

score 0 · Answer 1 · 2021-10-04

Why is that and what would be the correct way to do it? I would assume that the character one would be right, since I see that being used everywhere.

Questions about the appropriate statistical approach should be directed to a statistical collaborator. It's often possible to find someone at your institute with familiarity with linear models in R. You don't want to choose the analysis based on guessing what the design and interpretation of coefficients should be.