Hi everyone, i have an asymmetrical time-course experiment. Timepoints are not equal for each "treatment" or "condition". In fact the T0 samples are common for all the conditions (AFT0R2, AFT0R3, AP_T0R1) and timepoints then varies for each condition. AP has two timepoints (T1 and T2), AF has three timepoints (T1, T2, T3) and UF has only one (T1). The coldata looks like this:
Sample Condition Time Replicate
AF_T0R2 CP 0 2
AF_T0R3 CP 0 3
AP_T0R1 CP 0 1
AF_T1R1 AF 1 1
AF_T1R2 AF 1 2
AF_T1R3 AF 1 3
AF_T2R1 AF 2 1
AF_T2R2 AF 2 2
AF_T2R3 AF 2 3
AF_T3R1 AF 3 1
AF_T3R2 AF 3 2
AF_T3R3 AF 3 3
AP_T1R1 AP 1 1
AP_T1R2 AP 1 2
AP_T1R3 AP 1 3
AP_T2R1 AP 2 1
AP_T2R2 AP 2 2
AP_T2R3 AP 2 3
UF_T1R1 UF 1 1
UF_T1R2 UF 1 2
UF_T1R3 UF 1 3
What i've done until now is subsetting the dataset to obtain symmetrical timepoints to perform a time-course analysis using this coldata:
Sample Condition Time Replicate
AF_T0R2_2 AF 0 2
AF_T0R3_2 AF 0 3
AP_T0R1_2 AF 0 1
AF_T1R1 AF 1 1
AF_T1R2 AF 1 2
AF_T1R3 AF 1 3
AF_T3R1 AF 2 1
AF_T3R2 AF 2 2
AF_T3R3 AF 2 3
AF_T0R2 AP 0 2
AF_T0R3 AP 0 3
AP_T0R1 AP 0 1
AP_T1R1 AP 1 1
AP_T1R2 AP 1 2
AP_T1R3 AP 1 3
AP_T2R1 AP 2 1
AP_T2R2 AP 2 2
AP_T2R3 AP 2 3
The main problem with this approach is that some samples and conditions are left out of the analysis and, more importantly, i had to duplicate the T0 samples that are common to both conditions using the 2 (AFT0R22, AFT0R2 etc). Using this approach i obtained fairly good results using:
ddsTC <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata,
design = ~ Condition + Time + Condition:Time)
ddsTC <- DESeq(ddsTC, test="LRT", reduced = ~ Condition + Time)
Another approach would be to make single LRT tests for each condition using ~ Time and a reduced formula of ~ 1. This way i would obtain all DEGs that changes significantly during time (even with only 2 timepoints) for each condition. Is this correct? Then i should be able to do some clustering and find intersections etc. If so can you help me with this kind of analysis? How can i exclude samples from the coldata to perform the LRT test only on, for example, the AF condition? I would like to perform an LRT for each condition using Time as factor. Something like:
Sample Condition Time Replicate UF AF AP
AF_T0R2 CP 0 2 T0 T0 T0
AF_T0R3 CP 0 3 T0 T0 T0
AF_T1R1 AF 1 1 T1
AF_T1R2 AF 1 2 T1
AF_T1R3 AF 1 3 T1
AF_T2R1 AF 2 1 T2
AF_T2R2 AF 2 2 T2
AF_T2R3 AF 2 3 T2
AF_T3R1 AF 3 1 T3
AF_T3R2 AF 3 2 T3
AF_T3R3 AF 3 3 T3
AP_T0R1 CP 0 1 T0 T0 T0
AP_T1R1 AP 1 1 T1
AP_T1R2 AP 1 2 T1
AP_T1R3 AP 1 3 T1
AP_T2R1 AP 2 1 T2
AP_T2R2 AP 2 2 T2
AP_T2R3 AP 2 3 T2
UF_T1R1 UF 1 1 T1
UF_T1R2 UF 1 2 T1
UF_T1R3 UF 1 3 T1
and using:
dds1 <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata,
design = ~ AP)
dds1 <- DESeq(dds1 test = "LRT", reduced = ~ 1)
dds2 <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata,
design = ~ AF)
dds2 <- DESeq(dds2, test = "LRT", reduced = ~ 1)
dds3 <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata,
design = ~ UF)
dds3 <- DESeq(dds3, test = "LRT", reduced = ~ 1)
leaving some of the rows empty is not working, and it is using an empty factor. I was able to obtain DEGs for each comparison using subset of the dataset, but this way the sizefactors are different for each dataset. Probably i could use estimatesizefactors for the complete dataset and then use it for the subsets. Do you think this could work? Could you provide an example of code for doing it?
I hope i was clear enough. Thank you for the help!
Guglielmo
Thank you very much. I knew it would be a mistake to duplicate samples.
Could you just tell me how to perform LRT using time for each condition? I can't find anything online or in the manual. I mean how can i subset the dataset for doing multiple LRT for groups of samples? Thanks again
Consider the set of conditions for which you have T1 samples: AF, AP, UF. You have 3 samples at T0 and then three replicates each for these three groups. What is your null hypothesis? Once you think of the null hypothesis, it becomes easier to think what the full and reduced designs should be. Since the T0 samples are not AF, AP, or UF specific, it seems like they are not useful for assessing the null hypothesis. Talking through these considerations with a local statistician would be a good idea. It's not a question of how to use the software, but how to formulate your null hypotheses.
Yes you are correct. The T0 is the starting point for all the 3 conditions. Imagine the same sample that undergo three different treatments each with a set of timepoints. I agree that the problem is finding the null hypothesis but also finding a practical way of doing what i'm thinking. I think that using T0 as a common timepoint is indeed useful because we could have genes that are down-regulated in UF at T1 and that aren't in AF at T1. We can't find these genes without using the T0.
The final goal of the analysis i finding DEGs in the following three cases: T0 -> T1 (UF) T0 -> T1 -> T2 -> T3 (AF) T0 -> T1 -> T2 (AP)
making an LRT test using Time as design and ~ 1 as reduced formula for these three cases should give all genes that significantly changed (expression) during time for each condition. The main problem is that the T0 is a common time and that i don't know how to use the LRT on sub-groups of a factor. What i would like to do is testing for significant differences among time for each case.
I would approach a statistician. It's really a question of how you want to analyze your data, not DESeq2 specific (unfortunately, I just don't have any free time right now to do statistical consulting here).
Yes you are right. Can you just tell me if it is possible to use the LRT test on only some samples using only some factors of a column?
My real problem is that i don't know how to treat the common control timepoint (T0). Should i use three individual subsets? i know it's not ideal.
Can you tell me how to use the estimatesizefactor output in a separate deseq2 experiment? Thank you for all the time you are spending on my problem!
"Can you just tell me if it is possible to use the LRT test on only some samples using only some factors of a column?"
The LRT is very general, as long as you can specify it as two design formula, or as two model matrices (see
?DESeq
), you can perform the test. However, properly specifying them so the result is meaningful is the job of the analyst. For example, the reduced design must be nested within the full. Here you really need to work with someone to help analyze this particular dataset."Can you tell me how to use the estimatesizefactor output in a separate deseq2 experiment?"
If you are asking, can you estimate size factors over the dataset, and then split into separate datasets, yes you can. Just run:
Then you can split
dds
by columns, and size factors will not be re-estimated later when you run, e.g.DESeq
.Thank you so much, yes you are right. My main problem is to use the LRT test for only a subset of the dds. Let's say to test the effect of Time in the AF samples. If i understand correctly this can't be done, because i should use a column with only the AF timepoints in the design correct?
Could you tell me how to split the dds dataset? i can't find anything in the vignette or online. Thank you again
dds
is based on the SummarizedExperiment object in Bioconductor. It can be subset using[
indexing like a matrix in R. It will then subset the colData automatically. One note is that, if you subset adds
and you have a factor in colData where you have removed all the samples corresponding to one of the levels of the factor, you shouldrelevel
the factor to remove the missing level.This is game changer for me xD. I used the relevel and droplevels function tu remove the T3 from AF and it's working correctly it seems. Thank you very much!
This way i can use the LRT for the three comparisons and obtain DEGs for each of the condition during time.