Hello,
I am using DESeq2 on R (version 3.2.2) for analysis of small RNA expression data. I have some questions about how to set-up the experiment to answer the questions I am interested in.
I have a total of 80 samples. This consists of 4 different stages (ages) of fish embryo. At each stage, there are 5 treatments - this includes a control, 2 anoxic exposure treatments, and 2 aerobic recovery exposure treatments. n=4 for each stage/treatment combination. Here is table of the data:
Stage |
Treatment |
(treatment description) |
D2 |
A |
Control |
D2 |
A |
Control |
D2 |
A |
Control |
D2 |
A |
Control |
D2 |
B |
Early anoxia |
D2 |
B |
Early anoxia |
D2 |
B |
Early anoxia |
D2 |
B |
Early anoxia |
D2 |
C |
Late anoxia |
D2 |
C |
Late anoxia |
D2 |
C |
Late anoxia |
D2 |
C |
Late anoxia |
D2 |
D |
Early recovery |
D2 |
D |
Early recovery |
D2 |
D |
Early recovery |
D2 |
D |
Early recovery |
D2 |
E |
Late recovery |
D2 |
E |
Late recovery |
D2 |
E |
Late recovery |
D2 |
E |
Late recovery |
4DPD |
A |
Control |
4DPD |
A |
Control |
4DPD |
A |
Control |
4DPD |
A |
Control |
4DPD |
B |
Early anoxia |
4DPD |
B |
Early anoxia |
4DPD |
B |
Early anoxia |
4DPD |
B |
Early anoxia |
4DPD |
C |
Late anoxia |
4DPD |
C |
Late anoxia |
4DPD |
C |
Late anoxia |
4DPD |
C |
Late anoxia |
4DPD |
D |
Early recovery |
4DPD |
D |
Early recovery |
4DPD |
D |
Early recovery |
4DPD |
D |
Early recovery |
4DPD |
E |
Late recovery |
4DPD |
E |
Late recovery |
4DPD |
E |
Late recovery |
4DPD |
E |
Late recovery |
12DPD |
A |
Control |
12DPD |
A |
Control |
12DPD |
A |
Control |
12DPD |
A |
Control |
12DPD |
B |
Early anoxia |
12DPD |
B |
Early anoxia |
12DPD |
B |
Early anoxia |
12DPD |
B |
Early anoxia |
12DPD |
C |
Late anoxia |
12DPD |
C |
Late anoxia |
12DPD |
C |
Late anoxia |
12DPD |
C |
Late anoxia |
12DPD |
D |
Early recovery |
12DPD |
D |
Early recovery |
12DPD |
D |
Early recovery |
12DPD |
D |
Early recovery |
12DPD |
E |
Late recovery |
12DPD |
E |
Late recovery |
12DPD |
E |
Late recovery |
12DPD |
E |
Late recovery |
20DPD |
A |
Control |
20DPD |
A |
Control |
20DPD |
A |
Control |
20DPD |
A |
Control |
20DPD |
B |
Early anoxia |
20DPD |
B |
Early anoxia |
20DPD |
B |
Early anoxia |
20DPD |
B |
Early anoxia |
20DPD |
C |
Late anoxia |
20DPD |
C |
Late anoxia |
20DPD |
C |
Late anoxia |
20DPD |
C |
Late anoxia |
20DPD |
D |
Early recovery |
20DPD |
D |
Early recovery |
20DPD |
D |
Early recovery |
20DPD |
D |
Early recovery |
20DPD |
E |
Late recovery |
20DPD |
E |
Late recovery |
20DPD |
E |
Late recovery |
20DPD |
E |
Late recovery |
My goals are to (1) identify smRNAs (genes) differentially expressed in response to anoxia/recovery at each stage (i.e. within D2, within 4dpd...); (2) be able to compare expression of a small RNA in one stage to another stage (i.e. between D2 and 4dpd).
To do this, I want to enter all samples together so I can normalize and filter the samples together. This will keep them on the same scale for comparison. However, I only want to run comparisons and generate p-values within each stage, with LRT. Is this possible?
From what I've read (Desq2 vignette, other posts), it seems I need to perform some combination of LRT and interaction terms. I have come up with the following commands:
dds <- DESeqDataSetFromMatrix(countData=Anox_countData_sample,colData=colData,design = ~ stage+treatment+stage:treatment)
dds$group <- factor(paste0(dds$stage,dds$treatment))
design(dds)<-~group
dds <-DESeq(dds, test="LRT", reduced = ~1, full= ~ group)
resultsNames(dds) # note: I have not been able to get to this part, so I don't know what options will come up
results(dds, contrast=c("group","D2A","D2B","D2C","D2D","D2E")) #not sure about this part, see above
Here are my questions:
1. Is this the correct approach for my question?
2. My understanding is that contrasts only pulls out the log2FC. Is there some other command I am missing to call for the p-value within the specific stage?
3. On controls: each stage has its own control (untreated). When I used the word "control" in coldata, I receive the following error:
> dds <-DESeqDataSetFromMatrix(countData=Anox_countData_sample,colData=colData,design = ~ stage+treatment+stage:treatment)
it appears that the last variable in the design formula, 'treatment',
has a factor level, 'control', which is not the reference level. we recommend
to use factor(...,levels=...) or relevel() to set this as the reference level
before proceeding. for more information, please see the 'Note on factor levels'
in vignette('DESeq2').
I don't actually want control set as the reference level. I just want a p-value for LRT (like an ANOVA) at each stage. Is the best way to do this to just change the treatments to letters? A-E?
4. I tried this on a subset of my data and received the following error:
Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc, :
every gene contains at least one zero, cannot compute log geometric means
I am not surprised that there is at least one zero in each gene, since this includes all stages and they are biologically distinct. How can I work with this?
5. I am starting with 10 GB of data, about 13 million sequences. If I run all 80 samples together, as I am suggesting, it seems appropriate to pre-filter. When should this be done?
I think the easiest thing would be to test the LRT on subsets of the data. You can still normalize and filter the data all together though.
dds <- estimateSizeFactors(dds) mnc <- rowMeans(counts(dds, normalized=TRUE)) dds <- dds[ mnc > threshold, ]
Within each stage, you have enough residual degrees of freedom to estimate the dispersions for each stage separately in my opinion (20 samples - 5 group means = 15 d.f.). This would look like:
dds.stage1 <- dds[ , dds$stage == "1" ] dds.stage1 <- DESeq(dds.stage1, test="LRT", full=~treatment, reduced=~1) res <- results(dds.stage1, independentFiltering=FALSE)
It is possible to perform LRT on the whole object, but this would require you to form the model matrix yourself outside of DESeq() and then manually remove columns associated with the treatment effects for a given stage that you want to test. The above approach is much simpler.
If you want to compare within a treatment across stage, build a new factor for the full dds object:
Then use results() and the contrast argument to compare different levels of 'group':