Hi, I'm struggling with finalizing some of my analysis in EdgeR because I'm not sure which is the best approach. I've read many of the posts on similar issues, but still it is not clear to me what would be an optimal approach for me. I'm new to this sort of analysis, and might have missed some clues in the user's guide or other posts.
I have a large RNAseq count data set sampled from 168 individual fish. The fish were sampled over three timepoints (T4, T5, and T6). The fish were divided in three different treatment groups (LL, SP, SPLL, representing different light regimes), and for each sampling time they were sampled from one of two conditions within their treatment (freshwater, FW, or saltwater, SW). Fish were sampled in groups of six.
There's a total of 18 sample groups (T4.LL.FW, T4.LL.SW, T4.SP.FW, T4.SP.SW, etc.)
I want to identify genes that are differentially expressed between the different light treatments, at each timepoint, utilizing only samples taken from fish in freshwater. Potentially, I would also like to know which genes are differentially expressed between the different timepoints within each of the light treatments.
Then I would like find which genes are differentially expressed between samples taken from FW and SW for each timepoint and treatment.
As I understand it, I can choose input the full RNAseq dataset, and run all contrasts of interest individually, similar to doing multiple t-tests. I would have individual lists of DEG's for each contrast which could be checked against each other for unique and overlapping DEG's, providing the opportunity to make f.ex. a Venn diagram "time-series", showing number of unique and shared DEG's between treatments at each timepoint. However, then risk of making a type I error increases with the number of individual tests run. Because I would need to run many tests, I'm sceptical of this approach.
Or, I can choose to use the ANOVA-like approach, running several contrasts together in one analysis. This is what I have now done, running 4 ANOVA-like tests, providing only the relevant subset of the count data as input for each test (Is that ok? Input was filtered by cpm, and minimum number of smaples in each group). However, I feel is not providing the full image of how DEG's shift between treatments and over time.
For light regulated genes, I'm only interested in samples from FW, so the SW samples were excluded from the input (so only 9 groups). My model is 0+group, so no intercept is used. The following contrasts were made for all three timepoints, T4.LL.FW-T4.SP.FW, T4.SP.FW-T4.SPLL.FW, T4.LL.FW-T4.SPLL.FW (a total of 9 contrasts), in object cont and tested together in the glmQLFTest.
ql <- glmQLFTest(fit, contrast = cont)
Am I correct in understanding that this will also compare between f.ex. contrasts T4.SP.FW-T4.SPLL.FW and T6.LL.FW-T6.SP.FW? But not between f.ex. T4.LL.FW-T5.LL.FW, which will be a separate contrast which I would need to specify for it to be included?
For salt regulated genes three ANOVA-like tests were run, one for each treatment, only inputting RNAseq data specific for those treatments (only LL treated samples when running LL-analysis ect.). Each ANOVA-like test consisted of three contrasts, like this: T4.LL.FW-T4.LL.SW, T5.LL.FW-T5.LL.SW, T6.LL.FW-T6-LL.SW. Model and test remained the same as in ANOVA-like test for light regulated genes. This gives the opportunity to compare between treatments over all timepoints, but not at each timepoint.
Main questions:
1. Is my approach to the ANOVA-like tests ok, or is it messy? I'm somewhat concerned of what it means to change the input used between tests, and not being consistent in just inputting all groups ( I checked, and it provides slightly different number of DEG's, so has some impact).
2. Would running individual contrast be preferable compared to the ANOVA-like appraoch?
I'm really hoping for some good advice here,
Thanks!
Great, thank you for clarifying for me how to think about the tests, appreciate it!