Question

EdgeR -ANOVA-like test vs testing individual contrasts

0

Entering edit mode

Maiv • 0

@maiv-13337

Last seen 6.2 years ago

Norway

Hi, I'm struggling with finalizing some of my analysis in EdgeR because I'm not sure which is the best approach. I've read many of the posts on similar issues, but still it is not clear to me what would be an optimal approach for me. I'm new to this sort of analysis, and might have missed some clues in the user's guide or other posts.

I have a large RNAseq count data set sampled from 168 individual fish. The fish were sampled over three timepoints (T4, T5, and T6). The fish were divided in three different treatment groups (LL, SP, SPLL, representing different light regimes), and for each sampling time they were sampled from one of two conditions within their treatment (freshwater, FW, or saltwater, SW). Fish were sampled in groups of six.

There's a total of 18 sample groups (T4.LL.FW, T4.LL.SW, T4.SP.FW, T4.SP.SW, etc.)

I want to identify genes that are differentially expressed between the different light treatments, at each timepoint, utilizing only samples taken from fish in freshwater. Potentially, I would also like to know which genes are differentially expressed between the different timepoints within each of the light treatments.

Then I would like find which genes are differentially expressed between samples taken from FW and SW for each timepoint and treatment.

As I understand it, I can choose input the full RNAseq dataset, and run all contrasts of interest individually, similar to doing multiple t-tests. I would have individual lists of DEG's for each contrast which could be checked against each other for unique and overlapping DEG's, providing the opportunity to make f.ex. a Venn diagram "time-series", showing number of unique and shared DEG's between treatments at each timepoint. However, then risk of making a type I error increases with the number of individual tests run. Because I would need to run many tests, I'm sceptical of this approach.

Or, I can choose to use the ANOVA-like approach, running several contrasts together in one analysis. This is what I have now done, running 4 ANOVA-like tests, providing only the relevant subset of the count data as input for each test (Is that ok? Input was filtered by cpm, and minimum number of smaples in each group). However, I feel is not providing the full image of how DEG's shift between treatments and over time.

For light regulated genes, I'm only interested in samples from FW, so the SW samples were excluded from the input (so only 9 groups). My model is 0+group, so no intercept is used. The following contrasts were made for all three timepoints, T4.LL.FW-T4.SP.FW, T4.SP.FW-T4.SPLL.FW, T4.LL.FW-T4.SPLL.FW (a total of 9 contrasts), in object cont and tested together in the glmQLFTest.

ql <- glmQLFTest(fit, contrast = cont)

Am I correct in understanding that this will also compare between f.ex. contrasts T4.SP.FW-T4.SPLL.FW and T6.LL.FW-T6.SP.FW? But not between f.ex. T4.LL.FW-T5.LL.FW, which will be a separate contrast which I would need to specify for it to be included?

For salt regulated genes three ANOVA-like tests were run, one for each treatment, only inputting RNAseq data specific for those treatments (only LL treated samples when running LL-analysis ect.). Each ANOVA-like test consisted of three contrasts, like this: T4.LL.FW-T4.LL.SW, T5.LL.FW-T5.LL.SW, T6.LL.FW-T6-LL.SW. Model and test remained the same as in ANOVA-like test for light regulated genes. This gives the opportunity to compare between treatments over all timepoints, but not at each timepoint.

Main questions:

1. Is my approach to the ANOVA-like tests ok, or is it messy? I'm somewhat concerned of what it means to change the input used between tests, and not being consistent in just inputting all groups ( I checked, and it provides slightly different number of DEG's, so has some impact).

2. Would running individual contrast be preferable compared to the ANOVA-like appraoch?

I'm really hoping for some good advice here,

Thanks!

EdgeR ANOVA t-test contrast • 2.6k views

ADD COMMENT • link updated 7.8 years ago by Aaron Lun ★ 28k • written 7.8 years ago by Maiv • 0

score 2 · Accepted Answer · 2017-06-26

The pairwise contrasts and the ANODEV test different null hypotheses and should be used accordingly:

I want to identify genes that are differentially expressed between the different light treatments, at each timepoint, utilizing only samples taken from fish in freshwater.

Do an ANODEV for each time point, comparing the three treatments (for freshwater fish). Here, the null hypothesis for each gene is that there is no DE between treatments at a given time point. Thus, any rejections will correspond to genes that are DE between treatments, which is what you want. For example:

con <- makeContrasts(T4.LL.FW - T4.SP.FW, 
                     T4.LL.FW - T4.SPLL.FW, levels=design)

... will test the null hypothesis that LL=SP=SPLL for time point 4 in freshwater fish. You can think of makeContrasts as constructing a null hypothesis for the ANODEV based on a series of AND statements, i.e., T4.LL.FW - T4.SP.FW = 0 AND T4.LL.FW - T4.SPLL.FW = 0. Obviously this means you don't need to explicitly specify T4.SPLL.FW - T4.SP.FW = 0.

Potentially, I would also like to know which genes are differentially expressed between the different timepoints within each of the light treatments.

Do an ANODEV for each treatment, comparing the four time points. Here, the null hypothesis for each gene is that there is no DE between time points for a given treatment. In this case, the makeContrasts call would involve three expressions.

Then I would like find which genes are differentially expressed between samples taken from FW and SW for each timepoint and treatment.

Do pairwise comparisons between the fresh and saltwater groups for each combination of time point and treatment. Here, the null hypothesis is that there is no difference between the fresh and saltwater groups at that particular time/treatment combination.

Easy. I should stress that you don't need to subset the data at any point, which defeats the purpose of having many samples to help you estimate the variance. Just set up the contrast matrix to compare only the relevant groups for a given contrast, and glmQLFTest will handle the rest. If you want to control the error rate, just pool all p-values together and apply the BH method to the pool.