Question

Time-course group difference analysis with repeated individuals using limma-voom

0

Entering edit mode

Leonardo Collado Torres ★ 1.1k

@lcolladotor

Last seen 5 weeks ago

United States

Hi,

From the limmaUsersGuide() section "9.7 Multi-level Experiments" I learned about using duplicateCorrelation() and passing that information to lmFit() to handle observations from the same individual (using block and correlation arguments). Similarly, from "9.6.2 Many time points" I learned about how to multiply splines with a group variable and extract the F-statistic showing the differences between groups across time using topTable(coef). Those sections don't use voom() as they were likely written before voom() existed.

In any case, I'm analyzing a data set with 2 regions and individuals spanning many age groups. Some individuals yielded data for both regions, some did not. To explore the correlation by individual, I ran limma-voom with and without using duplicateCorrelation(). The p-values and adjusted-p-values are smaller for the model that takes the individual correlation into account, but are fairly similar. However, a subset of the F-statistics are very different.

First of all, I was surprised by the range of some of the F-statistics (over 3 in log10). Second, this plot doesn't agree with how similar the p-values look like. How come the 2 models yield similar p-values for genes with widely different F-statistics? (the degrees of freedom should be the same)

I then explored the coefficients of interest (the interaction between the 2nd region and the time linear splines) and they look pretty different (see last pages of the pdf in the gist).

Assuming that I used the functions correctly, do you expect the F-statistics to be so different for a few of genes? What about the coefficients? I know that lmFit() changes between lm.series and gls.series for the 2 different ways I used it, but I expected the coefficients to look mostly similar given that the consensus correlation is about 0.24 (much lower than the default of 0.75). Or could the lm.series vs gls.series explain the difference here? ***At this point I realize that I didn't use topTable() correctly***

Best,

Leonardo

Code used (shows parts of the design matrix and session info):

limma • 1.0k views

ADD COMMENT • link updated 6.9 years ago by Aaron Lun ★ 28k • written 6.9 years ago by Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

Hi,

While I did figure out my main problem, I decided it would still be worth posting just as an example of how completing all the Bioconductor posting guide steps can help users find their own problem.

In my case, when I was writing the last part of my question, I realized that maybe I wasn't comparing the correct genes between the 2 models. It turns out that I was missing sort.by = 'none' in my topTable() calls. The fixed plots are available at .

## For the MA like plot, although this table is no longer useful
#                less than 3 on x
# close to 0 on Y FALSE  TRUE   Sum
#           FALSE   640  9165  9805
#           TRUE   1702 13145 14847
#           Sum    2342 22310 24652

Now most of the plots makes sense. The p-values (and adjusted-p-values) are no longer so similar as they once were: from the topTable() docs I thought that topTable() sorted the genes by F in this case, not p-value (they are p-value sorted, which explains the previous plots). I'm still surprised by the range of the F-statistics though.

In general though, I'm still unsure how to determine whether to go ahead with the lmFit() results from using or not using duplicateCorrelation(). How would you choose which model to use? I expected the results from using the correlation to lead to less DE calls. Using it might be too time consuming for expression features more numerous than genes (about 24k here).

## In general: DE signal among models

> addmargins(table('Using corr' = top$adj.P.Val < 0.05, 'Naive' = top2$adj.P.Val < 0.05)) Naive Using corr FALSE TRUE Sum FALSE 2882 361 3243 TRUE 919 20490 21409 Sum 3801 20851 24652

Best,

Leonardo

ADD REPLY • link 6.9 years ago Leonardo Collado Torres ★ 1.1k

score 1 · Accepted Answer · 2017-11-22

Regarding the choice between models: reading your comment indicates that the models share 20490 DE genes, with only 361 and 919 DE genes unique to each model. This suggests that, relatively speaking, there's not much difference between the models to worry about. (It is quite a large number of DE genes, which may make it difficult to interpret the results. I would have suggested using treat but that's not really possible for F-tests.)

From a theoretical perspective, I would go with duplicateCorrelation here - a consensus correlation of 0.24 suggests that the individual of origin does have some effect, as one might expect, and it would be best to account for that. That said, the similarities between the two models means that you could set up and test your analysis using the naive approach, then switch over to duplicateCorrelation and let it run overnight.