Hi,
From the limmaUsersGuide()
section "9.7 Multi-level Experiments" I learned about using duplicateCorrelation()
and passing that information to lmFit()
to handle observations from the same individual (using block
and correlation
arguments). Similarly, from "9.6.2 Many time points" I learned about how to multiply splines with a group variable and extract the F-statistic showing the differences between groups across time using topTable(coef)
. Those sections don't use voom()
as they were likely written before voom()
existed.
In any case, I'm analyzing a data set with 2 regions and individuals spanning many age groups. Some individuals yielded data for both regions, some did not. To explore the correlation by individual, I ran limma-voom with and without using duplicateCorrelation()
. The p-values and adjusted-p-values are smaller for the model that takes the individual correlation into account, but are fairly similar. However, a subset of the F-statistics are very different.
First of all, I was surprised by the range of some of the F-statistics (over 3 in log10). Second, this plot doesn't agree with how similar the p-values look like. How come the 2 models yield similar p-values for genes with widely different F-statistics? (the degrees of freedom should be the same)
I then explored the coefficients of interest (the interaction between the 2nd region and the time linear splines) and they look pretty different (see last pages of the pdf in the gist).
Assuming that I used the functions correctly, do you expect the F-statistics to be so different for a few of genes? What about the coefficients? I know that lmFit()
changes between lm.series and gls.series for the 2 different ways I used it, but I expected the coefficients to look mostly similar given that the consensus correlation is about 0.24 (much lower than the default of 0.75). Or could the lm.series vs gls.series explain the difference here? ***At this point I realize that I didn't use topTable() correctly***
Best,
Leonardo
Code used (shows parts of the design matrix and session info):
Hi,
While I did figure out my main problem, I decided it would still be worth posting just as an example of how completing all the Bioconductor posting guide steps can help users find their own problem.
In my case, when I was writing the last part of my question, I realized that maybe I wasn't comparing the correct genes between the 2 models. It turns out that I was missing
sort.by = 'none'
in mytopTable()
calls. The fixed plots are available at .Now most of the plots makes sense. The p-values (and adjusted-p-values) are no longer so similar as they once were: from the topTable() docs I thought that topTable() sorted the genes by F in this case, not p-value (they are p-value sorted, which explains the previous plots). I'm still surprised by the range of the F-statistics though.
In general though, I'm still unsure how to determine whether to go ahead with the
lmFit()
results from using or not usingduplicateCorrelation()
. How would you choose which model to use? I expected the results from using the correlation to lead to less DE calls. Using it might be too time consuming for expression features more numerous than genes (about 24k here).## In general: DE signal among models
> addmargins(table('Using corr' = top$adj.P.Val < 0.05, 'Naive' = top2$adj.P.Val < 0.05))
Naive
Using corr FALSE TRUE Sum
FALSE 2882 361 3243
TRUE 919 20490 21409
Sum 3801 20851 24652
Best,
Leonardo