I've been exploring Limma's excellent gene set methods (many thanks to the developers), and I think that romer() is probably what I want to use.
I'm only just getting to grips with all the methods and their differences. I suspect that romer() doesn't provide multiple-testing adjustment parameters because adjustment doesn't make sense in a competitive context. Am I right, or should I be making the adjustment myself?
Many thanks for any pointers,
Jon
I'm curious, is there a reason you set the inter-gene correlation manually rather than letting camera determine it from the data?
This is a new idea. We have long observed that gene sets that correspond closely to an important biological pathway tend to have high inter-gene correlations. We have also observed that geneSetTest() tends to rank gene sets well, although it doesn't control the error rate correctly. camera() controls the error rate but is conservative for small sample sizes. The idea is to avoid penalizing the tightly co-regulated sets as much as camera would normally, to avoid rewarding a set for having discordant gene behaviour (negative inter-gene correlation), and to gain power. The preset correlation gives a compromise between geneSetTest() and camera() with estimated correlations. Not estimating the correlation results in a great increase in power.
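To put rough numbers on that compromise, here is a minimal sketch of the variance inflation factor described in the camera paper, 1 + (m - 1) * rho, where m is the number of genes in the set and rho is the average inter-gene correlation. It is written in Python purely as arithmetic; it is not limma's code, and the function name and the set size of 50 are made up for illustration. A correlation of 0 corresponds to the independence assumed by geneSetTest(), 0.01 and 0.05 are the preset values discussed in this thread, and the larger or negative values stand in for estimated correlations.

```python
def variance_inflation(set_size, inter_gene_cor):
    """Variance inflation factor for the mean statistic of a gene set.

    Hypothetical helper: VIF = 1 + (m - 1) * rho, with m genes in the set
    and rho the (preset or estimated) average inter-gene correlation.
    """
    return 1.0 + (set_size - 1) * inter_gene_cor


if __name__ == "__main__":
    set_size = 50  # illustrative set size, not taken from the thread
    # 0.0 mimics an independence assumption; 0.01 and 0.05 are presets;
    # 0.3 is a strongly co-regulated set; -0.01 is a discordant set.
    for rho in (0.0, 0.01, 0.05, 0.3, -0.01):
        vif = variance_inflation(set_size, rho)
        print(f"rho = {rho:5.2f}  VIF = {vif:.2f}")
```

A factor above 1 penalizes the set and a factor below 1 rewards it. Under these assumptions a small positive preset applies a mild, uniform penalty, whereas an estimated correlation of 0.3 would inflate the variance more than tenfold for a set of this size, and a slightly negative estimate would push the factor below 1.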
Although this thread is a few months old, I was hoping to get some more information about this approach (using camera with a preset inter-gene correlation value). Gordon, can you clarify how one arrives at the 0.05 value for the inter.gene.cor setting? Is this value something that should be set based on a particular experiment or gene set collection? Or does 0.05 generally perform appropriately?
More specifically, would this approach be appropriate for a collection of gene sets that are co-regulated by definition (the sets were generated as lists of co-regulated genes determined from prior experimental data)? Not surprisingly, camera tends to strongly penalize these sets (which typically show inter-gene correlation values ranging from 0.1 to 0.5 in our data), reporting fairly high p-values even for sets that appear to be differentially expressed in our experiments.
Thank you very much for any information you can offer.
I simply chose 0.05 by experimentation and intuition, and note that I am now suggesting 0.01 rather than 0.05. The resulting camera test will not control the type I error rate in the strict sense that camera() does, but the ranking of the sets should be good and the level of liberalness is perhaps acceptable.
My intention is to use the same 0.01 value for all the sensible sets regardless of the level of co-regulation. Choosing genes because they are correlated, rather than just co-differentially-expressed in a prior experiment, would be pushing it however.
Thanks for the prompt and helpful response.
To clarify what I meant by using this approach for "pre-determined" co-regulated genes, I was referring to sets constructed similarly to those in MSigDB C4-CM: Cancer Modules. As I understand it, these sets are composed of genes (from existing sets) that were identified based on similar expression patterns in an integrated analysis of many publicly available datasets. As such, I would expect these sets to have high inter-gene correlation values and therefore be strongly penalized by camera. However, knowing that these sets are composed of co-regulated genes, using a fixed inter.gene.cor might be problematic, as you mentioned. Am I thinking about this correctly?
Thank you again for your help.
The camera() method was designed, with or without preset correlation, with the MSigDB curated sets specifically in mind.
Keeping the correlation preset may seem counterintuitive, but not penalizing sets when they really are co-regulated is the whole purpose of the preset method. It is not actually true that all the MSigDB sets will be co-regulated in any specific study. Gene sets that correspond to pathways that are either not expressed or not changing in your samples will not show strong correlation. Observing a high inter-gene correlation for a set is a sign that the pathway is specifically active in your experiment and is varying between samples, and hence that it is likely to be biologically relevant to your study. Hence it is specifically these sets that I want to avoid penalizing.
The original camera method specifically penalizes those sets that are most likely to be biologically relevant, because the genes are co-regulated between samples. This is unavoidable if strict error rate control is to be achieved but it is biologically unfortunate. Hence the preset method is designed to gain biological relevance at some manageable cost to error rate control.
Thank you for the informative reply. I was confused by the inconsistency of baked-in multiple testing corrections across these methods and thought there might be a reason, but I'm grateful for the clarification.
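For anyone who lands here and decides to adjust a vector of gene set p-values themselves, the Benjamini-Hochberg adjustment is short enough to sketch. The Python below illustrates the arithmetic only, using made-up p-values and a hypothetical function name; it is not limma code. In R, the base function p.adjust() with method = "BH" performs this adjustment.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Hypothetical helper for illustration. Each p-value is scaled by
    m / rank (rank 1 = smallest), then a running minimum taken from the
    largest p-value downwards keeps the adjusted values monotone.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Made-up p-values for four gene sets, purely for illustration.
    example = [0.01, 0.04, 0.03, 0.005]
    for p, q in zip(example, bh_adjust(example)):
        print(f"p = {p:.3f}  BH-adjusted = {q:.3f}")
```

The adjusted values depend on how many sets were tested, so the adjustment should be applied to the full collection of sets rather than to a hand-picked subset.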