CAMERA + Gene Ontology Gene Sets
@df6c68e9
United States

In this reply, it is stated that Correlation Adjusted MEan RAnk gene set testing (CAMERA) is not intended for use with Gene Ontology (GO) gene sets due to their "redundancy and lack of directionality", though I could not find any other explicit mention of this. I imagine it is something to do with the inner workings of CAMERA, though the exact reason is unfortunately lost on me.

After restricting the GO sets to only those genes quantified in the experimental data and applying reasonable set size filters [10, G - 1], suppose the redundancy could be mostly addressed by using a hierarchical clustering approach similar to what is described in the MSigDB v7.0 Release Notes (sections 3.2 and 3.7). In that case, would the lack of directionality of the GO still be enough of a problem to warrant a different competitive test? Would limma::geneSetTest be more appropriate, despite not accounting for inter-gene correlation?
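As a concrete illustration of the restriction and size-filter step described above, here is a minimal sketch in plain Python. The function name `restrict_and_filter`, the term names, and the gene identifiers are all hypothetical; the only things taken from the post are the restriction to quantified genes and the size bounds [10, G - 1], where G is the number of quantified genes.

```python
def restrict_and_filter(gene_sets, quantified, min_size=10):
    """Restrict each set to quantified genes, then keep sets whose
    restricted size lies in [min_size, G - 1]."""
    universe = set(quantified)
    # G - 1: a set covering every quantified gene has no genes left to compete against.
    max_size = len(universe) - 1
    kept = {}
    for name, genes in gene_sets.items():
        restricted = universe.intersection(genes)
        if min_size <= len(restricted) <= max_size:
            kept[name] = restricted
    return kept


# Toy data: hypothetical term names and gene identifiers.
quantified = [f"g{i}" for i in range(1, 31)]       # G = 30 genes measured
gene_sets = {
    "TERM_A": [f"g{i}" for i in range(1, 16)],     # 15 genes, all quantified
    "TERM_B": [f"g{i}" for i in range(25, 60)],    # 35 genes, only 6 quantified
    "TERM_C": [f"g{i}" for i in range(1, 31)],     # covers the whole universe
}
filtered = restrict_and_filter(gene_sets, quantified)
print(sorted(filtered))  # ['TERM_A']
```

TERM_B is dropped because only 6 of its genes survive the restriction, and TERM_C is dropped because it contains all G quantified genes.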

Note: I am aware that, in the RNA-Seq Analysis is Easy as 1, 2, 3 vignette and the Gene-expression data integration to squamous cell lung cancer subtypes reveals drug sensitivity publication, CAMERA is used in conjunction with sets from the C2 collection of the Molecular Signatures Database (MSigDB), where a number of terms (though perhaps not all) indicate directionality with "_UP" or "_DN" suffixes.

GO limma CAMERA
@gordon-smyth
WEHI, Melbourne, Australia

GO is a very large collection of gene annotation terms. It doesn't work terribly well with competitive gene set tests because it contains a lot of very broad terms and because the GO terms are mostly non-directional. The GO term for a particular biological process will typically contain all genes loosely associated with that process, including inhibitors as well as promoters of the process. So a GO term might correspond to a highly relevant biological process but still not be strongly up-regulated or down-regulated in the DE results.
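A toy calculation can make the directionality point concrete. The sketch below is not CAMERA's test: it only compares the mean statistic inside a set with the mean outside it, which is the basic idea behind a directional competitive test. The t-statistics, gene names, and the helper `mean_shift` are invented for illustration.

```python
from statistics import mean


def mean_shift(stats, gene_set):
    """Mean statistic of genes in the set minus mean statistic of all other genes."""
    in_set = [t for g, t in stats.items() if g in gene_set]
    out_set = [t for g, t in stats.items() if g not in gene_set]
    return mean(in_set) - mean(out_set)


# Invented gene-wise t-statistics: 20 unchanged genes, 5 up, 5 down.
stats = {f"bg{i}": 0.0 for i in range(20)}
stats.update({f"up{i}": 3.0 for i in range(5)})
stats.update({f"dn{i}": -3.0 for i in range(5)})

# A directional set (e.g. an "_UP" signature) contains only up-regulated genes.
directional = {f"up{i}" for i in range(5)}
# A GO-like set mixes promoters and inhibitors, so ups and downs cancel.
mixed = {"up0", "up1", "dn0", "dn1"}

print(mean_shift(stats, directional))  # clearly positive
print(mean_shift(stats, mixed))        # 0.0
```

Every gene in the mixed set is strongly differentially expressed, yet its mean shift is zero, so a directional test would see nothing.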

The GO collection is also hierarchical, with all GO terms being subsets of their parent sets.

There is no mathematical or statistical reason why you can't run CAMERA on GO gene sets, but the nature of the GO collection means that (i) statistical power might be reduced and (ii) the interpretation of the results might not be clear. These are scientific issues rather than mathematical issues and they affect all competitive gene tests such as CAMERA, geneSetTest, GSEA etc. It is not a hidden issue to do with the inner workings of CAMERA!

The MSigDB collection prunes the GO term collection to remove gene sets that are too large or too small and to reduce redundancy. If you want to run CAMERA on GO terms, then it would indeed be a good idea to use the curated GO collection from the MSigDB. The limma team provides the MSigDB collection in an R-friendly format ready to be input into limma and CAMERA, see:

Beware though that our mouse version of MSigDB recreates the GO gene sets from scratch using mouse annotation without the same sort of set pruning as done by the Broad Institute. Maybe we should revisit this, but I thought it better to allow users to do their own pruning.

Alternatively, the set size filter that you mention would also be a big help and would substantially mitigate the problems.

In my own work, I have tended to test GO terms using overlap tests, i.e., using goana and kegga rather than camera, largely because the simple overlap tests aren't affected by the GO term redundancies. Both approaches have their advantages. If we went to the trouble of pruning the GO gene set collection for mouse, then it might well make sense for us to use CAMERA more often with GO.


Thank you for the response! It was very helpful. Below are my thoughts as I was reading.

"The GO term for a particular biological process will typically contain all genes loosely associated with that process, whether they are promoters or inhibitors."

I hadn't realized that. I will review the GO curation process more thoroughly.

"The GO collection is also hierarchical, with all GO terms being subsets of their parent sets."

Yes, this is quite a headache to deal with. I've found that the hierarchical clustering procedure used for the MSigDB can remedy this to some degree.

"There is no mathematical or statistical reason why you can't run CAMERA on GO gene sets... It is not a hidden issue to do with the inner workings of CAMERA!"

I'm glad I was wrong! I read the CAMERA paper a couple times and had convinced myself I missed something crucial.

Thank you for the link to the MSigDB RData files. Much nicer than having to download the GMT files and convert them to a list every time.

Regarding the pruning done to the GO gene sets: I have observed that even if redundancy was reduced as with MSigDB, restricting the gene sets to only those genes quantified in the experiment always increases or introduces redundancy. In the worst case, it leads to aliasing where two or more sets contain the exact same elements, but they appear under different names. The hierarchical clustering can address aliasing and deal with the worst of the redundancies, though that still doesn't solve the lack of directionality of the GO sets. Also, gene sets may lose the majority of their genes during the restriction step, which makes it difficult to know if what remains is accurately described by the term name (related to the lack of specificity that you mentioned at the beginning).
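The aliasing effect is easy to reproduce on toy data. The sketch below is not the MSigDB clustering procedure; it only shows how two distinct sets can collapse to identical membership after restriction, using an exact-duplicate check and the Jaccard index as one common overlap measure. The helper names, term names, and gene identifiers are hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_aliases(gene_sets):
    """Group set names whose members are exactly identical."""
    groups = defaultdict(list)
    for name, genes in gene_sets.items():
        groups[frozenset(genes)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Toy annotation: TERM_X and TERM_Y differ only in genes that were not quantified.
full_sets = {
    "TERM_X": {"g1", "g2", "g3", "g9"},
    "TERM_Y": {"g1", "g2", "g3", "g10", "g11"},
    "TERM_Z": {"g3", "g4"},
}
quantified = {"g1", "g2", "g3", "g4"}
restricted = {name: genes & quantified for name, genes in full_sets.items()}

print(jaccard(full_sets["TERM_X"], full_sets["TERM_Y"]))    # 0.5
print(jaccard(restricted["TERM_X"], restricted["TERM_Y"]))  # 1.0
print(find_aliases(restricted))                             # [['TERM_X', 'TERM_Y']]
```

Before restriction the two terms share half of their combined genes; afterwards they are the same set under two names, so one of them would need to be dropped or the pair merged.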

I will take a look at kegga as an alternative. Is this the associated publication? I will also send you the code I used for pruning in case you find a use for it. It is essentially what is used for the MSigDB GO terms, but quite a bit faster and more memory efficient.


Oops, I had meant to write goana instead of kegga. goana does GO analyses and kegga does KEGG pathway analyses, although kegga can do both if you supply your own annotation files.

Yes, Young et al (Genome Biology, 2010) is the best publication for goana. The paper describes the ability of goana, kegga and goseq to correct for gene-length biases in the DE results. The basic functionality of goana, which tests for overlap of the DE list with annotation terms using hypergeometric tests, is however so well known that it doesn't warrant a publication.
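For readers unfamiliar with the overlap test, the basic hypergeometric calculation can be sketched in a few lines of plain Python. This is a textbook upper-tail probability, not goana's actual implementation, and it leaves out the gene-length bias correction mentioned above. The function name `overlap_p_value` and the example counts are made up.

```python
from math import comb


def overlap_p_value(n_universe, n_in_set, n_de, n_overlap):
    """P(X >= n_overlap) where X is the number of DE genes falling in the set,
    assuming the n_de DE genes are drawn at random from n_universe genes."""
    total = comb(n_universe, n_de)
    upper = min(n_in_set, n_de)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Made-up counts: 20 genes tested, 5 annotated to the term, 5 called DE.
print(overlap_p_value(20, 5, 5, 5))  # all 5 DE genes in the term: 1 / C(20, 5)
print(overlap_p_value(20, 5, 5, 0))  # no overlap required: 1.0
```

Because the test only counts how many DE genes fall in a term, it does not matter whether those genes went up or down, which is part of why overlap tests sit more comfortably with non-directional GO terms.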
