In this reply, it is stated that Correlation Adjusted MEan RAnk gene set testing (CAMERA) is not intended for use with Gene Ontology (GO) gene sets due to their "redundancy and lack of directionality", though I could not find any other explicit mention of this. I imagine it is something to do with the inner workings of CAMERA, though the exact reason is unfortunately lost on me.
After restricting the GO sets to only those genes quantified in the experimental data and applying reasonable set size filters [10, G - 1], suppose the redundancy could be mostly addressed by using a hierarchical clustering approach similar to what is described in the MSigDB v7.0 Release Notes (sections 3.2 and 3.7). In that case, would the lack of directionality of the GO still be enough of a problem to warrant a different competitive test? Would `limma::geneSetTest` be more appropriate, despite not accounting for inter-gene correlation?
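For concreteness, the restriction and size filter described above could look something like the following. This is only a sketch in Python rather than R, and the function name `restrict_and_filter`, the toy gene IDs, and the reading of G as the number of quantified genes are all assumptions made for illustration.

```python
def restrict_and_filter(gene_sets, quantified, min_size=10):
    """Restrict each set to quantified genes, then keep sizes in [min_size, G - 1].

    gene_sets:  dict mapping a term name to an iterable of gene IDs (hypothetical input)
    quantified: iterable of gene IDs measured in the experiment
    """
    universe = set(quantified)
    # Assumption: G is the number of quantified genes, so G - 1 is the upper bound.
    max_size = len(universe) - 1
    restricted = {}
    for name, genes in gene_sets.items():
        kept = universe.intersection(genes)  # drop genes that were not quantified
        if min_size <= len(kept) <= max_size:
            restricted[name] = kept
    return restricted


# Toy usage with a small minimum size so the example stays readable.
quantified = ["g1", "g2", "g3", "g4", "g5"]
sets = {
    "term_a": ["g1", "g2", "x"],                 # loses the unquantified gene "x"
    "term_b": ["g1"],                            # too small after restriction
    "term_c": ["g1", "g2", "g3", "g4", "g5"],    # covers every quantified gene
}
kept_sets = restrict_and_filter(sets, quantified, min_size=2)
```

With these toy inputs only `term_a` survives: `term_b` falls below the minimum size and `term_c` contains all G genes, which leaves nothing outside the set to compare against in a competitive test.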
Note: I am aware that, in the RNA-Seq Analysis is Easy as 1, 2, 3 vignette and the Gene-expression data integration to squamous cell lung cancer subtypes reveals drug sensitivity publication, CAMERA is used in conjunction with sets from the C2 collection of the Molecular Signatures Database (MSigDB), where a number of terms (though perhaps not all) indicate directionality with "_UP" or "_DN" suffixes.
Thank you for the response! It was very helpful. Below are my thoughts as I was reading.
I hadn't realized that. I will review the GO curation process more thoroughly.
Yes, this is quite a headache to deal with. I've found that the hierarchical clustering procedure used for the MSigDB can remedy this to some degree.
I'm glad I was wrong! I read the CAMERA paper a couple times and had convinced myself I missed something crucial.
Thank you for the link to the MSigDB RData files. Much nicer than having to download the GMT files and convert them to a list every time.
Regarding the pruning done to the GO gene sets: I have observed that even if redundancy was reduced as with MSigDB, restricting the gene sets to only those genes quantified in the experiment always increases or introduces redundancy. In the worst case, it leads to aliasing where two or more sets contain the exact same elements, but they appear under different names. The hierarchical clustering can address aliasing and deal with the worst of the redundancies, though that still doesn't solve the lack of directionality of the GO sets. Also, gene sets may lose the majority of their genes during the restriction step, which makes it difficult to know if what remains is accurately described by the term name (related to the lack of specificity that you mentioned at the beginning).
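To illustrate the aliasing and redundancy problem described above, here is a rough Python sketch that groups sets with identical members after restriction and computes a pairwise overlap score. The helper names (`find_aliases`, `jaccard`, `fraction_retained`) are hypothetical, and using the Jaccard index as the similarity measure is an assumption on my part, not a statement about what any particular pruning procedure does.

```python
from collections import defaultdict


def find_aliases(gene_sets):
    """Return groups of term names whose (restricted) gene sets are identical."""
    groups = defaultdict(list)
    for name, genes in gene_sets.items():
        groups[frozenset(genes)].append(name)  # identical membership -> same key
    return [sorted(names) for names in groups.values() if len(names) > 1]


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|; defined here as 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fraction_retained(original, restricted):
    """Share of a term's original genes that survived the restriction step."""
    original = set(original)
    return len(original & set(restricted)) / len(original) if original else 0.0


# Toy example: two terms collapse to the same genes once restricted.
restricted_sets = {
    "term_a": {"g1", "g2"},
    "term_b": {"g2", "g1"},
    "term_c": {"g1", "g3"},
}
aliases = find_aliases(restricted_sets)
similarity = jaccard(restricted_sets["term_a"], restricted_sets["term_c"])
```

A pairwise similarity like this could feed a hierarchical clustering step, and `fraction_retained` gives a quick way to flag terms whose remaining genes may no longer match the term name.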
I will take a look at `kegga` as an alternative. Is this the associated publication? I will also send you the code I used for pruning in case you find a use for it. It is essentially what is used for the MSigDB GO terms, but quite a bit faster and more memory efficient.
Oops, I had meant to write `goana` instead of `kegga`. The first does GO analyses and the second does KEGG pathway analysis, although `kegga` can do both if you supply your own annotation files.
Yes, Young et al (Genome Biology, 2010) is the best publication for `goana`. The paper describes the ability of `goana`, `kegga` and `goseq` to correct for gene-length biases in the DE results. The basic functionality of `goana`, which tests for overlap of the DE list with annotation terms using hypergeometric tests, is however so well known that it doesn't warrant a publication.
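For readers less familiar with the overlap test mentioned above, a minimal Python sketch of a one-sided hypergeometric test follows. It is not limma's implementation and it ignores the gene-length correction entirely; the function name `overlap_pvalue` and its arguments are hypothetical.

```python
from math import comb


def overlap_pvalue(n_universe, n_in_term, n_de, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_universe: number of genes tested
    n_in_term:  number of those genes annotated to the term
    n_de:       number of differentially expressed (DE) genes
    n_overlap:  number of DE genes annotated to the term
    """
    largest = min(n_in_term, n_de)   # the overlap cannot exceed either count
    smallest = max(n_overlap, 0)     # guard against negative input
    total = comb(n_universe, n_de)   # all ways to draw the DE list
    tail = sum(
        comb(n_in_term, k) * comb(n_universe - n_in_term, n_de - k)
        for k in range(smallest, largest + 1)
    )
    return tail / total


# Toy numbers: 10 genes tested, 4 in the term, 3 DE, 2 of the DE genes in the term.
p = overlap_pvalue(10, 4, 3, 2)
```

For these toy numbers the tail sum is (6 * 6 + 4 * 1) / 120, so `p` is one third. A small value of `p` would suggest that the term is over-represented in the DE list relative to chance.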