GOstats/Category hyperGTest gene universe problem
James F. Reid
Dear list,

when running two instances of a hyperGTest for KEGG pathways using different gene lists but identical gene universes I get two different gene universes in the results. This isn't the case for GO bp, mf, or cc but seems to be also for PFAM. Does anyone know why this is?

Thanks.
James.

Example and sessionInfo() below.

require("Category")
require("GOstats")
require("org.Hs.eg.db")

set.seed(321)

geneUniverse <- sample(mappedkeys(org.Hs.egSYMBOL), 2000)
geneList1 <- sample(geneUniverse, 200)
geneList2 <- sample(geneUniverse, 200)

## KEGG pathways
parsKEGG1 <- new("KEGGHyperGParams",
                 universeGeneIds = geneUniverse,
                 geneIds = geneList1,
                 annotation = "org.Hs.eg.db",
                 testDirection = "over",
                 pvalueCutoff = 0.01)
parsKEGG2 <- parsKEGG1
geneIds(parsKEGG2) <- geneList2

testKEGG1 <- hyperGTest(parsKEGG1)
testKEGG2 <- hyperGTest(parsKEGG2)
universeMappedCount(testKEGG1) == universeMappedCount(testKEGG2)
## [1] FALSE

universeMappedCount(testKEGG1)
## [1] 125
universeMappedCount(testKEGG2)
## [1] 137
sum(geneUniverse %in% mappedkeys(org.Hs.egPATH))
## [1] 213

## GO
parsGO1 <- new("GOHyperGParams",
               universeGeneIds = geneUniverse,
               geneIds = geneList1,
               annotation = "org.Hs.eg.db",
               conditional = TRUE,
               testDirection = "over",
               pvalueCutoff = 0.01)
parsGO2 <- parsGO1
geneIds(parsGO2) <- geneList2

## GO BP
ontology(parsGO1) <- ontology(parsGO2) <- "BP"
testGObp1 <- hyperGTest(parsGO1)
testGObp2 <- hyperGTest(parsGO2)
universeMappedCount(testGObp1) == universeMappedCount(testGObp2)
## [1] TRUE

## the same in CC and MF ontologies

sessionInfo()

R version 2.11.0 (2010-04-22)
x86_64-unknown-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  tools     methods
[8] base

other attached packages:
[1] org.Hs.eg.db_2.4.1   GOstats_2.14.0       RSQLite_0.9-0
[4] DBI_0.2-5            graph_1.26.0         Category_2.14.0
[7] AnnotationDbi_1.10.1 Biobase_2.8.0

loaded via a namespace (and not attached):
[1] annotate_1.26.0   genefilter_1.30.0 GO.db_2.4.1       GSEABase_1.10.0
[5] RBGL_1.24.0       splines_2.11.0    survival_2.35-8   XML_3.1-0
[9] xtable_1.5-6
Marc Carlson
Hi James,

I know that you already got an answer to this from me, but I want to post a response to the list as well. That was a pretty good question! In the end, everything appears to be working as expected.

Basically, the gene universe used in the calculation is restricted according to what is actually being tested. This is more conservative than simply assuming that everything you didn't test could also have been in the pool. Remember that with a hypergeometric test, a larger pool of things that "could have been picked" will artificially inflate the significance of your p-value (that is, make it smaller), so you want to use every possible opportunity to remove untested things from the calculation. Genes that belong only to untested categories are therefore removed from the universe in the final step. This is why the KEGG and PFAM universes are smaller than your estimate of 213.

PFAM and KEGG are sparsely annotated, so the odds are pretty good that entire categories end up untested for a given gene list, and the universe gets cut way down (typically to ~130). Because the set of tested categories depends on the gene list, your two runs end up with different universes (125 and 137 in your example).

For GO, the universe is likely to be larger (and less variable) because 1) more genes are labeled with GO terms and 2) GO annotations form a DAG (which GOstats takes into account). So it's not that the GO universe can never be made smaller in this way; it's just that it usually isn't.

Marc

On 05/11/2010 06:45 AM, James F. Reid wrote:
> Dear list,
>
> when running two instances of a hyperGTest for KEGG pathways using
> different gene lists but identical gene universes I get two different
> gene universes in the results. This isn't the case for GO bp, mf, or
> cc but seems to be also for PFAM. Does anyone know why this is?
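
To make the universe reduction described in the answer concrete, here is a minimal base R sketch on toy data. It is not the Category package's actual code: the annotation list `cat2genes`, the helper `effectiveUniverse()`, and the rule it applies (a category is tested only if it contains at least one selected gene, and the universe is the union of genes in tested categories) are assumptions made for illustration. The second part uses `phyper()` with hypothetical counts (5 hits, category size 20, 30 selected genes) to show how the same result looks more significant in a larger universe; only the sizes 130 and 213 are borrowed from the thread.

```r
## Toy annotation: category -> gene IDs (hypothetical data, not real KEGG).
cat2genes <- list(
  pathA = c("g1", "g2", "g3"),
  pathB = c("g3", "g4", "g5", "g6"),
  pathC = c("g7", "g8"),
  pathD = c("g9", "g10")
)

## All genes that carry at least one annotation (analogous to the 213 above).
annotatedUniverse <- unique(unlist(cat2genes, use.names = FALSE))

## Assumed reduction rule (illustration only): keep the categories hit by at
## least one selected gene, then take the union of their member genes.
effectiveUniverse <- function(selected, cat2genes) {
  tested <- Filter(function(genes) any(genes %in% selected), cat2genes)
  unique(unlist(tested, use.names = FALSE))
}

geneList1 <- c("g1", "g4")   # hits pathA and pathB
geneList2 <- c("g7", "g9")   # hits pathC and pathD

univ1 <- effectiveUniverse(geneList1, cat2genes)
univ2 <- effectiveUniverse(geneList2, cat2genes)

length(annotatedUniverse)  # 10
length(univ1)              # 6
length(univ2)              # 4

## Over-representation p-value: P(X >= x) for x hits in a category of size m,
## drawing k selected genes from a universe of size N.
overRepP <- function(x, m, N, k) {
  phyper(x - 1, m, N - m, k, lower.tail = FALSE)
}

## Hypothetical counts, held fixed while only the universe size changes.
pSmallUniverse <- overRepP(x = 5, m = 20, N = 130, k = 30)
pLargeUniverse <- overRepP(x = 5, m = 20, N = 213, k = 30)

## The larger universe yields the smaller (more "significant") p-value.
pSmallUniverse > pLargeUniverse
```

With the same annotation, the two gene lists produce effective universes of different sizes, which mirrors the 125 versus 137 seen for KEGG. In a real run the number of selected genes that map to the universe would also change with the universe, so the `phyper()` comparison is a simplification.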