Checking different results of hypergeometric tests using GOstats, I found that, surprisingly, using the same gene universe for 2 different gene enriched sets, there are different number of GO term sizes for the same GO term:
An example
Gene set enr: 1480
GOBPID | Pvalue | OddsRatio | ExpCount | Count | Size | Term |
---|---|---|---|---|---|---|
GO:0098609 | 0.0000012 | 2.2956084 | 38 | 64 | 149 | cell-cell adhesion |
Gene set enr: 68
GOBPID | Pvalue | OddsRatio | ExpCount | Count | Size | Term |
---|---|---|---|---|---|---|
GO:0098609 | 0.0005064 | 3.5353535 | 4 | 12 | 342 | cell-cell adhesion |
GENE UNIVERSE (same for both analyses): 5811
Why the "Size" (149 vs. 342) is different between both analyses?
Here it is the code:
GOmodes <- c("BP", "MF", "CC") param_list <- list() params <- NULL for (i in seq_along(GOmodes)){ params <- new("GOHyperGParams", geneIds=entrez_geneSet_1, universeGeneIds=entrezgeneUniverse, ontology=GOmodes[i], annotation="org.Mm.eg.db", pvalueCutoff=0.01, conditional=TRUE, testDirection="over") param_list[[i]] <- params } # run the test in a loop GO.results.bp_1 <- lapply(param_list, hyperGTest) GO.results.bp_1 sapply(GO.results.bp_1, htmlReport, file=file.path(out_dir, "GO.results_1.html"), append=T, digits=7)
param_list <- list() params <- NULL for (i in seq_along(GOmodes)){ params <- new("GOHyperGParams", geneIds=entrez_geneSet_2, universeGeneIds=entrezgeneUniverse, ontology=GOmodes[i], annotation="org.Mm.eg.db", pvalueCutoff=0.01, conditional=TRUE, testDirection="over") param_list[[i]] <- params } # run the test in a loop GO.results.bp_2 <- lapply(param_list, hyperGTest) GO.results.bp_2 sapply(GO.results.bp_2, htmlReport, file=file.path(out_dir, "GO.results_2.html"), append=T, digits=7)
When I check the results in the html file or check them by
summary(GO.results.bp_1) summary(GO.results.bp_2)
I find the results commented before.
sessionInfo():
R version 3.3.2 (2016-10-31) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.5 LTS locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=de_DE.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 grid parallel stats graphics grDevices utils datasets methods base other attached packages: [1] org.Mm.eg.db_3.4.0 GOstats_2.40.0 graph_1.52.0 Category_2.40.0 Matrix_1.2-8 [6] AnnotationDbi_1.36.2 IRanges_2.8.1 S4Vectors_0.12.1 pcaMethods_1.66.0 Biobase_2.34.0 [11] BiocGenerics_0.20.0 gplots_3.0.1 VennDiagram_1.6.17 futile.logger_1.4.3 biomaRt_2.30.0 [16] rMQanalysis_0.3.1.9001 RColorBrewer_1.1-2 ggrepel_0.6.5 ggplot2_2.2.1 reshape2_1.4.2 [21] plyr_1.8.4 loaded via a namespace (and not attached): [1] Rcpp_0.12.9 lattice_0.20-34 GO.db_3.4.0 tidyr_0.6.1 zoo_1.7-14 [6] gtools_3.5.0 assertthat_0.1 digest_0.6.12 R6_2.2.0 futile.options_1.0.0 [11] RSQLite_1.1-2 httr_1.2.1 lazyeval_0.2.0 data.table_1.10.4 annotate_1.52.1 [16] gdata_2.17.0 splines_3.3.2 stringr_1.1.0 htmlwidgets_0.8 RCurl_1.95-4.8 [21] munsell_0.4.3 base64enc_0.1-3 htmltools_0.3.5 tibble_1.2 XML_3.98-1.5 [26] AnnotationForge_1.16.1 viridisLite_0.1.3 dplyr_0.5.0 bitops_1.0-6 RBGL_1.50.0 [31] jsonlite_1.2 xtable_1.8-2 GSEABase_1.36.0 gtable_0.2.0 DBI_0.5-1 [36] magrittr_1.5 scales_0.4.1 KernSmooth_2.23-15 stringi_1.1.2 genefilter_1.56.0 [41] sp_1.2-4 splitstackshape_1.4.2 lambda.r_1.1.9 tools_3.3.2 logspline_2.1.9 [46] purrr_0.2.2 survival_2.40-1 colorspace_1.3-2 caTools_1.17.1 memoise_1.0.0 [51] plotly_4.5.6
Thank you for the good explanation James! Before it was not clear to me how the 'conditional' test works, now it is crystal clear.