Why do I get different GO term sizes using the same gene universe for 2 analyses with GOstats?
1
0
Entering edit mode
tlaguna ▴ 10
@tlaguna-12370
Last seen 7.8 years ago
IMB Mainz

Checking different results of hypergeometric tests using GOstats, I found that, surprisingly, using the same gene universe for 2 different gene enriched sets, there are different number of GO term sizes for the same GO term:

 

An example

Gene set enr: 1480

GOBPID Pvalue OddsRatio ExpCount Count Size Term
GO:0098609 0.0000012 2.2956084 38 64 149 cell-cell adhesion

 

Gene set enr: 68

GOBPID Pvalue OddsRatio ExpCount Count Size Term
GO:0098609 0.0005064 3.5353535 4 12 342 cell-cell adhesion

 

GENE UNIVERSE (same for both analyses): 5811

Why the "Size" (149 vs. 342) is different between both analyses?

 

Here it is the code:

GOmodes <- c("BP", "MF", "CC")
param_list <- list()
params <- NULL
for (i in seq_along(GOmodes)){
  params <- new("GOHyperGParams", geneIds=entrez_geneSet_1, universeGeneIds=entrezgeneUniverse,
                ontology=GOmodes[i], annotation="org.Mm.eg.db", pvalueCutoff=0.01, conditional=TRUE,
                testDirection="over")
  param_list[[i]] <- params
}

# run the test in a loop
GO.results.bp_1 <- lapply(param_list, hyperGTest)
GO.results.bp_1  

sapply(GO.results.bp_1, htmlReport, file=file.path(out_dir, "GO.results_1.html"),
       append=T, digits=7)

param_list <- list()
params <- NULL
for (i in seq_along(GOmodes)){
  params <- new("GOHyperGParams", geneIds=entrez_geneSet_2, universeGeneIds=entrezgeneUniverse,
                ontology=GOmodes[i], annotation="org.Mm.eg.db", pvalueCutoff=0.01, conditional=TRUE,
                testDirection="over")
  param_list[[i]] <- params
}

# run the test in a loop
GO.results.bp_2 <- lapply(param_list, hyperGTest)
GO.results.bp_2  

sapply(GO.results.bp_2, htmlReport, file=file.path(out_dir, "GO.results_2.html"),
       append=T, digits=7)

 

When I check the results in the html file or check them by

summary(GO.results.bp_1) 

summary(GO.results.bp_2)

I find the results commented before.

 

sessionInfo():

R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] stats4    grid      parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] org.Mm.eg.db_3.4.0     GOstats_2.40.0         graph_1.52.0           Category_2.40.0        Matrix_1.2-8          
 [6] AnnotationDbi_1.36.2   IRanges_2.8.1          S4Vectors_0.12.1       pcaMethods_1.66.0      Biobase_2.34.0        
[11] BiocGenerics_0.20.0    gplots_3.0.1           VennDiagram_1.6.17     futile.logger_1.4.3    biomaRt_2.30.0        
[16] rMQanalysis_0.3.1.9001 RColorBrewer_1.1-2     ggrepel_0.6.5          ggplot2_2.2.1          reshape2_1.4.2        
[21] plyr_1.8.4            

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9            lattice_0.20-34        GO.db_3.4.0            tidyr_0.6.1            zoo_1.7-14            
 [6] gtools_3.5.0           assertthat_0.1         digest_0.6.12          R6_2.2.0               futile.options_1.0.0  
[11] RSQLite_1.1-2          httr_1.2.1             lazyeval_0.2.0         data.table_1.10.4      annotate_1.52.1       
[16] gdata_2.17.0           splines_3.3.2          stringr_1.1.0          htmlwidgets_0.8        RCurl_1.95-4.8        
[21] munsell_0.4.3          base64enc_0.1-3        htmltools_0.3.5        tibble_1.2             XML_3.98-1.5          
[26] AnnotationForge_1.16.1 viridisLite_0.1.3      dplyr_0.5.0            bitops_1.0-6           RBGL_1.50.0           
[31] jsonlite_1.2           xtable_1.8-2           GSEABase_1.36.0        gtable_0.2.0           DBI_0.5-1             
[36] magrittr_1.5           scales_0.4.1           KernSmooth_2.23-15     stringi_1.1.2          genefilter_1.56.0     
[41] sp_1.2-4               splitstackshape_1.4.2  lambda.r_1.1.9         tools_3.3.2            logspline_2.1.9       
[46] purrr_0.2.2            survival_2.40-1        colorspace_1.3-2       caTools_1.17.1         memoise_1.0.0         
[51] plotly_4.5.6   

 

gostats hypergeometric test • 1.7k views
ADD COMMENT
4
Entering edit mode
@james-w-macdonald-5106
Last seen 10 hours ago
United States

Please note that the GO ontology is a directed acyclic graph (DAG), which operates like a dendrogram, where lower terms are combined into higher terms. So if you have a parent term that has two child terms, then all the genes that are annotated to the two child terms are also by default annotated to the parent term. So if you have a (fake) term called 'cellular signaling' and it has two child terms (also fake!) called 'positive cellular signaling' and 'negative cellular signaling', then ALL the genes that are considered to be 'negative cellular signaling' genes are also by default considered to be 'cellular signaling' genes as well. The same is true of 'positive cellular signaling' genes.

If you do a hypergeometric test for one of the child terms, and it is significant, then you could argue that testing at the parent level and still including all the genes that gave rise to a significant result in a child term would not be ideal, because the parent term might end up being significant too, but only because of the contribution from the genes that caused the child term to be significant. In general we want more significance at the lower branches of the DAG, because those are more specific terms that may imply perturbed pathways. But the higher terms are more broad and not as easily interpreted. For example, having a significant 'biological process' term is just boring and uninterpretable.

This is the basis of the 'conditional' hypergeometric test, which is what you use when you set conditional = TRUE (which is what you did). If a child term is significant, then the genes that gave rise to that significant result are eliminated from the test of the parent term. So the number of genes that are annotated to a particular GO term are dependent on how many of those genes were eliminated at lower levels of the GO DAG. If you set conditional = FALSE, the category size remains fixed and will be the same for both tests.

ADD COMMENT
0
Entering edit mode

Thank you for the good explanation James! Before it was not clear to me how the 'conditional' test works, now it is crystal clear.

ADD REPLY

Login before adding your answer.

Traffic: 755 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6