hi,
I used GOstats to check for over- and under-represented GO terms in my gene lists, using a conditional test.
I then combined the three output files, i.e. the results for MF, BP and CC, into one list containing 175 under-represented terms:
> head(GO.Mat,2)
      GOBPID   Pvalue OddsRatio  ExpCount Count Size
1 GO:0006260 2.78e-21 0.1447017  74.52345    14  314
2 GO:0034641 3.38e-20 0.6109688 713.01916   539 3019
                                          Term  test type subset
1                              DNA replication under    P allMat
2 cellular nitrogen compound metabolic process under    P allMat
> nrow(GO.Mat)
[1] 175
Then I pulled out all unique GO terms contained in the universe (16921 genes), resulting in 2472 terms:
> allGOterms<-subset(gU_frame.new, !(duplicated(gU_frame.new$frame.go_id)), select="frame.go_id")
> nrow(allGOterms)
[1] 2472
I then merged both lists, i.e. the list containing the 175 under-represented terms (GO.Mat) and the unique GO terms from my universe (allGOterms). The idea was to get an overview of the under-represented terms in the form of a heatmap. Since every tested term should come from the universe, I was expecting 2472 entries in the merged list. However, this is not the case:
> nrow(allGOterms) #these are the unique GO terms from the universe
[1] 2472
> nrow(GO.Mat) #this is the results list with the under rep. GO terms
[1] 175
> allGOa.b<-merge(allGOterms, GO.Mat, by=1, all=T)
> nrow(allGOa.b)
[1] 2551
This leaves me with 79 GO terms (2551 - 2472) in my result list (GO.Mat) that are not present in the universe. Unless I have fundamentally misunderstood how GOstats works, this doesn't seem logical. Does anybody know why this happens, and how to deal with it?
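For reference, the terms in question could be listed with something like the sketch below. It uses small toy stand-ins for GO.Mat and allGOterms (the GO IDs are made up for illustration, and missingTerms is just a name chosen here), and it assumes the IDs within each table are unique:

```r
# Toy stand-ins for the real objects; the GO IDs are illustrative only.
allGOterms <- data.frame(frame.go_id = c("GO:0000001", "GO:0000002", "GO:0000003"),
                         stringsAsFactors = FALSE)
GO.Mat <- data.frame(GOBPID = c("GO:0000002", "GO:0000004"),
                     Pvalue = c(0.01, 0.02),
                     stringsAsFactors = FALSE)

# Terms that appear in the results but not in the universe term list
missingTerms <- setdiff(GO.Mat$GOBPID, allGOterms$frame.go_id)
missingTerms

# Full outer join on the first column, as in the merge() call above.
# With unique IDs, the row count is the universe terms plus the missing terms.
merged <- merge(allGOterms, GO.Mat, by = 1, all = TRUE)
nrow(merged) == nrow(allGOterms) + length(missingTerms)
```

On the real data, length(missingTerms) should come out as 79 if the extra rows are indeed result terms that are absent from allGOterms.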
To test for under- and over-representation I followed the GOstats vignette for unsupported organisms (https://bioconductor.statistik.tu-dortmund.de/packages/3.4/bioc/vignettes/GOstats/inst/doc/GOstatsForUnsupportedOrganisms.pdf).
thanks a lot
sven
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices datasets  utils
 [9] methods   base

other attached packages:
 [1] GSEABase_1.34.1      Rgraphviz_2.16.0     xtable_1.8-2         RColorBrewer_1.1-2
 [5] genefilter_1.54.2    annotate_1.50.1      XML_3.98-1.9         GO.db_3.3.0
 [9] hgu95av2.db_3.2.3    org.Hs.eg.db_3.3.0   ALL_1.14.0           GOstats_2.40.0
[13] graph_1.50.0         Category_2.38.0      Matrix_1.2-12        AnnotationDbi_1.34.4
[17] IRanges_2.6.1        S4Vectors_0.10.3     Biobase_2.32.0       BiocGenerics_0.18.0
[21] installr_0.18.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.14           bitops_1.0-6           digest_0.6.12
 [4] bit_1.1-12             RSQLite_2.0            memoise_1.1.0
 [7] tibble_1.3.4           lattice_0.20-35        pkgconfig_2.0.1
[10] rlang_0.1.4            DBI_0.7                bit64_0.9-7
[13] RBGL_1.48.1            survival_2.41-3        blob_1.1.0
[16] splines_3.3.2          AnnotationForge_1.14.2 RCurl_1.95-4.8
addendum:
my GOstats code would look something like this:
cheers
Sven