Question

GO Stats results contain GO terms NOT present in the gene universe

1

Entering edit mode

sven.schenk ▴ 30

@svenschenk-15837

Last seen 6.0 years ago

Vienna

hi,

i used GOstats to check for over and under represented GO terms in my gene lists using a conditional test.

i then combined the three output files, i.e. the files for MF, BP and CC, into one list containing 175 under represented terms:

> head(GO.Mat,2)
      GOBPID   Pvalue OddsRatio  ExpCount Count Size
1 GO:0006260 2.78e-21 0.1447017  74.52345    14  314
2 GO:0034641 3.38e-20 0.6109688 713.01916   539 3019
                                          Term      test type subset
1                              DNA replication under    P allMat
2 cellular nitrogen compound metabolic process  under    P allMat

> nrow(GO.Mat)
[1] 175

then i pulled out all unique GO terms contained in the universe (16921 genes) resulting in 2472 terms:

> allGOterms<-subset(gU_frame.new, !(duplicated(gU_frame.new$frame.go_id)), select="frame.go_id")
> nrow(allGOterms)
[1] 2472

i now merged both lists, i.e. the list containing the 175 under represented terms (GO.Mat) and the unique GO terms (allGOterms) from my universe (the idea was to get an overview of the under represented terms in form of a heatmap). as result i was expecting to get 2472 entries in this merged list, however, this is not the case:

> nrow(allGOterms)  #these are the unique GO terms from the universe
[1] 2472

> nrow(GO.Mat)   #this is the resluts list with the under rep. GOterms
[1] 175

> allGOa.b<-merge(allGOterms, GO.Mat, by=1, all=T)
> nrow(allGOa.b)
[1] 2551

this leaves me with 79 GO terms in my result list (GO.Mat), which are not present in the universe. unless i fundamentally misunderstood the principle of GOstats, this doesn`t seem logic. does anybody know, why this happens? and how to deal with it?

To test for under and over representation i followed the guideline for unsupported organisms (https://bioconductor.statistik.tu-dortmund.de/packages/3.4/bioc/vignettes/GOstats/inst/doc/GOstatsForUnsupportedOrganisms.pdf).

thanks a lot

sven

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices datasets  utils    
 [9] methods   base     

other attached packages:
 [1] GSEABase_1.34.1      Rgraphviz_2.16.0     xtable_1.8-2         RColorBrewer_1.1-2  
 [5] genefilter_1.54.2    annotate_1.50.1      XML_3.98-1.9         GO.db_3.3.0         
 [9] hgu95av2.db_3.2.3    org.Hs.eg.db_3.3.0   ALL_1.14.0           GOstats_2.40.0      
[13] graph_1.50.0         Category_2.38.0      Matrix_1.2-12        AnnotationDbi_1.34.4
[17] IRanges_2.6.1        S4Vectors_0.10.3     Biobase_2.32.0       BiocGenerics_0.18.0
[21] installr_0.18.0     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.14           bitops_1.0-6           digest_0.6.12         
 [4] bit_1.1-12             RSQLite_2.0            memoise_1.1.0         
 [7] tibble_1.3.4           lattice_0.20-35        pkgconfig_2.0.1       
[10] rlang_0.1.4            DBI_0.7                bit64_0.9-7           
[13] RBGL_1.48.1            survival_2.41-3        blob_1.1.0            
[16] splines_3.3.2          AnnotationForge_1.14.2 RCurl_1.95-4.8        
>

gostats • 1.7k views

ADD COMMENT • link updated 6.9 years ago by emmascottgm • 0 • written 6.9 years ago by sven.schenk ▴ 30

0

Entering edit mode

addendum:

my GOStats code would look something like this:

>goFrame_gU.new.H = GOFrame(gU_frame.new, organism="Homo sapiens")
>goALLFrame.gU.new.H = GOAllFrame(goFrame_gU.new.H)

>ids <-c("comp2258926_c0_seq3","comp2258476_c0_seq2","comp2246178_c0_seq1","comp2268190_c1_seq4")

>universe<-as.character(gU_frame.new$frame.gene_id)

>genes<-ids

>params.BP_under <- GSEAGOHyperGParams(name="Under_Represented BP_Conditional Params",
geneSetCollection=gsc.gU.H,
geneIds = genes,
universeGeneIds = universe,
ontology = "BP",
pvalueCutoff = 0.05,
conditional = TRUE,
testDirection = "under")

>genes.BP_under <- hyperGTest(params.BP_under)
genes.BP_under
>x<-summary(genes.BP_under)

cheers

Sven

ADD REPLY • link 6.9 years ago sven.schenk ▴ 30

score 1 · Answer 1 · 2018-05-17

You are using a very outdated version of R/Bioconductor. Do note that every release we update GO.db to contain the current set of IDs. If you have a temporal mismatch between what is in the GO.db you have installed and the source of GO IDs, then it's to be expected that things will be different. These things change, and more rapidly than you might imagine. The first step is to install a current version of R and Bioconductor, and see if that helps.

Martin Morgan · Answer 2 · 2018-05-18

0

Entering edit mode

sven.schenk ▴ 30

@svenschenk-15837

Last seen 6.0 years ago

Vienna

hi,

thanks for your answer. so i updated my GO.db from 3.3.0 to 3.6.0 and rerun the analysis, however the problem persits (it even gets more pronounced). so i think the problem is the time of annotaion of my transcriptome which was in July 16 and my old GO.db (3.3.0) was form October 16 wich led to the differences and mow with the new db version it gets even "worse". however, what i don`t understand is why this happens. i thought the "to-test-gene list" was tested against the universe, and if a GO term is not present in the universe, then it cannot appear in the result. i would understand if this to happe if i would provide sequences but that i don't do; i just provide sequence identifiers from a de nove transcriptome assembly.

so i guess i would have to check for an even older db than 3.3.0 to resolve the issue (or re-annotate the transcriptome).

cheers

sven

ADD COMMENT • link updated 6.9 years ago by Martin Morgan 25k • written 6.9 years ago by sven.schenk ▴ 30

0

Entering edit mode

We just (like days ago) changed the link below to say this:

If you want to provide a solution to the original question, click Add Answer .
Use the ADD COMMENT or ADD REPLY buttons to ask for clarifications, request more details, or respond to a previous answer or comment.

With the intention of keeping people from using the Add Answer button to add an additional question or comment, rather than the ADD COMMENT button. We even made ADD COMMENT like all green and big. Yet you used Add Answer rather than ADD COMMENT. So I'm interested - was that just from some force of habit, or did you just not read what it said there, or is this somehow confusing? We are really trying to get people to use the support site as intended, so it's helpful to know why people misuse it.

Now to your question. I previously said that GO is a moving target and terms come in and come out with some regularity, and said you probably needed to update your installation. Implicit in that (I thought) is the idea that this included updating the mappings you have between whatever IDs you started with, and the GO terms appended to those IDs. If you update the GO.db package yet continue to use old mappings for your transcripts are you really surprised that it gets even worse? That would be my expectation!

ADD REPLY • link 6.9 years ago James W. MacDonald 68k

0

Entering edit mode

I should also note that what you wrote implies to me that you simply updated the GO.db package, rather than the entirety of your R/Bioc installation. Do note that the whole premise of the Bioconductor project is to be able to supply a set of coherent packages that are all guaranteed to work together, and to that end we have an installer called biocLite that ensures that all the packages installed come from the same development cycle.

You are free to mix and match things as you wish, but in so doing you are signaling to us that you think you know what you are doing. That's perfectly fine, but it's hard enough for people to help those with conventional installations, and we cannot provide support for those who have unconventional installations (of which there are nearly a limitless number).

ADD REPLY • link 6.9 years ago James W. MacDonald 68k

0

Entering edit mode

hi again,

sorry for not using the COMMENT/REPLY button.

no, i did not just update GO.db, i updated the whole installation from 3.3.2 to 3.5.0 it wouldn't work otherwise. yes, you're right that it would be expected that i find more non-matching GO terms if i use a newer DB, however, even if i use a matching DB (e.g. October or April 2016) i.e. when i annotated my sequences the problem that i find GO terms in my result that weren`t in the universe are in the results persits.

i still think that i should only be able to find in the results what was also annotated in the universe, right?

cheers

sven

ADD REPLY • link 6.9 years ago sven.schenk ▴ 30

1

Entering edit mode

OK, I think I get what you are asking. You have this data.frame with GO terms that you are starting with, and when you do the hypergeometric test you end up with GO terms that weren't in that original data.frame, right? This is because you aren't using the data in your original data.frame any more. When you use GOFrame and GOAllFrame, you are making queries to the GO.db package to recreate the GO directed acyclic graph. This will add GO terms that might not have been in your original data.frame. As an example, we can use the example from ?GOFrame:

> example(GOFrame)

GOFram>   ## Make up an example
GOFram>   genes = c(1,10,100)

GOFram>   evi = c("ND","IEA","IDA")

GOFram>   GOIds = c("GO:0008150","GO:0008152","GO:0001666")

GOFram>   frameData = data.frame(cbind(GOIds,evi,genes))

GOFram>   library(AnnotationDbi)

GOFram>   frame=GOFrame(frameData,organism="Homo sapiens")

GOFram>   allFrame=GOAllFrame(frame)

GOFram>   getGOFrameData(allFrame)
        go_id evidence gene_id
1  GO:0001666      IDA     100
2  GO:0006950      IDA     100
3  GO:0008150      IDA     100
4  GO:0008150      IEA      10
5  GO:0008150       ND       1
6  GO:0008152      IEA      10
7  GO:0009628      IDA     100
8  GO:0036293      IDA     100
9  GO:0050896      IDA     100
10 GO:0070482      IDA     100

So we started with three GO IDs and gained five more.

ADD REPLY • link 6.9 years ago James W. MacDonald 68k

0

Entering edit mode

And note that the universe here isn't a set of GO IDs, but instead some other IDs. So talking about GO terms that aren't in the universe doesn't really make sense.

ADD REPLY • link 6.9 years ago James W. MacDonald 68k

0

Entering edit mode

hi,

thanks for your answer. i came to the same conclusion, when checking what`s happening during building the GOallframe - in this I have something like 4700 Go terms insteadt of 2472 in my input list. so it was indeed a misunderstanding how GOstats works.

thanks again for the answer and the additional explanation.

sven

ADD REPLY • link 6.9 years ago sven.schenk ▴ 30