I observed a discrepancy between the GO-term association from org.Hs.eg.db to the information given in the AMIGO2 portal; I would highly appreciate to know why and how to reconcile it. I found it in the term "viral transcription" (which happened to be enriched in my analysis and I wanted to examine it further), "GO:0019083".
In particular, my gene list had > 100 gene hits in this term, while the AMIGO2 portal at geneontology.org shows only 64 direct and indirect gene/gene product annotations (http://amigo.geneontology.org/amigo/search/annotation?q=*:*&fq=regulates_closure:%22GO:0019083%22&sfq=document_category:%22annotation%22 - filter for Organism = human). I went to investigate further into org.Hs.eg.db using the select() function.
library(org.Hs.eg.db)
GO_virtcn <- select(org.Hs.eg.db, keys="GO:0019083", columns=c("GO","ONTOLOGY","SYMBOL","EVIDENCE"),keytype="GO") #retrieve all associations to GO term 0019083
length(unique(GO_virtcn$SYMBOL)) #more than 100 direct annotations are reported in this database
According to the manual, the org.Hs.eg.db object only contains direct annotations. What striked me most was that a bunch of those genes are simply ribosomal proteins (e.g. RPL10 - RPL18), which I would not per se consider being associated to viral forms of gene expression; and that genes associated to a child GO term, positive regulation by host of viral transcription, GO:0043923, did NOT occur in the list of genes:
GO_posregvirtcn <- select(org.Hs.eg.db, keys="GO:0043923", columns=c("GO","ONTOLOGY","SYMBOL","EVIDENCE"),keytype="GO")
sum(GO_posregvirtcn$SYMBOL %in% GO_virtcn$SYMBOL) #not a single overlap in genes
To confirm the information from the GO homepage on lower number of genes for viral transcription, I downloaded the annotation file goa_human.gaf.gz from http://www.geneontology.org/page/download-go-annotations, read it into R and queried the direct annotations for "GO:0019083":
go_ont_ebi <- read.table("P:\\goa_human.gaf",sep="\t",comment.char="",quote="",skip=23,stringsAsFactors = FALSE)
go_ont_ebi[go_ont_ebi$V5=="GO:0019083",] #this is empty --> no direct annotation for viral transcription
go_ont_ebi[go_ont_ebi$V5=="GO:0043923",] #there we find 18 annotations
There was not a single direct annotation for viral transcription (as indicated on the GO homepage; all 64 gene products were indirectly associated), but 18 genes were associated to the daughter term (which was reported to have direct association to 19 genes on the GO webpage; this discrepancy would be acceptable for me).
Why do I get this (apparent?) difference in gene association to the GO term viral transcription between the GO-delivered database and org.Hs.eg.db? And could this happen also for other GO terms?
Thank you very much in advance for support in solving this!
Relevant output from sessionInfo():
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] DBI_1.0.0 RSQLite_2.1.1 mgsa_1.28.0 gplots_3.0.1 org.Hs.eg.db_3.6.0 GO.db_3.6.0
[7] AnnotationDbi_1.42.1 IRanges_2.14.10 S4Vectors_0.18.3 Biobase_2.40.0 BiocGenerics_0.26.0 clusterProfiler_3.8.1
PS: I also observed >100 gene hits in viral transcription in GO overrepresentation analysis with the database packages (GO.db 3.4.0, org.Hs.eg.db 3.4.0) releases from 2016, so probably not an issue of very recent builts.
I thank you very much for this explanation! From the viewpoint of R, it is comprehensive, org.Hs.eg.db is really not to blame.
I guess I will have to ask at NCBI directly then - getting at all non-zero direct annotations for GO terms which originally had zero direct annotations is at least irritating (as this is really hard to be explained by any conversion of gene IDs from or to whatsoever only ;). Examining further, I found discrepancies in gene count of more than 5 genes (on the SYMBOL level) for more than 500 GO terms among the ca. 12500 GO BP terms...
I should point out that the GO hypergeometric test doesn't use the direct mappings anyway. By definition, all genes that are annotated to any progeny term are also annotated to a given term (e.g., all the genes that are annotated to positive regulation by host of viral transcription are also annotated to viral transcription, because the former is a subset of the latter).
When you do a GO hypergeometric test, it uses GOALL, not GO. If you use a conditional test there may be some genes that map to progeny terms that get removed because they gave rise to a significant result at a lower level of the GO DAG, but for an unconditional test you expect the count of genes that are in a given term to be the sum of all the progeny term genes, plus the direct associations.