Question

Gene-GO-term relationship discrepancy between org.Hs.eg.db and geneontology.org

0

Entering edit mode

katharina.baum • 0

@katharinabaum-16998

Last seen 5.7 years ago

I observed a discrepancy between the GO-term association from org.Hs.eg.db to the information given in the AMIGO2 portal; I would highly appreciate to know why and how to reconcile it. I found it in the term "viral transcription" (which happened to be enriched in my analysis and I wanted to examine it further), "GO:0019083".

In particular, my gene list had > 100 gene hits in this term, while the AMIGO2 portal at geneontology.org shows only 64 direct and indirect gene/gene product annotations (http://amigo.geneontology.org/amigo/search/annotation?q=*:*&fq=regulates_closure:%22GO:0019083%22&sfq=document_category:%22annotation%22 - filter for Organism = human). I went to investigate further into org.Hs.eg.db using the select() function.

library(org.Hs.eg.db)

GO_virtcn <- select(org.Hs.eg.db, keys="GO:0019083", columns=c("GO","ONTOLOGY","SYMBOL","EVIDENCE"),keytype="GO") #retrieve all associations to GO term 0019083
length(unique(GO_virtcn$SYMBOL)) #more than 100 direct annotations are reported in this database

According to the manual, the org.Hs.eg.db object only contains direct annotations. What striked me most was that a bunch of those genes are simply ribosomal proteins (e.g. RPL10 - RPL18), which I would not per se consider being associated to viral forms of gene expression; and that genes associated to a child GO term, positive regulation by host of viral transcription, GO:0043923, did NOT occur in the list of genes:

GO_posregvirtcn <- select(org.Hs.eg.db, keys="GO:0043923", columns=c("GO","ONTOLOGY","SYMBOL","EVIDENCE"),keytype="GO")
sum(GO_posregvirtcn$SYMBOL %in% GO_virtcn$SYMBOL) #not a single overlap in genes

To confirm the information from the GO homepage on lower number of genes for viral transcription, I downloaded the annotation file goa_human.gaf.gz from http://www.geneontology.org/page/download-go-annotations, read it into R and queried the direct annotations for "GO:0019083":

go_ont_ebi <- read.table("P:\\goa_human.gaf",sep="\t",comment.char="",quote="",skip=23,stringsAsFactors = FALSE)
go_ont_ebi[go_ont_ebi$V5=="GO:0019083",] #this is empty --> no direct annotation for viral transcription
go_ont_ebi[go_ont_ebi$V5=="GO:0043923",] #there we find 18 annotations

There was not a single direct annotation for viral transcription (as indicated on the GO homepage; all 64 gene products were indirectly associated), but 18 genes were associated to the daughter term (which was reported to have direct association to 19 genes on the GO webpage; this discrepancy would be acceptable for me).

Why do I get this (apparent?) difference in gene association to the GO term viral transcription between the GO-delivered database and org.Hs.eg.db? And could this happen also for other GO terms?

Thank you very much in advance for support in solving this!

Relevant output from sessionInfo():

R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DBI_1.0.0             RSQLite_2.1.1         mgsa_1.28.0           gplots_3.0.1          org.Hs.eg.db_3.6.0    GO.db_3.6.0          
 [7] AnnotationDbi_1.42.1  IRanges_2.14.10       S4Vectors_0.18.3      Biobase_2.40.0        BiocGenerics_0.26.0   clusterProfiler_3.8.1

PS: I also observed >100 gene hits in viral transcription in GO overrepresentation analysis with the database packages (GO.db 3.4.0, org.Hs.eg.db 3.4.0) releases from 2016, so probably not an issue of very recent builts.

org.hs.eg.db • 2.3k views

ADD COMMENT • link updated 6.7 years ago by James W. MacDonald 68k • written 6.7 years ago by katharina.baum • 0

score 3 · Accepted Answer · 2018-08-22

The org.Hs.eg.db package provides mappings from NCBI Gene IDs to all other annotations, including GO. These mappings are based on data that we download from NCBI, and are simply repackaged into the OrgDb packages to make things easier for our end users. We don't do any vetting of the data, and assume that NCBI are competently doing their jobs.

The file that we use to map Gene IDs to GO is the gene2go file found here. The org.Hs.eg.db package you are using is based on data we downloaded in late March or early April, so is somewhat out of date (NCBI updates these files on a daily or weekly basis; in the interest of reproducibility we downlaod and build annotation packages semi-annually, so we make a trade-off between being absolutely current and having some stability in what's available). Anyway...

Here is what NCBI says about that file:

===========================================================================
gene2go                                         recalculated daily
---------------------------------------------------------------------------
           This file reports the GO terms that have been associated
            with Genes in Entrez Gene.
           It is generated by processing the gene_association files
            on the GO ftp site: 
            http://www.geneontology.org/GO.current.annotations.shtml
           and comparing the DB_Object_ID to annotation in Gene,
           as also reported in gene_info.gz

           Multiple gene_associations file may be used for any genome.
           If so, duplicate information is not reported; but unique
           contributions of GO terms, evidence codes, and citations are.
 
           The file that is used to establish the rules for 
           the files and fields that are used for each taxon 
           is documented in go_process.xml

Here's what is in that file. First the header

$ zcat gene2go.gz | head -n 3
#tax_id    GeneID    GO_ID    Evidence    Qualifier    GO_term    PubMed    Category
3702    814629    GO:0005634    ISM    -    nucleus    -    Component
3702    814629    GO:0008150    ND    -    biological_process    -    Process

Now for the GO term you are interested in

zcat gene2go.gz | awk '{if($1 == 9606 && $3 == "GO:0019083") print $0}' - | head -n 3
9606    3921    GO:0019083    TAS    -    viral transcription    -    Process
9606    4736    GO:0019083    TAS    -    viral transcription    -    Process
9606    4927    GO:0019083    TAS    -    viral transcription    -    Process

And finally the total number of such mappings

$ zcat gene2go.gz | awk '{if($1 == 9606 && $3 == "GO:0019083") print $0}' - | wc -l
108

Which agrees with what you are getting from the org.Hs.eg.db package. This may not correspond with what you see on the AmiGO site, but that really comes down to a disagreement between NCBI and the folks at geneontology, if in fact such a disagreement exists. What is meant by 'processing the gene_association files on the GO ftp site' is probably a pretty complex process, particularly given that the associations on GO are with UniProt and NCBI is mapping to Gene ID.