Question

Differences in ENSEMBL => ENTREZID mapping coverage between org.Hs.eg.db and AnnotationHub resources

0

Entering edit mode

sandmann.t ▴ 70

@sandmannt-11014

Last seen 19 months ago

United States

I would like to create a data.frame with both ENSEMBL and ENTREZID gene identifiers for several species, including the human and the Macaca fascicularis genomes.

I noticed that the latest EnsDb object available (AH100643 = ensemble version 106) for the human genome only contains ENTREZIDs for ~ 40% of the ENSEMBL gene identifiers. (See reproducible example below.) In contrast, the org.Hs.eg.db organism annotation package has ENSEMBL => ENTREZID mappings for ~ 55% of (the same) ENSEMBL genes.

For the human genome, I can chose to use the org.Hs.eg.db package as the source of annotations (because it seems to be more comprehensive). But for Macaca fascicularis, I don't have that option because no annotation package exists in Bioconductor 3.15.

There is the Macaca fascicularis AH100664 resource (ensembl v106) in the AnnotationHub. Yet, based on my experience with the human annotations, I am wondering if I am missing ENSEMBL => ENTREZID mappings by relying on it?

Is there a better way to retrieve the most comprehensive gene identifier mappings for this species? Or perhaps I am not using the EnsDb object from the AnnotationHub provides correctly?

Many thanks for any pointers and suggestions!

Reproducible example

library(AnnotationDbi)
library(AnnotationHub)
library(ensembldb)
library(org.Hs.eg.db)

kResource <- "AH100643" # ensembl v106, Homo sapiens
kCacheDir <- getAnnotationHubOption("CACHE")

dir.create(kCacheDir, showWarnings = FALSE, recursive = TRUE)
ah = AnnotationHub(cache = kCacheDir)
# snapshotDate(): 2022-04-21

ah[kResource]

# AnnotationHub with 1 record
# snapshotDate(): 2022-04-21
# names(): AH100643
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2022-04-21
# $title: Ensembl 106 EnsDb for Homo sapiens
# $description: Gene and protein annotations for Homo sapiens based on Ensembl ...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("106", "Annotation", "AnnotationHubSoftware", "Coverage",
#   "DataImport", "EnsDb", "Ensembl", "Gene", "Protein", "Sequencing",
#   "Transcript") 
# retrieve record with 'object[["AH100643"]]' 

ensdb <- ah[[kResource]]
gene_ids <- select(ensdb, keys(ensdb), 
                   columns = "ENTREZID",
                   keytype = "GENEID")

# ~ 60% of the ENSEMBL gene identifiers are not mapped to ENTREZIDs
table(is.na(gene_ids$ENTREZID))
# FALSE  TRUE 
# 27312 42296 

# for example ENSEMBL gene ENSG00000001629 (ANKIB1) without an ENTREZID
gene_ids[gene_ids$GENEID == "ENSG00000001629", ]
# GENEID ENTREZID
# 17 ENSG00000001629       NA

# in contrast, the org.Hs.eg.db Bioconductor annotation package
# maps many more ensembl identifiers to their ENTREZIDs
from_bioc <- select(org.Hs.eg.db, keys(ensdb),
                    columns = "ENTREZID", keytype = "ENSEMBL")
table(is.na(from_bioc$ENTREZID))
# FALSE  TRUE 
# 39007 30705 

# For example, ENSEMBL gene ENSG00000001629 is mapped in org.Hs.eg.db
select(org.Hs.eg.db, "ENSG00000001629", columns = "ENTREZID",
       keytype = "ENSEMBL")
# ENTREZID         ENSEMBL
# 1    54467 ENSG00000001629

Session Info

R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.5

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] org.Hs.eg.db_3.15.0     ensembldb_2.20.2        AnnotationFilter_1.20.0
 [4] GenomicFeatures_1.48.3  GenomicRanges_1.48.0    GenomeInfoDb_1.32.2    
 [7] AnnotationHub_3.4.0     BiocFileCache_2.4.0     dbplyr_2.2.1           
[10] AnnotationDbi_1.58.0    IRanges_2.30.0          S4Vectors_0.34.0       
[13] Biobase_2.56.0          BiocGenerics_0.42.0    

loaded via a namespace (and not attached):
 [1] MatrixGenerics_1.8.1          httr_1.4.3                   
 [3] bit64_4.0.5                   shiny_1.7.2                  
 [5] assertthat_0.2.1              interactiveDisplayBase_1.34.0
 [7] BiocManager_1.30.18           blob_1.2.3                   
 [9] renv_0.15.5                   GenomeInfoDbData_1.2.8       
[11] Rsamtools_2.12.0              yaml_2.3.5                   
[13] progress_1.2.2                BiocVersion_3.15.2           
[15] lattice_0.20-45               pillar_1.7.0                 
[17] RSQLite_2.2.15                glue_1.6.2                   
[19] digest_0.6.29                 promises_1.2.0.1             
[21] XVector_0.36.0                Matrix_1.4-1                 
[23] htmltools_0.5.2               httpuv_1.6.5                 
[25] XML_3.99-0.10                 pkgconfig_2.0.3              
[27] biomaRt_2.52.0                zlibbioc_1.42.0              
[29] purrr_0.3.4                   xtable_1.8-4                 
[31] later_1.3.0                   BiocParallel_1.30.3          
[33] tibble_3.1.7                  KEGGREST_1.36.2              
[35] generics_0.1.3                ellipsis_0.3.2               
[37] SummarizedExperiment_1.26.1   cachem_1.0.6                 
[39] lazyeval_0.2.2                cli_3.3.0                    
[41] magrittr_2.0.3                crayon_1.5.1                 
[43] mime_0.12                     memoise_2.0.1                
[45] fansi_1.0.3                   xml2_1.3.3                   
[47] tools_4.2.1                   prettyunits_1.1.1            
[49] hms_1.1.1                     matrixStats_0.62.0           
[51] BiocIO_1.6.0                  lifecycle_1.0.1              
[53] stringr_1.4.0                 DelayedArray_0.22.0          
[55] Biostrings_2.64.0             compiler_4.2.1               
[57] rlang_1.0.3                   grid_4.2.1                   
[59] RCurl_1.98-1.7                rstudioapi_0.13              
[61] rjson_0.2.21                  rappdirs_0.3.3               
[63] bitops_1.0-7                  restfulr_0.0.15              
[65] codetools_0.2-18              DBI_1.1.3                    
[67] curl_4.3.2                    R6_2.5.1                     
[69] GenomicAlignments_1.32.0      dplyr_1.0.9                  
[71] rtracklayer_1.56.1            fastmap_1.1.0                
[73] bit_4.0.4                     utf8_1.2.2                   
[75] filelock_1.0.2                ProtGenerics_1.28.0          
[77] stringi_1.7.8                 parallel_4.2.1               
[79] Rcpp_1.0.9                    vctrs_0.4.1                  
[81] png_0.1-7                     tidyselect_1.1.2

AnnotationHub org.Hs.eg.db ensembldb • 2.2k views

ADD COMMENT • link 2.7 years ago sandmann.t ▴ 70

score 0 · Answer 1 · 2022-08-01

This is a longstanding issue that has to do with how each annotation service defines what a gene is, where they are, and whether or not a given gene in the same relative position for both services is in fact the same gene. Actually, it's the transcripts, as a gene is a simplification of the underlying transcripts and is in some sense not really a thing.

Anyway, NCBI and EBI have been working on something called MANE, where they are trying to find one (!) transcript per gene that they think is the same thing for both annotation services. It's taken years, which is a testament to how hard it is to do and is also a clear indication of what we really know about the genome and genes and whatnot. You can get the data from either NCBI or EBI - I'll get it from NCBI.

> download.file("https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/release_1.0/MANE.GRCh38.v1.0.summary.txt.gz", "MANE.GRCh38.v1.0.summary.txt.gz")
trying URL 'https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/release_1.0/MANE.GRCh38.v1.0.summary.txt.gz'
Content type 'application/x-gzip' length 1083306 bytes (1.0 MB)
downloaded 1.0 MB

## can't read in directly because there are embedded nulls
> system("gzip -d MANE.GRCh38.v1.0.summary.txt.gz")
[1] 0
> z <- read.delim("MANE.GRCh38.v1.0.summary.txt")
> head(z)
  X.NCBI_GeneID       Ensembl_Gene   HGNC_ID   symbol                      name
1      GeneID:1 ENSG00000121410.12    HGNC:5     A1BG    alpha-1-B glycoprotein
2      GeneID:2 ENSG00000175899.15    HGNC:7      A2M     alpha-2-macroglobulin
3      GeneID:9 ENSG00000171428.15 HGNC:7645     NAT1     N-acetyltransferase 1
4     GeneID:10  ENSG00000156006.5 HGNC:7646     NAT2     N-acetyltransferase 2
5     GeneID:12 ENSG00000196136.18   HGNC:16 SERPINA3  serpin family A member 3
6     GeneID:13 ENSG00000114771.14   HGNC:17    AADAC arylacetamide deacetylase
   RefSeq_nuc RefSeq_prot        Ensembl_nuc      Ensembl_prot MANE_status
1 NM_130786.4 NP_570602.2  ENST00000263100.8 ENSP00000263100.2 MANE Select
2 NM_000014.6 NP_000005.3 ENST00000318602.12 ENSP00000323929.8 MANE Select
3 NM_000662.8 NP_000653.3  ENST00000307719.9 ENSP00000307218.4 MANE Select
4 NM_000015.3 NP_000006.2  ENST00000286479.4 ENSP00000286479.3 MANE Select
5 NM_001085.5 NP_001076.2  ENST00000393078.5 ENSP00000376793.3 MANE Select
6 NM_001086.3 NP_001077.2 ENST00000232892.12 ENSP00000232892.6 MANE Select
  GRCh38_chr chr_start   chr_end chr_strand
1         19  58345183  58353492          -
2         12   9067708   9115919          -
3          8  18210109  18223689          +
4          8  18391282  18401218          +
5         14  94612391  94624053          +
6          3 151814116 151828488          +

That's probably not complete, but it will most likely be superior to what you get from the OrgDb or EnsDb packages.