I would like to create a data.frame with both ENSEMBL and ENTREZID gene identifiers for several species, including the human and the Macaca fascicularis genomes.
I noticed that the latest EnsDb object available (AH100643
= ensemble version 106) for the human genome only contains ENTREZIDs for ~ 40% of the ENSEMBL gene identifiers. (See reproducible example below.) In contrast, the org.Hs.eg.db
organism annotation package has ENSEMBL => ENTREZID mappings for ~ 55% of (the same) ENSEMBL genes.
For the human genome, I can chose to use the org.Hs.eg.db
package as the source of annotations (because it seems to be more comprehensive). But for Macaca fascicularis, I don't have that option because no annotation package exists in Bioconductor 3.15.
There is the Macaca fascicularis AH100664
resource (ensembl v106) in the AnnotationHub. Yet, based on my experience with the human annotations, I am wondering if I am missing ENSEMBL => ENTREZID mappings by relying on it?
Is there a better way to retrieve the most comprehensive gene identifier mappings for this species? Or perhaps I am not using the EnsDb object from the AnnotationHub provides correctly?
Many thanks for any pointers and suggestions!
Reproducible example
library(AnnotationDbi)
library(AnnotationHub)
library(ensembldb)
library(org.Hs.eg.db)
kResource <- "AH100643" # ensembl v106, Homo sapiens
kCacheDir <- getAnnotationHubOption("CACHE")
dir.create(kCacheDir, showWarnings = FALSE, recursive = TRUE)
ah = AnnotationHub(cache = kCacheDir)
# snapshotDate(): 2022-04-21
ah[kResource]
# AnnotationHub with 1 record
# snapshotDate(): 2022-04-21
# names(): AH100643
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2022-04-21
# $title: Ensembl 106 EnsDb for Homo sapiens
# $description: Gene and protein annotations for Homo sapiens based on Ensembl ...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("106", "Annotation", "AnnotationHubSoftware", "Coverage",
# "DataImport", "EnsDb", "Ensembl", "Gene", "Protein", "Sequencing",
# "Transcript")
# retrieve record with 'object[["AH100643"]]'
ensdb <- ah[[kResource]]
gene_ids <- select(ensdb, keys(ensdb),
columns = "ENTREZID",
keytype = "GENEID")
# ~ 60% of the ENSEMBL gene identifiers are not mapped to ENTREZIDs
table(is.na(gene_ids$ENTREZID))
# FALSE TRUE
# 27312 42296
# for example ENSEMBL gene ENSG00000001629 (ANKIB1) without an ENTREZID
gene_ids[gene_ids$GENEID == "ENSG00000001629", ]
# GENEID ENTREZID
# 17 ENSG00000001629 NA
# in contrast, the org.Hs.eg.db Bioconductor annotation package
# maps many more ensembl identifiers to their ENTREZIDs
from_bioc <- select(org.Hs.eg.db, keys(ensdb),
columns = "ENTREZID", keytype = "ENSEMBL")
table(is.na(from_bioc$ENTREZID))
# FALSE TRUE
# 39007 30705
# For example, ENSEMBL gene ENSG00000001629 is mapped in org.Hs.eg.db
select(org.Hs.eg.db, "ENSG00000001629", columns = "ENTREZID",
keytype = "ENSEMBL")
# ENTREZID ENSEMBL
# 1 54467 ENSG00000001629
Session Info
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.5
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 stats graphics grDevices datasets utils methods base
other attached packages:
[1] org.Hs.eg.db_3.15.0 ensembldb_2.20.2 AnnotationFilter_1.20.0
[4] GenomicFeatures_1.48.3 GenomicRanges_1.48.0 GenomeInfoDb_1.32.2
[7] AnnotationHub_3.4.0 BiocFileCache_2.4.0 dbplyr_2.2.1
[10] AnnotationDbi_1.58.0 IRanges_2.30.0 S4Vectors_0.34.0
[13] Biobase_2.56.0 BiocGenerics_0.42.0
loaded via a namespace (and not attached):
[1] MatrixGenerics_1.8.1 httr_1.4.3
[3] bit64_4.0.5 shiny_1.7.2
[5] assertthat_0.2.1 interactiveDisplayBase_1.34.0
[7] BiocManager_1.30.18 blob_1.2.3
[9] renv_0.15.5 GenomeInfoDbData_1.2.8
[11] Rsamtools_2.12.0 yaml_2.3.5
[13] progress_1.2.2 BiocVersion_3.15.2
[15] lattice_0.20-45 pillar_1.7.0
[17] RSQLite_2.2.15 glue_1.6.2
[19] digest_0.6.29 promises_1.2.0.1
[21] XVector_0.36.0 Matrix_1.4-1
[23] htmltools_0.5.2 httpuv_1.6.5
[25] XML_3.99-0.10 pkgconfig_2.0.3
[27] biomaRt_2.52.0 zlibbioc_1.42.0
[29] purrr_0.3.4 xtable_1.8-4
[31] later_1.3.0 BiocParallel_1.30.3
[33] tibble_3.1.7 KEGGREST_1.36.2
[35] generics_0.1.3 ellipsis_0.3.2
[37] SummarizedExperiment_1.26.1 cachem_1.0.6
[39] lazyeval_0.2.2 cli_3.3.0
[41] magrittr_2.0.3 crayon_1.5.1
[43] mime_0.12 memoise_2.0.1
[45] fansi_1.0.3 xml2_1.3.3
[47] tools_4.2.1 prettyunits_1.1.1
[49] hms_1.1.1 matrixStats_0.62.0
[51] BiocIO_1.6.0 lifecycle_1.0.1
[53] stringr_1.4.0 DelayedArray_0.22.0
[55] Biostrings_2.64.0 compiler_4.2.1
[57] rlang_1.0.3 grid_4.2.1
[59] RCurl_1.98-1.7 rstudioapi_0.13
[61] rjson_0.2.21 rappdirs_0.3.3
[63] bitops_1.0-7 restfulr_0.0.15
[65] codetools_0.2-18 DBI_1.1.3
[67] curl_4.3.2 R6_2.5.1
[69] GenomicAlignments_1.32.0 dplyr_1.0.9
[71] rtracklayer_1.56.1 fastmap_1.1.0
[73] bit_4.0.4 utf8_1.2.2
[75] filelock_1.0.2 ProtGenerics_1.28.0
[77] stringi_1.7.8 parallel_4.2.1
[79] Rcpp_1.0.9 vctrs_0.4.1
[81] png_0.1-7 tidyselect_1.1.2
There are 260 genes that NCBI can't/hasn't yet mapped to Ensembl
Wow, thank you so much for helping me understand the relationships between the different annotation sources, and for pointing me to MANE. _Super_ helpful and much appreciated!