Question

org.Hs.eg.db error: same Ensembl ID returned for different genes

fengshou.ma ▴ 10

select(org.Hs.eg.db, keys = "MIR15A", keytype = "SYMBOL", columns = c("SYMBOL","GENENAME","ENSEMBL"))
'select()' returned 1:1 mapping between keys and columns
  SYMBOL     GENENAME         ENSEMBL
1 MIR15A microRNA 15a ENSG00000231607

select(org.Hs.eg.db, keys = "DLEU2", keytype = "SYMBOL", columns = c("SYMBOL","GENENAME","ENSEMBL"))
'select()' returned 1:1 mapping between keys and columns
  SYMBOL                                               GENENAME         ENSEMBL
1  DLEU2 deleted in lymphocytic leukemia 2 (non-protein coding) ENSG00000231607

But the Ensembl ID of MIR15A is ENSG00000283785, not ENSG00000231607.

R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936  LC_CTYPE=Chinese (Simplified)_China.936    LC_MONETARY=Chinese (Simplified)_China.936
[4] LC_NUMERIC=C                               LC_TIME=Chinese (Simplified)_China.936    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] org.Hs.eg.db_3.4.1    AnnotationDbi_1.38.2  IRanges_2.10.2        S4Vectors_0.14.3      Biobase_2.36.2        BiocGenerics_0.22.0  
[7] clusterProfiler_3.4.4 DOSE_3.2.0           

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12        compiler_3.4.0      plyr_1.8.4          tools_3.4.0         digest_0.6.12       bit_1.1-12          RSQLite_2.0        
 [8] memoise_1.1.0       tibble_1.3.3        gtable_0.2.0        pkgconfig_2.0.1     rlang_0.1.2         fastmatch_1.1-0     igraph_1.1.2       
[15] DBI_0.7             rvcheck_0.0.9       fgsea_1.2.1         gridExtra_2.2.1     stringr_1.2.0       bit64_0.9-7         grid_3.4.0         
[22] glue_1.1.1          qvalue_2.8.0        data.table_1.10.4   BiocParallel_1.10.1 GOSemSim_2.2.0      purrr_0.2.3         tidyr_0.7.0        
[29] GO.db_3.4.1         DO.db_2.9           ggplot2_2.2.1       reshape2_1.4.2      blob_1.1.0          magrittr_1.5        splines_3.4.0      
[36] scales_0.4.1        colorspace_1.3-2    stringi_1.1.5       lazyeval_0.2.0      munsell_0.4.3

score 2 · Answer 1 · 2017-09-05

If you're working with Ensembl annotations, I would stick to annotation resources that were built on Ensembl-provided data (such as biomaRt or ensembldb). This also avoids potential problems and multi-mappings between Ensembl and NCBI. AFAIK the .eg. packages are built using information from NCBI, and some discrepancies might be explained by the mapping between the databases (NCBI <-> Ensembl).

cheers, jo

The mapping using ensembldb:

> library(AnnotationHub)
> ah <- AnnotationHub()
snapshotDate(): 2017-04-25
> query(ah, c("EnsDb", "Hsapiens"))
AnnotationHub with 2 records
# snapshotDate(): 2017-04-25
# $dataprovider: Ensembl
# $species: Homo Sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53211"]]'
            title                            
  AH53211 | Ensembl 87 EnsDb for Homo Sapiens
  AH53715 | Ensembl 88 EnsDb for Homo Sapiens
> edb <- ah[["AH53715"]]
loading from cache '/Users/jo//.AnnotationHub/60453'
> select(edb, keys = "MIR15A", keytype = "SYMBOL", columns = c("SYMBOL","DESCRIPTION","GENEID"))
  SYMBOL                                      DESCRIPTION          GENEID
1 MIR15A microRNA 15a [Source:HGNC Symbol;Acc:HGNC:31543] ENSG00000283785

score 1 · Answer 2 · 2017-09-06

The mapping is not incorrect, but it is based on older resources from NCBI. The most recent version of org.Hs.eg.db (3.4.1) was built from NCBI resources on March 29th, 2017. We do not continuously rebuild our annotation resources, so that researchers who use them get reproducible results (the results from using an annotation package do not change from day to day). If you are looking for a more up-to-date way of obtaining annotation information, you can use a Bioconductor package that queries a live database, such as biomaRt, which accesses Ensembl's BioMart service. Below I've included two examples. The first shows the date on which the annotation resource was built for an org package. The second shows a method of obtaining the most up-to-date annotation information using biomaRt.

Here, the attribute EGSOURCEDATE shows the date the annotation information was obtained from NCBI to build the package.

library(org.Hs.eg.db)

org.Hs.eg.db
#> OrgDb object:
#> | DBSCHEMAVERSION: 2.1
#> | Db type: OrgDb
#> | Supporting package: AnnotationDbi
#> | DBSCHEMA: HUMAN_DB
#> | ORGANISM: Homo sapiens
#> | SPECIES: Human
#> | EGSOURCEDATE: 2017-Mar29
#> | EGSOURCENAME: Entrez Gene
#> | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
#> | CENTRALID: EG
#> | TAXID: 9606
#> | GOSOURCENAME: Gene Ontology
#> | GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
#> | GOSOURCEDATE: 2017-Mar29
#> | GOEGSOURCEDATE: 2017-Mar29
#> | GOEGSOURCENAME: Entrez Gene
#> | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
#> | KEGGSOURCENAME: KEGG GENOME
#> | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
#> | KEGGSOURCEDATE: 2011-Mar15
#> | GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
#> | GPSOURCEURL: 
#> | GPSOURCEDATE: 2017-Mar17
#> | ENSOURCEDATE: 2017-Mar29
#> | ENSOURCENAME: Ensembl
#> | ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
#> | UPSOURCENAME: Uniprot
#> | UPSOURCEURL: http://www.UniProt.org/
#> | UPSOURCEDATE: Wed Apr  5 02:52:37 2017
#> 
#> Please see: help('select') for usage information

Here, we use biomaRt to obtain the most up-to-date annotation information.

library(biomaRt)
ensembl <- useMart("ensembl")
ensembl <- useDataset("hsapiens_gene_ensembl",mart=ensembl)
getBM(attributes=c("hgnc_symbol", "ensembl_gene_id"),
    filters=c('hgnc_symbol'),
    values= c('MIR15A','DLEU2'),
    mart=ensembl)
#>   hgnc_symbol ensembl_gene_id
#> 1       DLEU2 ENSG00000231607
#> 2      MIR15A ENSG00000283785
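
To check whether other symbols share an Ensembl ID the way MIR15A and DLEU2 do in org.Hs.eg.db 3.4.1, one could flag every Ensembl ID that is attached to more than one symbol. The following is a minimal base-R sketch, not part of any package: `shared_ensembl_ids()` is a hypothetical helper, and the toy table holds the two rows reported in the question plus one made-up placeholder row.

```r
# Hypothetical helper: return the rows of a SYMBOL/ENSEMBL mapping table
# whose Ensembl ID is attached to more than one distinct symbol.
shared_ensembl_ids <- function(map) {
  stopifnot(all(c("SYMBOL", "ENSEMBL") %in% names(map)))
  # Drop symbols without an Ensembl ID and exact duplicate rows.
  map <- unique(map[!is.na(map$ENSEMBL), c("SYMBOL", "ENSEMBL")])
  # Count distinct symbols per Ensembl ID.
  n_symbols <- table(map$ENSEMBL)
  shared <- names(n_symbols)[n_symbols > 1]
  out <- map[map$ENSEMBL %in% shared, ]
  out <- out[order(out$ENSEMBL, out$SYMBOL), ]
  rownames(out) <- NULL
  out
}

# Toy input: the first two rows are the org.Hs.eg.db 3.4.1 results from the
# question; the third row is a made-up placeholder, not real annotation.
toy_map <- data.frame(
  SYMBOL  = c("MIR15A", "DLEU2", "GENE_X"),
  ENSEMBL = c("ENSG00000231607", "ENSG00000231607", "ENSG_PLACEHOLDER"),
  stringsAsFactors = FALSE
)

result <- shared_ensembl_ids(toy_map)
print(result)
```

With the annotation package loaded, the same helper could presumably be applied to the data frame returned by `select(org.Hs.eg.db, keys = keys(org.Hs.eg.db, keytype = "SYMBOL"), keytype = "SYMBOL", columns = "ENSEMBL")`. Each shared ID it reports would be a candidate for the kind of NCBI-to-Ensembl discrepancy discussed in this thread, to be checked against Ensembl directly.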

score 0 · Answer 3 · 2017-09-04

fengshou.ma ▴ 10

It seems that the Ensembl IDs of all miRNAs are wrong.