org.Hs.eg.db error: same Ensembl ID for different genes
fengshou.ma ▴ 10
@fengshouma-13891
select(org.Hs.eg.db, keys = "MIR15A", keytype = "SYMBOL", columns = c("SYMBOL","GENENAME","ENSEMBL"))
'select()' returned 1:1 mapping between keys and columns
  SYMBOL     GENENAME         ENSEMBL
1 MIR15A microRNA 15a ENSG00000231607

select(org.Hs.eg.db, keys = "DLEU2", keytype = "SYMBOL", columns = c("SYMBOL","GENENAME","ENSEMBL"))
'select()' returned 1:1 mapping between keys and columns
  SYMBOL                                               GENENAME         ENSEMBL
1  DLEU2 deleted in lymphocytic leukemia 2 (non-protein coding) ENSG00000231607


But the Ensembl ID of MIR15A is ENSG00000283785, not ENSG00000231607.

R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936  LC_CTYPE=Chinese (Simplified)_China.936    LC_MONETARY=Chinese (Simplified)_China.936
[4] LC_NUMERIC=C                               LC_TIME=Chinese (Simplified)_China.936    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] org.Hs.eg.db_3.4.1    AnnotationDbi_1.38.2  IRanges_2.10.2        S4Vectors_0.14.3      Biobase_2.36.2        BiocGenerics_0.22.0  
[7] clusterProfiler_3.4.4 DOSE_3.2.0           

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12        compiler_3.4.0      plyr_1.8.4          tools_3.4.0         digest_0.6.12       bit_1.1-12          RSQLite_2.0        
 [8] memoise_1.1.0       tibble_1.3.3        gtable_0.2.0        pkgconfig_2.0.1     rlang_0.1.2         fastmatch_1.1-0     igraph_1.1.2       
[15] DBI_0.7             rvcheck_0.0.9       fgsea_1.2.1         gridExtra_2.2.1     stringr_1.2.0       bit64_0.9-7         grid_3.4.0         
[22] glue_1.1.1          qvalue_2.8.0        data.table_1.10.4   BiocParallel_1.10.1 GOSemSim_2.2.0      purrr_0.2.3         tidyr_0.7.0        
[29] GO.db_3.4.1         DO.db_2.9           ggplot2_2.2.1       reshape2_1.4.2      blob_1.1.0          magrittr_1.5        splines_3.4.0      
[36] scales_0.4.1        colorspace_1.3-2    stringi_1.1.5       lazyeval_0.2.0      munsell_0.4.3      

 

Johannes Rainer ★ 2.1k
@johannes-rainer-6987

If you're working with Ensembl annotations I would stick to annotation resources that were built on Ensembl provided data (such as biomaRt or ensembldb). This also avoids potential problems and multi-mappings between Ensembl and NCBI. AFAIK the .eg. packages are built using information from NCBI and some discrepancies might be explained by the mapping between the databases (NCBI <-> Ensembl).

cheers, jo

 

The mapping using ensembldb:

> library(AnnotationHub)
> ah <- AnnotationHub()
snapshotDate(): 2017-04-25
> query(ah, c("EnsDb", "Hsapiens"))
AnnotationHub with 2 records
# snapshotDate(): 2017-04-25
# $dataprovider: Ensembl
# $species: Homo Sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53211"]]'
            title                            
  AH53211 | Ensembl 87 EnsDb for Homo Sapiens
  AH53715 | Ensembl 88 EnsDb for Homo Sapiens
> edb <- ah[["AH53715"]]
loading from cache '/Users/jo//.AnnotationHub/60453'
> select(edb, keys = "MIR15A", keytype = "SYMBOL", columns = c("SYMBOL","DESCRIPTION","GENEID"))
  SYMBOL                                      DESCRIPTION          GENEID
1 MIR15A microRNA 15a [Source:HGNC Symbol;Acc:HGNC:31543] ENSG00000283785
@danielvantwisk-13028

The mapping is not incorrect, but it is based on older resources from NCBI. The most recent version of org.Hs.eg.db (3.4.1) was built from NCBI resources dated March 29th, 2017. We do not continuously rebuild our annotation resources, so that researchers who use them get reproducible results (the results from an annotation package do not change from day to day). If you are looking for more up-to-date annotation information, you can use a Bioconductor package that queries an online service directly. Below I've included two examples. The first shows the date on which the annotation resource behind an org package was built. The second shows how to obtain the most up-to-date annotation information using biomaRt, which queries Ensembl's BioMart.

Here, the attribute EGSOURCEDATE shows the date the annotation information was obtained from NCBI to build the package.

library(org.Hs.eg.db)
org.Hs.eg.db
#> OrgDb object:
#> | DBSCHEMAVERSION: 2.1
#> | Db type: OrgDb
#> | Supporting package: AnnotationDbi
#> | DBSCHEMA: HUMAN_DB
#> | ORGANISM: Homo sapiens
#> | SPECIES: Human
#> | EGSOURCEDATE: 2017-Mar29
#> | EGSOURCENAME: Entrez Gene
#> | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
#> | CENTRALID: EG
#> | TAXID: 9606
#> | GOSOURCENAME: Gene Ontology
#> | GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
#> | GOSOURCEDATE: 2017-Mar29
#> | GOEGSOURCEDATE: 2017-Mar29
#> | GOEGSOURCENAME: Entrez Gene
#> | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
#> | KEGGSOURCENAME: KEGG GENOME
#> | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
#> | KEGGSOURCEDATE: 2011-Mar15
#> | GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
#> | GPSOURCEURL: 
#> | GPSOURCEDATE: 2017-Mar17
#> | ENSOURCEDATE: 2017-Mar29
#> | ENSOURCENAME: Ensembl
#> | ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
#> | UPSOURCENAME: Uniprot
#> | UPSOURCEURL: http://www.UniProt.org/
#> | UPSOURCEDATE: Wed Apr  5 02:52:37 2017
#> 
#> Please see: help('select') for usage information
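
The same dates can be read programmatically instead of from the printed summary. The following is only a sketch: it assumes that `metadata()` on an OrgDb object returns a data frame with `name` and `value` columns, and `source_date()` is a hypothetical helper name. The small table is copied from the output above so the snippet runs without the package installed.

```r
# Hypothetical helper: look up one field in a name/value metadata table.
# Returns NA if the field is not present.
source_date <- function(meta, field = "EGSOURCEDATE") {
  hit <- meta$value[meta$name == field]
  if (length(hit) == 0) NA_character_ else hit[[1]]
}

# Stand-in for metadata(org.Hs.eg.db), using values from the printed summary.
meta <- data.frame(
  name  = c("EGSOURCEDATE", "ENSOURCEDATE", "KEGGSOURCEDATE"),
  value = c("2017-Mar29", "2017-Mar29", "2011-Mar15"),
  stringsAsFactors = FALSE
)

eg_date <- source_date(meta)                  # NCBI (Entrez Gene) snapshot date
absent  <- source_date(meta, "NOSUCHFIELD")   # unknown field gives NA

# With the package installed (assumption: metadata() has the shape above):
# library(org.Hs.eg.db)
# source_date(metadata(org.Hs.eg.db), "ENSOURCEDATE")
```

Recording these dates alongside an analysis makes it easier to explain later why a mapping differs from what the Ensembl or NCBI websites currently show.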

Here, we use biomaRt to obtain the most up-to-date annotation information.

library(biomaRt)
ensembl <- useMart("ensembl")
ensembl <- useDataset("hsapiens_gene_ensembl",mart=ensembl)
getBM(attributes=c("hgnc_symbol", "ensembl_gene_id"),
    filters=c('hgnc_symbol'),
    values= c('MIR15A','DLEU2'),
    mart=ensembl)
#>   hgnc_symbol ensembl_gene_id
#> 1       DLEU2 ENSG00000231607
#> 2      MIR15A ENSG00000283785
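
To spot which symbols differ between the packaged mapping and the live Ensembl one, the two result tables can be joined on the symbol. This is a minimal sketch in base R: `compare_mappings()` is a hypothetical helper, the column names `symbol` and `ensembl` are assumptions (rename the `select()` and `getBM()` columns to match), and the two small tables are typed in from the outputs shown in this thread.

```r
# Hypothetical helper: join two symbol-to-Ensembl tables and flag disagreements.
# Both inputs need a 'symbol' and an 'ensembl' column.
# Note: symbols with several Ensembl IDs in either table yield several rows.
compare_mappings <- function(orgdb_map, ensembl_map) {
  merged <- merge(orgdb_map, ensembl_map, by = "symbol",
                  suffixes = c("_orgdb", "_ensembl"))
  merged$agree <- merged$ensembl_orgdb == merged$ensembl_ensembl
  merged
}

# org.Hs.eg.db 3.4.1 result from the question.
orgdb_map <- data.frame(
  symbol  = c("MIR15A", "DLEU2"),
  ensembl = c("ENSG00000231607", "ENSG00000231607"),
  stringsAsFactors = FALSE
)

# biomaRt result from the answer above.
ensembl_map <- data.frame(
  symbol  = c("DLEU2", "MIR15A"),
  ensembl = c("ENSG00000231607", "ENSG00000283785"),
  stringsAsFactors = FALSE
)

cmp <- compare_mappings(orgdb_map, ensembl_map)
disagree <- cmp$symbol[!cmp$agree]   # symbols where the two sources differ
```

For these two genes only MIR15A ends up in `disagree`; DLEU2 maps to the same Ensembl ID in both sources.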

In addition, do note that the orgDb packages that we supply are based on mappings from Entrez Gene IDs to all other annotation sources, and if you are trying to map from NCBI IDs to EBI IDs you will always run into disagreements between the annotation groups. To wit:

> getBM(c("hgnc_symbol","entrezgene","ensembl_gene_id"), "hgnc_symbol", c("MIR15A","DLEU2"), mart)
  hgnc_symbol entrezgene ensembl_gene_id
1       DLEU2         NA ENSG00000231607
2      MIR15A     406948 ENSG00000283785

So EBI doesn't seem to recognize that there is an Entrez Gene ID for DLEU2. But miR15A and DLEU2 are overlapping genes (miR15A comes from an intron of DLEU2), so you can end up with positional mappings that may not make sense if you look at an individual mapping, but that may be programmatically convenient for a group (say NCBI or EBI) that is trying to do a cross-mapping when there isn't agreement between them.

NCBI may have cleaned this one up, but rest assured there are many others. So as Jo already noted, if you want Ensembl IDs, use EBI based annotation packages. If you want Entrez Gene IDs, use NCBI based annotation packages. And probably don't use gene symbols for much if possible.
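
To get a feel for how many such cases a mapping table contains, one can list the Ensembl IDs that are attached to more than one symbol. The sketch below uses base R only; `shared_ensembl_ids()` is a hypothetical helper, and it assumes a data frame with `SYMBOL` and `ENSEMBL` columns like the one `select()` returns. The example rows are the two from the original question.

```r
# Hypothetical helper: list Ensembl IDs that map to more than one symbol.
# 'map' is a data frame with SYMBOL and ENSEMBL columns.
shared_ensembl_ids <- function(map) {
  # Drop unmapped symbols and duplicate rows before grouping.
  map <- unique(map[!is.na(map$ENSEMBL), c("SYMBOL", "ENSEMBL")])
  by_id <- split(map$SYMBOL, map$ENSEMBL)
  by_id[lengths(by_id) > 1]
}

# The two rows returned by org.Hs.eg.db 3.4.1 in the question.
map <- data.frame(
  SYMBOL  = c("MIR15A", "DLEU2"),
  ENSEMBL = c("ENSG00000231607", "ENSG00000231607"),
  stringsAsFactors = FALSE
)

shared <- shared_ensembl_ids(map)

# On the full package (requires org.Hs.eg.db to be installed):
# library(org.Hs.eg.db)
# map <- select(org.Hs.eg.db, keys = keys(org.Hs.eg.db, "SYMBOL"),
#               keytype = "SYMBOL", columns = "ENSEMBL")
# length(shared_ensembl_ids(map))
```

Here `shared` has a single entry, ENSG00000231607, holding both MIR15A and DLEU2.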

fengshou.ma ▴ 10
@fengshouma-13891

It seems that the Ensembl IDs of all miRNAs are wrong.
