ensembldb and pseudogenes mapping to the same Ensembl ID
meeta.mistry ▴ 30
@meetamistry-7355
Last seen 2.5 years ago
United States

Hello,

I ran into a problem when mapping Ensembl genes to Entrez IDs and was wondering whether there is a way around it. For a list of Ensembl gene IDs, I used the select function to retrieve gene symbols and Entrez IDs.

common_genes <- select(EnsDb.Mmusculus.v79, keys=common, 
        columns=c("GENEID", "ENTREZID", "SYMBOL"), 
        keytype="GENEID")

Browsing through the table, I noticed duplicate matches (i.e. for a single Ensembl ID there are two Entrez IDs). I searched for these IDs in the Entrez database and found that the additional entries are pseudogenes that in fact have different gene symbols, but they are not reported that way in the output.

For example:

               GENEID  ENTREZID SYMBOL
72 ENSMUSG00000000740    270106  Rpl13
73 ENSMUSG00000000740 100040416  Rpl13

The second Entrez ID belongs to Rpl13-ps6, which maps to ENSMUSG00000059776, so the table reports this mapping incorrectly.
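
To find every affected gene rather than spotting them by eye, something like the following could be used. It is only a sketch in base R: the data frame is a hand-typed copy of the two rows shown above, standing in for the real select() result, and the names n_entrez, multi_mapped and ambiguous are made up for illustration.

```r
# Toy copy of the two rows from the select() output above.
# In practice common_genes would be the full result of select().
common_genes <- data.frame(
  GENEID   = c("ENSMUSG00000000740", "ENSMUSG00000000740"),
  ENTREZID = c(270106L, 100040416L),
  SYMBOL   = c("Rpl13", "Rpl13"),
  stringsAsFactors = FALSE
)

# Count the distinct Entrez IDs attached to each Ensembl gene ID.
n_entrez <- tapply(common_genes$ENTREZID, common_genes$GENEID,
                   function(x) length(unique(x)))

# Ensembl IDs that map to more than one Entrez ID.
multi_mapped <- names(n_entrez)[n_entrez > 1]

# Rows that need a closer look before being used downstream.
ambiguous <- common_genes[common_genes$GENEID %in% multi_mapped, ]
print(ambiguous)
```

This only flags the one-to-many mappings; it does not say which of the Entrez IDs is the pseudogene.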

Is there any way of identifying these pseudogenes using information stored in the database? Perhaps if Entrez gene symbols are stored, we could use those to filter out pseudogenes.

Any help on this would be much appreciated. Thanks in advance.

Meeta

ensembldb • 1.6k views
Johannes Rainer ★ 2.1k
@johannes-rainer-6987
Last seen 10 weeks ago
Italy

Dear Meeta,

Mapping between Entrez and Ensembl IDs is always problematic. EnsDb databases provide all the information from Ensembl for a specific release, and in release 79 (March 2015) this one gene was annotated to two Entrez identifiers. Unfortunately, EnsDb databases contain no additional information about Entrez genes (such as whether an Entrez gene is a pseudogene). For the mapping you could instead use the org.Mm.eg.db package, which uses annotations from NCBI:

> library(org.Mm.eg.db)
> select(org.Mm.eg.db, columns = c("ENTREZID", "SYMBOL", "ENSEMBL"), keys = "Rpl13", keytype = "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
  SYMBOL ENTREZID            ENSEMBL
1  Rpl13   270106 ENSMUSG00000000740
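
If you want to keep working from the EnsDb result, one possible approach is to retain only the Ensembl/Entrez pairs that the NCBI-based mapping also contains. The sketch below uses base R merge() on hand-typed copies of the rows from this thread; ens_map, org_map and confirmed are illustrative names, and in real use both tables would come from the corresponding select() calls. The Entrez IDs are converted to character on both sides as a precaution, on the assumption that the two packages may return them with different types.

```r
# Toy copy of the EnsDb (release 79) result from the question.
ens_map <- data.frame(
  GENEID   = c("ENSMUSG00000000740", "ENSMUSG00000000740"),
  ENTREZID = c(270106, 100040416),
  SYMBOL   = c("Rpl13", "Rpl13"),
  stringsAsFactors = FALSE
)

# Toy copy of the org.Mm.eg.db result shown above.
org_map <- data.frame(
  SYMBOL   = "Rpl13",
  ENTREZID = "270106",
  ENSEMBL  = "ENSMUSG00000000740",
  stringsAsFactors = FALSE
)

# Use the same type for the join key in both tables.
# format() avoids scientific notation for large numeric IDs.
ens_map$ENTREZID <- trimws(format(ens_map$ENTREZID, scientific = FALSE))
org_map$ENTREZID <- as.character(org_map$ENTREZID)

# Keep only Ensembl/Entrez pairs present in both sources.
confirmed <- merge(ens_map, org_map[, c("ENSEMBL", "ENTREZID")],
                   by.x = c("GENEID", "ENTREZID"),
                   by.y = c("ENSEMBL", "ENTREZID"))
print(confirmed)
```

Pairs missing from org.Mm.eg.db are dropped by this inner join, so genes that NCBI does not map at all would disappear too; use all.x = TRUE in merge() if you would rather keep and inspect them.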

Alternatively, use an EnsDb database for a more recent Ensembl release (the mapping appears to have been fixed there):

> library(AnnotationHub)
> edb <- query(AnnotationHub(), "EnsDb.Mmusculus.v90")[[1]]
snapshotDate(): 2017-10-27
loading from cache '/Users/jo//.AnnotationHub/64508'
> select(edb, columns = c("ENTREZID", "SYMBOL", "GENEID"), keys = "Rpl13", keytype = "SYMBOL")
  ENTREZID SYMBOL             GENEID
1   270106  Rpl13 ENSMUSG00000000740

Finally, my session info:

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin17.3.0/x86_64 (64-bit)
Running under: macOS High Sierra 10.13.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] ensembldb_2.2.0        AnnotationFilter_1.2.0 GenomicFeatures_1.30.0
 [4] GenomicRanges_1.30.1   GenomeInfoDb_1.14.0    AnnotationHub_2.10.1  
 [7] org.Mm.eg.db_3.5.0     AnnotationDbi_1.40.0   IRanges_2.12.0        
[10] S4Vectors_0.16.0       Biobase_2.38.0         BiocGenerics_0.24.0   
[13] BiocInstaller_1.28.0  

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.8.1    progress_1.1.2               
 [3] lattice_0.20-35               htmltools_0.3.6              
 [5] rtracklayer_1.38.2            yaml_2.1.16                  
 [7] interactiveDisplayBase_1.16.0 blob_1.1.0                   
 [9] XML_3.98-1.9                  rlang_0.1.6                  
[11] pillar_1.0.1                  DBI_0.7                      
[13] BiocParallel_1.12.0           bit64_0.9-7                  
[15] matrixStats_0.52.2            GenomeInfoDbData_1.0.0       
[17] ProtGenerics_1.10.0           stringr_1.2.0                
[19] zlibbioc_1.24.0               Biostrings_2.46.0            
[21] memoise_1.1.0                 biomaRt_2.34.1               
[23] httpuv_1.3.5                  curl_3.1                     
[25] Rcpp_0.12.14                  xtable_1.8-2                 
[27] DelayedArray_0.4.1            XVector_0.18.0               
[29] mime_0.5                      bit_1.1-12                   
[31] Rsamtools_1.30.0              RMySQL_0.10.13               
[33] digest_0.6.13                 stringi_1.1.6                
[35] shiny_1.0.5                   grid_3.4.3                   
[37] tools_3.4.3                   bitops_1.0-6                 
[39] magrittr_1.5                  lazyeval_0.2.1               
[41] RCurl_1.95-4.10               tibble_1.4.1                 
[43] RSQLite_2.0                   pkgconfig_2.0.1              
[45] Matrix_1.2-12                 prettyunits_1.0.2            
[47] assertthat_0.2.0              httr_1.3.1                   
[49] R6_2.2.2                      GenomicAlignments_1.14.1     
[51] compiler_3.4.3               

meeta.mistry ▴ 30
@meetamistry-7355
Last seen 2.5 years ago
United States

Hi Johannes,

Thank you for your quick reply! Both of those alternatives are good to know and very helpful since I use this package often for cross-database annotations.

Best,

Meeta
