Question

How to map all Ensembl IDs to Gene Symbols - Problem with AnnotationDbi

5

gokce.ouz ▴ 70

@gokceouz-11205

Last seen 8.2 years ago

Hi,

I am analysing my RNA-Seq data with DESeq2. At the end I would like to convert the Ensembl IDs of the significantly differentially expressed genes to gene symbols, and I am using AnnotationDbi for this. However, I realized that not all the Ensembl IDs are converted to gene symbols: 25072 out of 48607 are returned as NA, and more than 11000 of these IDs are actually significantly differentially expressed. To double check, I submitted the IDs that got NA for the gene symbol to BioMart, and it converted them to gene symbols (as seen in the figure). So now I am confused: am I doing something wrong? Or is there an alternative way to retrieve all gene symbols?

Thanks in advance,

Gokce

library("AnnotationDbi")
library("org.Hs.eg.db")
res<-results(dds,alpha=.05, contrast=c("Type", "Disease", "Control"))
res$symbol <- mapIds(org.Hs.eg.db,keys=row.names(res),column="SYMBOL", keytype="ENSEMBL", multiVals="first")

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.5 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                 
 [3] LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8       
 [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8      
 [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
 [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8     
[11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] edgeR_3.10.5              limma_3.24.15            
 [3] amap_0.8-14               sva_3.14.0               
 [5] mgcv_1.8-10               nlme_3.1-122             
 [7] doParallel_1.0.10         iterators_1.0.8          
 [9] foreach_1.4.3             reshape_0.8.5            
[11] cluster_2.0.3             matrixStats_0.50.1       
[13] flashClust_1.01-2         WGCNA_1.51               
[15] fastcluster_1.1.16        dynamicTreeCut_1.62      
[17] xlsx_0.5.7                xlsxjars_0.6.1           
[19] rJava_0.9-7               pheatmap_1.0.8           
[21] genefilter_1.50.0         gplots_2.17.0            
[23] RColorBrewer_1.1-2        vsn_3.36.0               
[25] org.Hs.eg.db_3.1.2        RSQLite_1.0.0            
[27] DBI_0.3.1                 DESeq2_1.8.2             
[29] RcppArmadillo_0.6.400.2.2 Rcpp_0.12.7              
[31] BiocParallel_1.2.22       GenomicAlignments_1.4.2  
[33] GenomicFeatures_1.20.6    AnnotationDbi_1.30.1     
[35] Biobase_2.28.0            Rsamtools_1.20.5         
[37] Biostrings_2.36.4         XVector_0.8.0            
[39] GenomicRanges_1.20.8      GenomeInfoDb_1.4.3       
[41] IRanges_2.2.9             S4Vectors_0.6.6          
[43] BiocGenerics_0.14.0       Hmisc_3.17-1             
[45] ggplot2_2.1.0             Formula_1.2-1            
[47] survival_2.38-3           lattice_0.20-33          
[49] BiocInstaller_1.20.3     

loaded via a namespace (and not attached):
 [1] splines_3.2.0         gtools_3.5.0          affy_1.46.1          
 [4] latticeExtra_0.6-26   impute_1.42.0         colorspace_1.2-6     
 [7] preprocessCore_1.30.0 Matrix_1.2-3          plyr_1.8.4           
[10] XML_3.98-1.3          biomaRt_2.24.1        zlibbioc_1.14.0      
[13] xtable_1.8-0          GO.db_3.1.2           scales_0.4.0         
[16] gdata_2.17.0          affyio_1.36.0         annotate_1.46.1      
[19] nnet_7.3-11           foreign_0.8-66        tools_3.2.0          
[22] munsell_0.4.3         locfit_1.5-9.1        lambda.r_1.1.7       
[25] caTools_1.17.1        futile.logger_1.4.1   grid_3.2.0           
[28] RCurl_1.95-4.7        bitops_1.0-6          gtable_0.2.0         
[31] codetools_0.2-14      gridExtra_2.0.0       rtracklayer_1.28.10  
[34] futile.options_1.0.0  KernSmooth_2.23-15    geneplotter_1.46.0   
[37] rpart_4.1-10          acepack_1.3-3.3

rnaseq annotationdbi • 40k views

ADD COMMENT • link updated 8.2 years ago by Valerie Obenchain ★ 6.8k • written 8.2 years ago by gokce.ouz ▴ 70

6

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 2.9 years ago

United States

Hi,

The OrgDb packages are a collection of data from many different sources: NCBI, UCSC, Ensembl, etc. The packages are Entrez gene centric in that we start with the list of Entrez gene ids from NCBI and annotate to that id. Data downloaded from Ensembl is matched to the Entrez gene id; if no mapping between the two exists, the Ensembl id doesn't end up in the OrgDb package.

Taking the first 3 from your list as an example,

ensemblGenes <- c("ENSG00000108958", "ENSG00000123009", "ENSG00000124399")
symbols <- c("AC016292.3", "NME2P1", "NDUFB4P12")

As you said, the OrgDb package doesn't have data for these Ensembl ids:

> select(org.Hs.eg.db,
+        key=ensemblGenes, columns=columns(org.Hs.eg.db),
+        keytype="ENSEMBL")
Error in .testForValidKeys(x, keys, keytype, fks) :
  None of the keys entered are valid keys for 'ENSEMBL'. Please use the keys method to see a listing of valid arguments.
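To see how many IDs are affected without running into this error, one option is to compare the IDs against the valid ENSEMBL keys first. This is only a minimal sketch, assuming org.Hs.eg.db is installed; the variable names validKeys and isMapped are illustrative and not part of any package:

```r
library(AnnotationDbi)
library(org.Hs.eg.db)

# The three example Ensembl gene IDs used in this answer
ensemblGenes <- c("ENSG00000108958", "ENSG00000123009", "ENSG00000124399")

# All Ensembl gene IDs that the OrgDb package knows about,
# i.e. those that could be matched to an Entrez gene id
validKeys <- keys(org.Hs.eg.db, keytype = "ENSEMBL")

# TRUE where the ID can be used as an ENSEMBL key, FALSE otherwise
isMapped <- ensemblGenes %in% validKeys
table(isMapped)

# IDs without an entry in the OrgDb package; mapIds() would return NA for them
ensemblGenes[!isMapped]
```

The counts depend on the installed version of org.Hs.eg.db, so they may differ from what is shown in this thread.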

Using the symbols instead, we see the OrgDb has data for some of these genes but no Ensembl -> Entrez id mapping:

> select(org.Hs.eg.db, key=symbols,
+        columns=c("ENTREZID", "ENSEMBL"),
+        keytype="SYMBOL")
'select()' returned 1:1 mapping between keys and columns
      SYMBOL ENTREZID ENSEMBL
1 AC016292.3     <NA>    <NA>
2     NME2P1   283458    <NA>
3  NDUFB4P12   402175    <NA>

biomaRt also shows no mapping between the two and is missing the symbol for the first gene:

library(biomaRt)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ENSEMBL_MART_ENSEMBL"))
> getBM(filters="ensembl_gene_id",
+       attributes=c("ensembl_gene_id", "entrezgene",
+                    "hgnc_symbol"),
+       values=ensemblGenes,
+       mart=mart)
  ensembl_gene_id entrezgene hgnc_symbol
1 ENSG00000108958         NA
2 ENSG00000123009         NA      NME2P1
3 ENSG00000124399         NA   NDUFB4P12

Using Jo's EnsDb.Hsapiens.v79, it looks like the Ensembl id is called GENEID so we use that as 'keytype'. It also confirms no mapping between Ensembl and Entrez but it does return a value for all the symbols.

> select(EnsDb.Hsapiens.v79, key=ensemblGenes,
+        columns=c("ENTREZID", "SYMBOL"),
+        keytype="GENEID")
           GENEID ENTREZID       SYMBOL
1 ENSG00000108958            AC016292.3
2 ENSG00000123009                NME2P1
3 ENSG00000124399          RP11-663P9.2

So for the task of mapping Ensembl ids to gene symbols it looks like Jo's ensembl package is the most comprehensive.

Valerie

ADD COMMENT • link 8.2 years ago Valerie Obenchain ★ 6.8k

0

I really appreciate your detailed answer, Valerie.

ADD REPLY • link 8.2 years ago gokce.ouz ▴ 70

Martin Morgan · Accepted Answer · 2016-09-25

7

Johannes Rainer ★ 2.1k

@johannes-rainer-6987

Last seen 8 weeks ago

Italy

You could also use ensembldb to do the mapping between Ensembl gene IDs and gene names (or symbols). You would also need one of the EnsDb packages providing the actual annotation (such as EnsDb.Hsapiens.v75 for genome release GRCh37 or EnsDb.Hsapiens.v79 for GRCh38). Check the ensembldb vignette for more information (http://www.bioconductor.org/packages/release/bioc/vignettes/ensembldb/inst/doc/ensembldb.html).

You could basically use the same AnnotationDbi call that you use, but provide the EnsDb object instead of org.Hs.eg.db.
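A minimal sketch of such a call, assuming the EnsDb.Hsapiens.v79 package is installed and using three IDs from this thread in place of row.names(res). Note that for EnsDb objects the Ensembl gene ID keytype is "GENEID" (as in Valerie's answer) rather than "ENSEMBL"; the variable name geneNames is only illustrative:

```r
library(AnnotationDbi)
library(ensembldb)
library(EnsDb.Hsapiens.v79)

# Example Ensembl gene IDs taken from this thread
ensemblGenes <- c("ENSG00000108958", "ENSG00000123009", "ENSG00000124399")

# Same mapIds() call as in the question, but with the EnsDb object
# and keytype "GENEID" instead of "ENSEMBL"
geneNames <- mapIds(EnsDb.Hsapiens.v79,
                    keys = ensemblGenes,
                    column = "SYMBOL",
                    keytype = "GENEID",
                    multiVals = "first")
geneNames

# With a DESeq2 results object `res` as in the question, the call would be:
# res$symbol <- mapIds(EnsDb.Hsapiens.v79, keys = row.names(res),
#                      column = "SYMBOL", keytype = "GENEID",
#                      multiVals = "first")
```

The EnsDb package should match the Ensembl release used to annotate the reads; IDs that do not exist in that release would presumably still come back as NA.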

Just one clarification: the gene names that are listed above in your table are not gene symbols. They are the names for the genes that are provided by Ensembl. For protein-coding genes, however, the gene names correspond to the HGNC symbols.

Hope this helps.

ADD COMMENT • link updated 6.2 years ago by Martin Morgan 25k • written 8.2 years ago by Johannes Rainer ★ 2.1k

0

Thanks a lot for the suggestion and clarification, Johannes. I will implement it in my analysis as soon as I solve my R version problem.

Actually, when I was running the code with org.Hs.eg.db, I was expecting to see all the corresponding HGNC symbols, but it did not return them, which surprised me. So when I now run it using EnsDb.Hsapiens.v79, will it return gene names or gene symbols?

Best regards,

Gokce

ADD REPLY • link 8.2 years ago gokce.ouz ▴ 70

1

EnsDb.Hsapiens.v79 will return the gene names (even if you specify "SYMBOL"). I decided to go for the gene name in all cases, as that is species-independent.

ADD REPLY • link 8.2 years ago Johannes Rainer ★ 2.1k

0

Here is the working link to the ensembldb vignette:

http://www.bioconductor.org/packages/release/bioc/vignettes/ensembldb/inst/doc/ensembldb.html

ADD REPLY • link 6.2 years ago Kamil Slowikowski ▴ 30