How to map all Ensembl IDs to Gene Symbols: Problem with AnnotationDbi
gokce.ouz ▴ 70
@gokceouz-11205
Last seen 8.2 years ago

Hi,

I am analysing my RNA-Seq data with DESeq2. At the end, I would like to convert the Ensembl IDs of the significant genes to gene symbols, and I am using AnnotationDbi for this. However, I realized that not all of the Ensembl IDs are converted: 25072 out of 48607 are returned as NA, and more than 11000 of these IDs are significantly differentially expressed. To double-check, I submitted the IDs that got "NA" for the gene symbol to BioMart, and it converted them to gene symbols (as seen in the figure). So now I am confused: am I doing something wrong? Or is there an alternative way to extract all gene symbols?

Thanks in advance,

Gokce 

library("DESeq2")  # provides results()
library("AnnotationDbi")
library("org.Hs.eg.db")
res <- results(dds, alpha = 0.05, contrast = c("Type", "Disease", "Control"))
res$symbol <- mapIds(org.Hs.eg.db, keys = row.names(res), column = "SYMBOL",
                     keytype = "ENSEMBL", multiVals = "first")

 

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.5 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                 
 [3] LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8       
 [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8      
 [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
 [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8     
[11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] edgeR_3.10.5              limma_3.24.15            
 [3] amap_0.8-14               sva_3.14.0               
 [5] mgcv_1.8-10               nlme_3.1-122             
 [7] doParallel_1.0.10         iterators_1.0.8          
 [9] foreach_1.4.3             reshape_0.8.5            
[11] cluster_2.0.3             matrixStats_0.50.1       
[13] flashClust_1.01-2         WGCNA_1.51               
[15] fastcluster_1.1.16        dynamicTreeCut_1.62      
[17] xlsx_0.5.7                xlsxjars_0.6.1           
[19] rJava_0.9-7               pheatmap_1.0.8           
[21] genefilter_1.50.0         gplots_2.17.0            
[23] RColorBrewer_1.1-2        vsn_3.36.0               
[25] org.Hs.eg.db_3.1.2        RSQLite_1.0.0            
[27] DBI_0.3.1                 DESeq2_1.8.2             
[29] RcppArmadillo_0.6.400.2.2 Rcpp_0.12.7              
[31] BiocParallel_1.2.22       GenomicAlignments_1.4.2  
[33] GenomicFeatures_1.20.6    AnnotationDbi_1.30.1     
[35] Biobase_2.28.0            Rsamtools_1.20.5         
[37] Biostrings_2.36.4         XVector_0.8.0            
[39] GenomicRanges_1.20.8      GenomeInfoDb_1.4.3       
[41] IRanges_2.2.9             S4Vectors_0.6.6          
[43] BiocGenerics_0.14.0       Hmisc_3.17-1             
[45] ggplot2_2.1.0             Formula_1.2-1            
[47] survival_2.38-3           lattice_0.20-33          
[49] BiocInstaller_1.20.3     

loaded via a namespace (and not attached):
 [1] splines_3.2.0         gtools_3.5.0          affy_1.46.1          
 [4] latticeExtra_0.6-26   impute_1.42.0         colorspace_1.2-6     
 [7] preprocessCore_1.30.0 Matrix_1.2-3          plyr_1.8.4           
[10] XML_3.98-1.3          biomaRt_2.24.1        zlibbioc_1.14.0      
[13] xtable_1.8-0          GO.db_3.1.2           scales_0.4.0         
[16] gdata_2.17.0          affyio_1.36.0         annotate_1.46.1      
[19] nnet_7.3-11           foreign_0.8-66        tools_3.2.0          
[22] munsell_0.4.3         locfit_1.5-9.1        lambda.r_1.1.7       
[25] caTools_1.17.1        futile.logger_1.4.1   grid_3.2.0           
[28] RCurl_1.95-4.7        bitops_1.0-6          gtable_0.2.0         
[31] codetools_0.2-14      gridExtra_2.0.0       rtracklayer_1.28.10  
[34] futile.options_1.0.0  KernSmooth_2.23-15    geneplotter_1.46.0   
[37] rpart_4.1-10          acepack_1.3-3.3      

 

rnaseq annotationdbi • 40k views
Johannes Rainer ★ 2.1k
@johannes-rainer-6987
Last seen 8 weeks ago
Italy

You could also use ensembldb to do the mapping between Ensembl gene IDs and gene names (or symbols). You would also need one of the EnsDb packages that provide the actual annotation (such as EnsDb.Hsapiens.v75 for genome release GRCh37 or EnsDb.Hsapiens.v79 for GRCh38). Check the ensembldb vignette for more information (http://www.bioconductor.org/packages/release/bioc/vignettes/ensembldb/inst/doc/ensembldb.html).

You could basically use the same AnnotationDbi call that you already use, but provide the EnsDb object instead of org.Hs.eg.db.
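
A minimal sketch of that substitution, assuming the ensembldb and EnsDb.Hsapiens.v79 packages are installed. It mirrors the mapIds() call from the question with the EnsDb object in place of org.Hs.eg.db. One difference to be aware of is that an EnsDb names its Ensembl gene ID key "GENEID" rather than "ENSEMBL". The three IDs are example values taken from elsewhere in this thread; in a real analysis the keys would be row.names(res).

```r
library(ensembldb)
library(EnsDb.Hsapiens.v79)

# Example Ensembl gene IDs; in practice this would be row.names(res)
ensemblGenes <- c("ENSG00000108958", "ENSG00000123009", "ENSG00000124399")

# Same mapIds() interface as for an OrgDb, but the Ensembl gene ID
# key type of an EnsDb is "GENEID" rather than "ENSEMBL"
geneNames <- mapIds(EnsDb.Hsapiens.v79,
                    keys = ensemblGenes,
                    column = "SYMBOL",
                    keytype = "GENEID",
                    multiVals = "first")
geneNames
```

The result is a character vector with one entry per key, so it can be assigned to a results column in the same way as in the question.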

Just one clarification: the gene names listed in your table above are not gene symbols but the names that Ensembl provides for each gene. For protein-coding genes, however, these gene names correspond to the HGNC symbols.
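
To see for which genes the Ensembl gene name can be expected to match an HGNC symbol, the gene biotype can be retrieved alongside the name. This is only a sketch, assuming EnsDb.Hsapiens.v79 is installed; the example IDs come from elsewhere in this thread, and the proteinCoding column is a name made up for illustration.

```r
library(ensembldb)
library(EnsDb.Hsapiens.v79)

# Example Ensembl gene IDs (illustrative)
ensemblGenes <- c("ENSG00000108958", "ENSG00000123009", "ENSG00000124399")

# Retrieve the Ensembl gene name together with the gene biotype
geneInfo <- select(EnsDb.Hsapiens.v79,
                   keys = ensemblGenes,
                   columns = c("GENENAME", "GENEBIOTYPE"),
                   keytype = "GENEID")

# Flag the genes whose name is expected to correspond to an HGNC symbol
geneInfo$proteinCoding <- geneInfo$GENEBIOTYPE == "protein_coding"
geneInfo
```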

Hope this helps.


Thanks a lot for the suggestion and clarification, Johannes. I will implement it in my analysis as soon as I solve my R version problem.

Actually, when I ran the code with org.Hs.eg.db, I expected to see all the corresponding HGNC symbols, but it did not return them, which surprised me. So when I run it with EnsDb.Hsapiens.v79, will it return gene names or gene symbols?

Best regards,

Gokce


EnsDb.Hsapiens.v79 will return the gene names (even if you specify "SYMBOL"). I decided to use the gene name in all cases, as it is species-independent.
 

@valerie-obenchain-4275
Last seen 2.9 years ago
United States

Hi,

The OrgDb packages are a collection of data from many different sources (NCBI, UCSC, Ensembl, etc.). The packages are Entrez gene centric: we start with the list of Entrez gene ids from NCBI and annotate to that id. Data downloaded from Ensembl is matched to the Entrez gene id; if no mapping between the two exists, the Ensembl id doesn't end up in the OrgDb package.

Taking the first 3 from your list as an example, 

ensemblGenes <- c("ENSG00000108958", "ENSG00000123009", "ENSG00000124399")
symbols <- c("AC016292.3", "NME2P1", "NDUFB4P12")

As you said, the OrgDb package doesn't have data for these Ensembl ids:

> select(org.Hs.eg.db,
+        keys=ensemblGenes, columns=columns(org.Hs.eg.db),
+        keytype="ENSEMBL")
Error in .testForValidKeys(x, keys, keytype, fks) :
  None of the keys entered are valid keys for 'ENSEMBL'. Please use the keys method to see a listing of valid arguments.
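
select() raises this error only when none of the supplied keys is valid. With a mixture of valid and invalid keys, the invalid ones simply come back as NA, which is what produced the NAs in the question. A quick way to check in advance which IDs the OrgDb knows about is to compare them against keys(). This is a sketch assuming org.Hs.eg.db is installed; knownIds and isKnown are names chosen for illustration.

```r
library(AnnotationDbi)
library(org.Hs.eg.db)

ensemblGenes <- c("ENSG00000108958", "ENSG00000123009", "ENSG00000124399")

# All Ensembl gene IDs that have an Entrez gene mapping in this OrgDb
knownIds <- keys(org.Hs.eg.db, keytype = "ENSEMBL")

# TRUE where the OrgDb can map the ID, FALSE where mapIds() would return NA
isKnown <- ensemblGenes %in% knownIds
table(isKnown)
```

Applied to row.names(res), the same comparison shows how many IDs would be left unmapped before any mapping call is made.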

Using the symbols instead, we see the OrgDb has data for some of these genes but no Ensembl -> Entrez id mapping:

> select(org.Hs.eg.db, keys=symbols,
+        columns=c("ENTREZID", "ENSEMBL"),
+        keytype="SYMBOL")
'select()' returned 1:1 mapping between keys and columns
      SYMBOL ENTREZID ENSEMBL
1 AC016292.3     <NA>    <NA>
2     NME2P1   283458    <NA>
3  NDUFB4P12   402175    <NA>


biomaRt also shows no mapping between the two and is missing the symbol for the first gene:

library(biomaRt)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ENSEMBL_MART_ENSEMBL"))

> getBM(filters="ensembl_gene_id",
+       attributes=c("ensembl_gene_id", "entrezgene",
+                    "hgnc_symbol"),
+       values=ensemblGenes,
+       mart=mart)
  ensembl_gene_id entrezgene hgnc_symbol
1 ENSG00000108958         NA           
2 ENSG00000123009         NA      NME2P1
3 ENSG00000124399         NA   NDUFB4P12
                                                                                   

Using Jo's EnsDb.Hsapiens.v79, it looks like the Ensembl id is called GENEID so we use that as 'keytype'. It also confirms no mapping between Ensembl and Entrez but it does return a value for all the symbols.

> select(EnsDb.Hsapiens.v79, keys=ensemblGenes,
+        columns=c("ENTREZID", "SYMBOL"),
+        keytype="GENEID")
           GENEID ENTREZID       SYMBOL
1 ENSG00000108958            AC016292.3
2 ENSG00000123009                NME2P1
3 ENSG00000124399          RP11-663P9.2

So for the task of mapping Ensembl ids to gene symbols it looks like Jo's ensembl package is the most comprehensive. 
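
If HGNC symbols are preferred wherever they exist, one option is to use them first and fill the gaps with the Ensembl gene names. This is not something proposed in the thread, just a base R sketch: fill_missing() is a hypothetical helper, and the two input vectors are copied from the biomaRt and EnsDb outputs above.

```r
# Hypothetical helper: replace NA or empty entries in `primary`
# with the entry at the same position in `fallback`
fill_missing <- function(primary, fallback) {
  stopifnot(length(primary) == length(fallback))
  missing <- is.na(primary) | primary == ""
  primary[missing] <- fallback[missing]
  primary
}

# Values copied from the outputs shown above
hgncSymbol  <- c("", "NME2P1", "NDUFB4P12")                 # biomaRt hgnc_symbol
ensemblName <- c("AC016292.3", "NME2P1", "RP11-663P9.2")    # EnsDb SYMBOL

geneLabels <- fill_missing(hgncSymbol, ensemblName)
geneLabels
# "AC016292.3" "NME2P1" "NDUFB4P12"
```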

Valerie
 


I really appreciate your detailed answer, Valerie.
