Question

biomaRt: Ensembl -> Entrez conversion error?

EdgeR_A_Poe ▴ 10

@edger_a_poe-18527

Prior to doing GO term enrichment analysis, I've been trying to convert the Ensembl IDs of differentially expressed genes to Entrez IDs using the R package biomaRt. I've done some spot checking and it looks like everything is getting converted correctly. However, I've run into two particular Ensembl IDs that are getting incorrectly converted. The Ensembl IDs in question are: ENSMUSG00000071633.11 and ENSMUSG00000071646.9 which are being converted to 208285 and 72465, respectively. After comparing searches on Ensembl and NCBI's database, there doesn't seem to be a relationship between the Ensembl and Entrez IDs that I'm getting. This makes me suspicious that there might be other errors lurking in my dataset. Has anyone else run into this issue before and if so, how did you resolve it? Am I missing something obvious here?

R script used to convert Ensembl IDs to Entrez IDs:

library(biomaRt)

## Read the table of DE genes; the Ensembl IDs are in column X
DEListTangerineRed <- read.csv("DEgenesTangerineRed.csv", stringsAsFactors = FALSE)
TangerineRedGenes <- DEListTangerineRed$X

## Use mouse genes mart
mart <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")

## Query BioMart, returning the versioned Ensembl ID alongside the Entrez ID
ConvertedTangerineRedGenes <- getBM(filters = "ensembl_gene_id_version",
                                    attributes = c("ensembl_gene_id_version", "entrezgene", "description"),
                                    values = TangerineRedGenes,
                                    mart = mart)

write.csv(ConvertedTangerineRedGenes, file = "ConvertedTangerineRedGenes.csv")

Session Info:

R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /opt/intel/compilers_and_libraries_2016.3.210/linux/mkl/lib/intel64_lin/libmkl_intel_lp64.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
 [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
 [1] biomaRt_2.38.0 org.Mm.eg.db_3.7.0 topGO_2.34.0 SparseM_1.77 GO.db_3.7.0
 [6] AnnotationDbi_1.44.0 graph_1.60.0 DESeq2_1.22.1 SummarizedExperiment_1.12.0 DelayedArray_0.8.0
[11] BiocParallel_1.16.0 matrixStats_0.54.0 Biobase_2.42.0 GenomicRanges_1.34.0 GenomeInfoDb_1.18.1
[16] IRanges_2.16.0 S4Vectors_0.20.0 BiocGenerics_0.28.0 gplots_3.0.1 reshape2_1.4.3
[21] RColorBrewer_1.1-2 Rsubread_1.30.9 Glimma_1.10.0 edgeR_3.22.5 limma_3.38.2

loaded via a namespace (and not attached):
 [1] bitops_1.0-6 bit64_0.9-7 httr_1.3.1 progress_1.2.0 tools_3.5.0 backports_1.1.2
 [7] R6_2.2.2 rpart_4.1-13 KernSmooth_2.23-15 Hmisc_4.1-1 DBI_1.0.0 lazyeval_0.2.1
[13] colorspace_1.3-2 nnet_7.3-12 prettyunits_1.0.2 gridExtra_2.3 curl_3.2 bit_1.1-14
[19] compiler_3.5.0 htmlTable_1.12 caTools_1.17.1.1 scales_0.5.0 checkmate_1.8.5 genefilter_1.64.0
[25] stringr_1.3.1 digest_0.6.15 foreign_0.8-70 XVector_0.22.0 base64enc_0.1-3 pkgconfig_2.0.1
[31] htmltools_0.3.6 htmlwidgets_1.3 rlang_0.3.0.1 rstudioapi_0.7 RSQLite_2.1.1 bindr_0.1.1
[37] jsonlite_1.5 gtools_3.8.1 acepack_1.4.1 dplyr_0.7.4 RCurl_1.95-4.11 magrittr_1.5
[43] GenomeInfoDbData_1.2.0 Formula_1.2-3 Matrix_1.2-14 Rcpp_0.12.16 munsell_0.4.3 yaml_2.2.0
[49] stringi_1.2.2 zlibbioc_1.28.0 plyr_1.8.4 grid_3.5.0 blob_1.1.1 gdata_2.18.0
[55] crayon_1.3.4 lattice_0.20-35 splines_3.5.0 annotate_1.60.0 hms_0.4.2 locfit_1.5-9.1
[61] knitr_1.20 pillar_1.2.2 geneplotter_1.60.0 XML_3.98-1.16 glue_1.2.0 latticeExtra_0.6-28
[67] data.table_1.11.8 gtable_0.2.0 assertthat_0.2.0 ggplot2_3.1.0 xtable_1.8-2 survival_2.41-3
[73] tibble_1.4.2 memoise_1.1.0 bindrcpp_0.2.2 cluster_2.0.7-1

biomart entrez gene identifiers ensembl conversion • 3.0k views

ADD COMMENT • link updated 6.4 years ago by Mike Smith ★ 6.6k • written 6.4 years ago by EdgeR_A_Poe ▴ 10

score 1 · Answer 1 · 2018-11-27

James W. MacDonald 68k

@james-w-macdonald-5106

United States

> getBM(c("entrezgene","ensembl_gene_id"), "ensembl_gene_id", c("ENSMUSG00000071633","ENSMUSG00000071646"), mart)
  entrezgene    ensembl_gene_id
1     240549 ENSMUSG00000071633
2      23942 ENSMUSG00000071646

Are you returning the Ensembl gene ID in your output? If not, do note that the output from biomaRt (as with all databases, so far as I know) is unordered, so you cannot expect to get things back in the same order as your query.

ADD COMMENT • link 6.4 years ago James W. MacDonald 68k

Hi James,

You're correct about the IDs being unordered. I was re-ordering them in Excel so that the DE information (F value, P-value, etc.) was correctly lined up. What I discovered is that in some cases, instead of returning an "NA" for Ensembl IDs it doesn't recognize, it will occasionally just drop entries from the list altogether! As a result, the list of IDs from my DE analysis and the list of converted IDs had different lengths, and I basically had a frame shift in the compiled dataset. This seems like a really weird error on the package's part as far as I can tell.

ADD REPLY • link 6.4 years ago EdgeR_A_Poe ▴ 10

score 0 · Answer 2 · 2018-11-27

Mike Smith ★ 6.6k

@mike-smith

EMBL Heidelberg

Thanks for the example code, but I don't see this behaviour when I try to run your query. I think the example below is doing the same thing, but only using the two Ensembl IDs you mention, and I get back different Entrez IDs to you.

library(biomaRt)

## Use mouse genes mart
mart <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")

## Ensembl IDs of interest
TangerineRedGenes <- c("ENSMUSG00000071633.11", 
                       "ENSMUSG00000071646.9")

## Run biomaRt query
ConvertedGenes <- getBM(filters = "ensembl_gene_id_version", 
                        attributes = c("ensembl_gene_id_version", "entrezgene", "description"),
                        values = TangerineRedGenes, 
                        mart = mart)

> ConvertedGenes
  ensembl_gene_id_version entrezgene
1   ENSMUSG00000071633.11     240549
2    ENSMUSG00000071646.9      23942
                                                                      description
1                         predicted gene 4952 [Source:MGI Symbol;Acc:MGI:3643569]
2 metastasis-associated gene family, member 2 [Source:MGI Symbol;Acc:MGI:1346340]

Perhaps there's something else going on with the full set of IDs in DEgenesTangerineRed.csv? Maybe you can share that file? Feel free to email it to me if you don't want to make it public.

ADD COMMENT • link 6.4 years ago Mike Smith ★ 6.6k

Hi Mike,

Thanks for your response. Any insight about biomaRt dropping lines containing Ensembl IDs it doesn't recognize? I double checked to make sure the list going in was the correct length but the output is missing several lines. This is what caused a frame shift in my dataset and messed up the alignment of many Ensembl IDs with their Entrez IDs.

ADD REPLY • link 6.4 years ago EdgeR_A_Poe ▴ 10

I'm afraid it's a property of BioMart - the database system used to serve up the Ensembl data. It's not something I can fix in the R package; if you run the same query in the Ensembl BioMart web interface it will silently drop empty results there too.

As James mentioned, the safest thing to do is make sure your list of attributes includes the property you're using to search with, which at least lets you check if any entries are missing or there are duplicates. It's quite common to have one-to-many matches e.g. genes to transcripts, where you can't tell a priori how many rows you expect in the results.
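A minimal sketch of that check, assuming the filter column was included in the attributes. To keep it runnable offline, a small hand-made data frame stands in for a real getBM() result; `ENSMUSG_PLACEHOLDER` and the Entrez ID 999999 are invented for illustration and are not real identifiers.

```r
# Query IDs; the third is a made-up placeholder standing in for an ID
# that BioMart does not recognise.
query_ids <- c("ENSMUSG00000071633", "ENSMUSG00000071646", "ENSMUSG_PLACEHOLDER")

# Hand-made stand-in for a getBM() result that includes the filter column.
# The first two Entrez IDs are the ones reported in this thread; 999999 is
# invented to illustrate a one-to-many match.
result <- data.frame(
  ensembl_gene_id = c("ENSMUSG00000071646", "ENSMUSG00000071633", "ENSMUSG00000071633"),
  entrezgene      = c(23942, 240549, 999999),
  stringsAsFactors = FALSE
)

# IDs that were queried but are absent from the result (silently dropped)
missing_ids <- setdiff(query_ids, result$ensembl_gene_id)

# IDs that came back on more than one row (one-to-many matches)
duplicated_ids <- unique(result$ensembl_gene_id[duplicated(result$ensembl_gene_id)])

print(missing_ids)
print(duplicated_ids)
```

If either vector is non-empty, the result cannot be lined up row by row with the original query.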

I also find it helpful to use functions like dplyr::left_join() to combine two data frames that share a common column, as that keeps any row that doesn't match and simply inserts NA, so you don't end up with the frame shift effect.
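As a rough illustration of the join idea, here is a sketch that uses base R's merge() with all.x = TRUE as a dependency-free stand-in for dplyr::left_join(). The data frames are hand-made: the logFC values and `ENSMUSG_PLACEHOLDER` are invented, and `converted` only imitates what a getBM() result might look like.

```r
# Stand-in for the DE table; logFC values are invented for illustration.
de_table <- data.frame(
  ensembl_gene_id = c("ENSMUSG00000071633", "ENSMUSG_PLACEHOLDER", "ENSMUSG00000071646"),
  logFC           = c(1.5, -2.0, 0.8),
  stringsAsFactors = FALSE
)

# Stand-in for a getBM() result: rows in a different order, and the
# placeholder ID is missing entirely.
converted <- data.frame(
  ensembl_gene_id = c("ENSMUSG00000071646", "ENSMUSG00000071633"),
  entrezgene      = c(23942, 240549),
  stringsAsFactors = FALSE
)

# Keep every row of de_table; IDs with no match get NA in entrezgene.
# Similar in spirit to:
#   dplyr::left_join(de_table, converted, by = "ensembl_gene_id")
joined <- merge(de_table, converted, by = "ensembl_gene_id", all.x = TRUE)

print(joined)
```

Unlike left_join(), merge() sorts the result by the key column by default, so the row order may differ from `de_table`; the pairing of each Ensembl ID with its Entrez ID is preserved either way.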

One final thing I'd recommend is to avoid using Ensembl IDs with the version attached. To get a match in Ensembl you have to supply the exact version currently in use; anything else will be dropped, and there's always a chance the version changed very recently. IMHO it's better to use the more general stem of the ID and worry about versions only when you're really digging into whether your list of top hits makes biological sense. There's a chance that your failed ID conversions will succeed if you use the IDs without versions, e.g. ENSMUSG00000071633 rather than ENSMUSG00000071633.11, together with the filter ensembl_gene_id rather than ensembl_gene_id_version.
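One possible way to strip the version suffix in base R, sketched on the two IDs from this thread. It assumes the version is always a trailing dot followed by digits.

```r
versioned_ids <- c("ENSMUSG00000071633.11", "ENSMUSG00000071646.9")

# Remove a trailing ".<digits>" version suffix, leaving the stable ID
stable_ids <- sub("\\.[0-9]+$", "", versioned_ids)

print(stable_ids)
```

The stripped IDs would then be passed as `values` to getBM() with `filters = "ensembl_gene_id"`.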

ADD REPLY • link 6.4 years ago Mike Smith ★ 6.6k

Thanks again, Mike,

Yeah, I think dplyr::left_join() would have been helpful here. And you're correct - removing the version number from those failed genes and re-running biomaRt with the filter ensembl_gene_id worked.

ADD REPLY • link 6.4 years ago EdgeR_A_Poe ▴ 10