I am feeding biomaRt a list of Ensembl IDs (object: `ensemblIDs`) from an RNA-Seq experiment.
I then run the following function calls:
    ensembl <- useMart('ensembl', dataset = 'hsapiens_gene_ensembl')
    symbols.a <- getBM(attributes = c('ensembl_gene_id', 'ensembl_transcript_id',
                                      'hgnc_symbol', 'external_gene_name',
                                      'gene_biotype', 'description',
                                      'name_1006', 'definition_1006'),
                       filters = 'ensembl_gene_id',
                       values = ensemblIDs,
                       mart = ensembl)
After matching, I get back a set of results, but although I am feeding the function Ensembl IDs, the resulting data.frame contains a large number of NAs. A few examples:
ENSG00000139131, ENSG00000167157, ENSG00000149547
These all have Ensembl gene web entries, so it seems odd that they aren't being matched.
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8
 [6] LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
 [1] doParallel_1.0.10 RSQLite_2.0
 [3] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.0 GenomicFeatures_1.28.4
 [5] Rtsne_0.13 plyr_1.8.4
 [7] pheatmap_1.0.8 NMF_0.20.6
 [9] cluster_2.0.6 rngtools_1.2.4
[11] pkgmaker_0.22 registry_0.3
[13] minfi_1.22.1 bumphunter_1.16.0
[15] locfit_1.5-9.1 iterators_1.0.8
[17] Biostrings_2.44.1 XVector_0.16.0
[19] limma_3.32.2 igraph_1.0.1
[21] hugene11sttranscriptcluster.db_8.6.0 hugene10sttranscriptcluster.db_8.6.0
[23] hthgu133a.db_3.2.3 hgug4112a.db_3.2.3
[25] hgu95av2.db_3.2.3 hgu133plus2.db_3.2.3
[27] hgu133b.db_3.2.3 hgu133a2.db_3.2.3
[29] hgu133a.db_3.2.3 org.Hs.eg.db_3.4.1
[31] gplots_3.0.1 GEOquery_2.42.0
[33] genefilter_1.58.1 foreach_1.4.3
[35] DESeq2_1.16.1 SummarizedExperiment_1.6.3
[37] DelayedArray_0.2.7 matrixStats_0.52.2
[39] GenomicRanges_1.28.3 GenomeInfoDb_1.12.2
[41] biomaRt_2.32.1 beadarray_2.26.1
[43] ggplot2_2.2.1 annotate_1.54.0
[45] XML_3.98-1.9 AnnotationDbi_1.38.1
[47] IRanges_2.10.2 S4Vectors_0.14.3
[49] affy_1.54.0 Biobase_2.36.2
[51] BiocGenerics_0.22.0

loaded via a namespace (and not attached):
 [1] colorspace_1.3-2 siggenes_1.50.0 mclust_5.3 htmlTable_1.9 base64enc_0.1-3 base64_2.0
 [7] affyio_1.46.0 bit64_0.9-7 codetools_0.2-15 splines_3.4.0 geneplotter_1.54.0 knitr_1.16
[13] Formula_1.2-2 Rsamtools_1.28.0 gridBase_0.4-7 compiler_3.4.0 httr_1.2.1 backports_1.1.0
[19] Matrix_1.2-10 lazyeval_0.2.0 BeadDataPackR_1.28.0 acepack_1.4.1 htmltools_0.3.6 tools_3.4.0
[25] gtable_0.2.0 GenomeInfoDbData_0.99.0 reshape2_1.4.2 doRNG_1.6.6 Rcpp_0.12.11 multtest_2.32.0
[31] nlme_3.1-131 gdata_2.18.0 preprocessCore_1.38.1 rtracklayer_1.36.3 stringr_1.2.0 gtools_3.5.0
[37] beanplot_1.2 MASS_7.3-47 zlibbioc_1.22.0 scales_0.4.1 BiocInstaller_1.26.0 RColorBrewer_1.1-2
[43] memoise_1.1.0 gridExtra_2.2.1 rpart_4.1-11 reshape_0.8.6 latticeExtra_0.6-28 stringi_1.1.5
[49] checkmate_1.8.3 caTools_1.17.1 BiocParallel_1.10.1 rlang_0.1.1 pkgconfig_2.0.1 bitops_1.0-6
[55] nor1mix_1.2-2 lattice_0.20-35 GenomicAlignments_1.12.1 htmlwidgets_0.9 bit_1.1-12 magrittr_1.5
[61] R6_2.2.2 Hmisc_4.0-3 DBI_0.7 foreign_0.8-69 survival_2.41-3 RCurl_1.95-4.8
[67] nnet_7.3-12 tibble_1.3.3 KernSmooth_2.23-15 grid_3.4.0 data.table_1.10.4 blob_1.1.0
[73] digest_0.6.12 xtable_1.8-2 illuminaio_0.18.0 openssl_0.9.6 munsell_0.4.3 quadprog_1.5-5
Can you provide the output from
sessionInfo()
so we can see which version of biomaRt you're using?

Edited above to add. Just to add:
Of the original list I fed the function, around 20% weren't matched. I subset the unmatched Ensembl IDs (around 5000) and ran just those through the function again. Most were matched this time, but again around 20% weren't. I repeated the process but wasn't able to recover the remaining missing genes.
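The re-query loop described above could be made systematic with something like the sketch below. It is only an illustration, not a diagnosis of the cause: `query_in_batches`, `query_fun`, `batch_size` and `mock_query` are hypothetical names that are not part of biomaRt, and the mock stands in for the real query so the example runs offline. Which IDs the mock "knows" is arbitrary and says nothing about what Ensembl actually returns.

```r
# Hypothetical helper: split the IDs into batches, query each batch, and
# report which IDs never appear in the combined result. `query_fun` is any
# function that takes a character vector of IDs and returns a data.frame
# with an 'ensembl_gene_id' column.
query_in_batches <- function(ids, query_fun, batch_size = 500) {
  ids <- unique(as.character(ids))
  # Assign each ID to a batch of at most `batch_size` elements
  batches <- split(ids, ceiling(seq_along(ids) / batch_size))
  results <- lapply(batches, query_fun)
  combined <- do.call(rbind, unname(results))
  list(
    results = combined,
    missing = setdiff(ids, combined$ensembl_gene_id)
  )
}

# Offline stand-in for the real query (assumption: made up for this sketch).
# It returns rows for only two of the three example IDs.
mock_query <- function(ids) {
  known <- c("ENSG00000139131", "ENSG00000167157")
  data.frame(ensembl_gene_id = intersect(ids, known),
             stringsAsFactors = FALSE)
}

out <- query_in_batches(
  c("ENSG00000139131", "ENSG00000167157", "ENSG00000149547"),
  mock_query,
  batch_size = 2
)
out$missing  # IDs that no batch returned

# With a live mart, `query_fun` could wrap the original call, e.g.:
# real_query <- function(ids) {
#   getBM(attributes = c('ensembl_gene_id', 'hgnc_symbol'),
#         filters = 'ensembl_gene_id', values = ids, mart = ensembl)
# }
```

If the set in `out$missing` changes between runs on the same input, that would point towards the query itself rather than the IDs, but that is only a guess until it is tested against the real mart.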