Hi,
I have a vector with Ensembl gene IDs
> head(gene.id)
[1] "ENSG00000223972" "ENSG00000227232" "ENSG00000278267" "ENSG00000243485"
[5] "ENSG00000274890" "ENSG00000237613"
I am trying to annotate the IDs using biomaRt
> library(biomaRt)
> ensembl <- useMart("ENSEMBL_MART_ENSEMBL",
dataset = "hsapiens_gene_ensembl",
host = "www.ensembl.org")
However, I get an error. The curious thing is that I get this error only when my vector has length 993 or longer, never when it is shorter, using a random selection of IDs. So this always works:
> mat.cpm.annot <- biomaRt::getBM(attributes = c("ensembl_gene_id", "hgnc_id", "hgnc_symbol", "description"), filter = "ensembl_gene_id", sample(gene.id, 992), mart = ensembl, uniqueRows = TRUE)
And this gives me an error:
> mat.cpm.annot <- biomaRt::getBM(attributes = c("ensembl_gene_id", "hgnc_id", "hgnc_symbol", "description"), filter = "ensembl_gene_id", sample(gene.id, 993), mart = ensembl, uniqueRows = TRUE)
Error in biomaRt::getBM(attributes = c("ensembl_gene_id", "hgnc_id", "hgnc_symbol", :
Query ERROR: caught BioMart::Exception: non-BioMart die():
not well-formed (invalid token) at line 1, column 16292, byte 16292 at /nfs/public/release/ensweb-software/sharedsw/2017_04_03/linuxbrew/Cellar/perl/5.24.1/lib/perl5/site_perl/5.24.1/x86_64-linux-thread-multi/XML/Parser.pm line 187.
XML::Simple called at /nfs/public/release/ensweb/latest/live/mart/www_90/biomart-perl/lib/BioMart/Query.pm line 1935.
> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.5 (Final)
Matrix products: default
BLAS: /share/apps/cto/packages/R/3.4.2/lib64/R/lib/libRblas.so
LAPACK: /share/apps/cto/packages/R/3.4.2/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.32.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.13 IRanges_2.10.5 XML_3.98-1.9
[4] digest_0.6.12 bitops_1.0-6 DBI_0.7
[7] stats4_3.4.2 RSQLite_2.0 rlang_0.1.2
[10] blob_1.1.0 S4Vectors_0.14.7 tools_3.4.2
[13] bit64_0.9-7 Biobase_2.36.2 RCurl_1.95-4.8
[16] bit_1.1-12 parallel_3.4.2 compiler_3.4.2
[19] BiocGenerics_0.22.1 AnnotationDbi_1.38.2 memoise_1.1.0
[22] tibble_1.3.4
Any idea what is going on?
Cheers,
Georg
Those answers are still valid, but I want to add that I don't experience the problem you're seeing, so maybe it has already been fixed on the Ensembl side.
Thanks a lot. I tried both suggested solutions. With the mirror service I got the same error. Installing and using the devel package, however, made the error go away. Just to clarify: the recommendation not to run queries with more than 500 search values relates to the devel package, not the release package, right? I have routinely used biomaRt to run queries with thousands of search values.
The 500-value limit has always applied to queries sent to BioMart, either via biomaRt or using the Ensembl web interface. For the most part you can submit more than 500 filter values and it will be fine, but if there is a problem you won't know anything about it: it happens silently.
This is obviously really undesirable, hence the patch. I only committed this to the devel branch in case it broke some other functionality, but no one has reported anything, and it's now part of the new release branch that was released this week.
If you are submitting queries with thousands of gene IDs or the like you should definitely be using biomaRt version 2.33.1 or newer just to be on the safe side.
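If you are stuck on an older biomaRt version, one possible workaround is to split the filter values into batches of at most 500 yourself and combine the results. The following is only a minimal sketch of that idea, not a description of what biomaRt does internally. The helper names `chunk_values`, `query_in_chunks` and `fake_query` are made up for illustration, and `fake_query` is a stand-in so the example runs without a BioMart connection.

```r
# Split a vector into consecutive chunks of at most `chunk_size` values.
chunk_values <- function(values, chunk_size = 500) {
  if (length(values) == 0) return(list())
  unname(split(values, ceiling(seq_along(values) / chunk_size)))
}

# Run `query_fun` on each chunk and row-bind the resulting data frames.
# `query_fun` is a hypothetical wrapper: it takes a vector of IDs and
# returns a data frame (for example, a call to biomaRt::getBM()).
query_in_chunks <- function(values, query_fun, chunk_size = 500) {
  results <- lapply(chunk_values(values, chunk_size), query_fun)
  unique(do.call(rbind, results))
}

# Offline check with a stand-in query function (no BioMart needed).
fake_query <- function(ids) {
  data.frame(ensembl_gene_id = ids, stringsAsFactors = FALSE)
}
ids <- sprintf("ENSG%011d", 1:993)  # made-up IDs, for illustration only
res <- query_in_chunks(ids, fake_query)

# With biomaRt, assuming `ensembl` and `gene.id` as defined in the question:
# annot <- query_in_chunks(gene.id, function(ids) {
#   biomaRt::getBM(attributes = c("ensembl_gene_id", "hgnc_id",
#                                 "hgnc_symbol", "description"),
#                  filters = "ensembl_gene_id", values = ids, mart = ensembl)
# })
```

Comparing the IDs you sent with the `ensembl_gene_id` column of the result is a cheap way to spot values that came back with nothing, although retired IDs may legitimately return no rows.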