Question

The query to the BioMart webservice returned an invalid result: biomaRt expected a character string of length 1


steppydeklin ▴ 10


Hello,

I am trying to download all the annotations from Ensembl using biomaRt, using the following code:

library(biomaRt)

ensembl = useMart(biomart="ENSEMBL_MART_ENSEMBL", dataset = "mmusculus_gene_ensembl")

inputNames = read.table("/Volumes/project_svincent/raw_data/Ensembl_95_mm10_GRCm38p6.tsv",header=TRUE,sep="\t", fill=TRUE,quote="\"",stringsAsFactors = FALSE)$Gene.stable.ID

attributes_protein_coding = c("ensembl_gene_id",
           "ucsc",
           "external_gene_name",
           "chromosome_name",
           "strand",
           "start_position",
           "end_position",
           "ensembl_transcript_id",
           "transcription_start_site",
           "ensembl_exon_id",
           "refseq_mrna",
           "refseq_mrna_predicted")

ensembl_protein_coding = dplyr::tbl_df(getBM(attributes = attributes_protein_coding, filters = "ensembl_gene_id", values = inputNames, mart = ensembl))

I get the following error message. Does anyone have an idea what is wrong?

> ensembl_protein_coding = dplyr::tbl_df(getBM(attributes = attributes_protein_coding, filters = "ensembl_gene_id", values = inputNames, mart = ensembl))
Batch submitting query [==========>--------------------------------------------------------------------------------------------------------------------------------------------] 7% eta: 22m
Error in getBM(attributes = attributes_protein_coding, filters = "ensembl_gene_id", :
  The query to the BioMart webservice returned an invalid result: biomaRt expected a character string of length 1.
Please report this on the support site at http://support.bioconductor.org

Thanks,
Stéphane

biomaRt error • 1.2k views

updated 6.1 years ago by James W. MacDonald 68k • written 6.1 years ago by steppydeklin ▴ 10

score 0 · Answer 1 · 2019-04-09

Mike Smith will probably be along in a bit with a direct answer, but I wonder if you are duplicating work that Johannes Rainer has already done in building the EnsDb packages:

> library(AnnotationHub)

> hub <- AnnotationHub()
> query(hub, c("mus musculus","ensdb"))
AnnotationHub with 8 records
# snapshotDate(): 2018-10-24 
# $dataprovider: Ensembl
# $species: Mus musculus
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH53222"]]' 

            title                            
  AH53222 | Ensembl 87 EnsDb for Mus Musculus
  AH53726 | Ensembl 88 EnsDb for Mus Musculus
  AH56691 | Ensembl 89 EnsDb for Mus Musculus
  AH57770 | Ensembl 90 EnsDb for Mus Musculus
  AH60788 | Ensembl 91 EnsDb for Mus Musculus
  AH60992 | Ensembl 92 EnsDb for Mus Musculus
  AH64461 | Ensembl 93 EnsDb for Mus Musculus
  AH64944 | Ensembl 94 EnsDb for Mus musculus
> musdb <- hub[["AH64944"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%


> gns <- transcriptsBy(musdb)

> gns
GRangesList object of length 55341:
$ENSMUSG00000000001 
GRanges object with 1 range and 8 metadata columns:
      seqnames              ranges strand |              tx_id     tx_biotype
         <Rle>           <IRanges>  <Rle> |        <character>    <character>
  [1]        3 108107280-108146146      - | ENSMUST00000000001 protein_coding
      tx_cds_seq_start tx_cds_seq_end            gene_id tx_support_level
             <integer>      <integer>        <character>        <integer>
  [1]        108109422      108146005 ENSMUSG00000000001                1
             tx_id_version            tx_name
               <character>        <character>
  [1] ENSMUST00000000001.4 ENSMUST00000000001

$ENSMUSG00000000003 
GRanges object with 2 ranges and 8 metadata columns:
      seqnames            ranges strand |              tx_id     tx_biotype
  [1]        X 77837901-77853623      - | ENSMUST00000000003 protein_coding
  [2]        X 77837902-77853530      - | ENSMUST00000114041 protein_coding
      tx_cds_seq_start tx_cds_seq_end            gene_id tx_support_level
  [1]         77841883       77853483 ENSMUSG00000000003                1
  [2]         77841883       77853483 ENSMUSG00000000003                2
              tx_id_version            tx_name
  [1] ENSMUST00000000003.13 ENSMUST00000000003
  [2]  ENSMUST00000114041.2 ENSMUST00000114041

$ENSMUSG00000000028 
GRanges object with 4 ranges and 8 metadata columns:
      seqnames            ranges strand |              tx_id      tx_biotype
  [1]       16 18807356-18811987      - | ENSMUST00000115585  protein_coding
  [2]       16 18780447-18811972      - | ENSMUST00000000028  protein_coding
  [3]       16 18780453-18811626      - | ENSMUST00000096990  protein_coding
  [4]       16 18810108-18811591      - | ENSMUST00000231819 retained_intron
      tx_cds_seq_start tx_cds_seq_end            gene_id tx_support_level
  [1]         18807356       18811565 ENSMUSG00000000028                2
  [2]         18781898       18811565 ENSMUSG00000000028                1
  [3]         18781898       18811565 ENSMUSG00000000028                1
  [4]             <NA>           <NA> ENSMUSG00000000028             <NA>
              tx_id_version            tx_name
  [1]  ENSMUST00000115585.1 ENSMUST00000115585
  [2] ENSMUST00000000028.13 ENSMUST00000000028
  [3]  ENSMUST00000096990.9 ENSMUST00000096990
  [4]  ENSMUST00000231819.1 ENSMUST00000231819

...
<55338 more elements>
-------
seqinfo: 117 sequences from GRCm38 genome

That has pretty much everything except the mappings to NCBI IDs, and I would argue that mapping those is a non-trivial exercise, given the differences between NCBI and EBI/EMBL.

And if you want to do some tidyverse sorcery on the results, you can always unlist that GRangesList, or convert to a DataFrame or a data.frame or (shudders) a tibble.
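As a rough sketch of that flattening step, here is a toy example. The object `toy` is a hypothetical stand-in for the `transcriptsBy()` result, built by hand from three of the transcripts shown above so that it runs without downloading anything. It assumes GenomicRanges is installed, and it drops the list names when unlisting so that repeated gene IDs cannot end up as duplicate row names.

```r
library(GenomicRanges)  # also attaches IRanges and S4Vectors

# Hypothetical toy data: three transcripts from two genes, coordinates copied
# from the output above. Only two of the eight metadata columns are kept.
gr <- GRanges(
  seqnames = c("3", "X", "X"),
  ranges = IRanges(
    start = c(108107280, 77837901, 77837902),
    end   = c(108146146, 77853623, 77853530)
  ),
  strand = "-",
  tx_id = c("ENSMUST00000000001", "ENSMUST00000000003", "ENSMUST00000114041"),
  gene_id = c("ENSMUSG00000000001", "ENSMUSG00000000003", "ENSMUSG00000000003")
)

# Split by gene to get a GRangesList shaped like the transcriptsBy() result.
toy <- split(gr, gr$gene_id)

# Flatten back to a single GRanges. use.names = FALSE drops the (repeated)
# gene names; the gene_id metadata column still carries that information.
flat <- unlist(toy, use.names = FALSE)

# One row per transcript: seqnames, start, end, width, strand, tx_id, gene_id.
tx_df <- as.data.frame(flat)
```

The same two calls should work on the real `gns` object, since it already carries `gene_id` as a metadata column.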

Or, if you just want a database to query, you can always run SQL directly against the underlying SQLite DB:

> DBI::dbListTables(dbconn(musdb))
 [1] "chromosome"     "entrezgene"     "exon"           "gene"          
 [5] "metadata"       "protein"        "protein_domain" "tx"            
 [9] "tx2exon"        "uniprot"
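
For instance, something along these lines should pull a few rows from the `gene` table. This is only a sketch: it assumes a network connection so AnnotationHub can fetch (or find in its cache) the `AH64944` record used above, and it uses `SELECT *` so that no column names have to be guessed.

```r
library(AnnotationHub)
library(ensembldb)  # provides dbconn() for EnsDb objects

# Same record as above; downloaded on first use, then read from the local cache.
hub <- AnnotationHub()
musdb <- hub[["AH64944"]]

# Connection to the SQLite database behind the EnsDb object.
con <- dbconn(musdb)

# Check which columns the gene table actually has before writing a real query.
gene_fields <- DBI::dbListFields(con, "gene")

# Fetch the first five rows of the gene table as a plain data.frame.
first_genes <- DBI::dbGetQuery(con, "SELECT * FROM gene LIMIT 5")
```

Once `gene_fields` tells you what is there, you can write narrower queries and joins against `tx`, `exon` and `tx2exon` in the same way.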