Error in getBM using biomaRt
Oskar ▴ 50
@oskar-13385

Hi Bioconductor support team,

I have been running the same script to retrieve SNP IDs for genomic ranges using biomaRt. It returned the information I needed for about 26 ranges, but two of them did not work.

Here is the script I've been using.

library(biomaRt)

snp_mart = useMart(biomart = "ENSEMBL_MART_SNP",
                   dataset = "hsapiens_snp",
                   host = "grch37.ensembl.org")

snp_id9 = getBM(attributes = c("refsnp_id", "allele","chr_name", "chrom_start", "chrom_end", "chrom_strand"), 
                            filters = c("chr_name", "start", "end"), 
                            values = list(9, 106356922, 107356689), 
                            mart = snp_mart)

snp_id8 = getBM(attributes = c("refsnp_id", "allele","chr_name", "chrom_start", "chrom_end", "chrom_strand"), 
                            filters = c("chr_name", "start", "end"), 
                            values = list(8, 143262170, 144261715), 
                            mart = snp_mart)

I get one of two error messages when I re-run it, sometimes the first and sometimes the second. As I mentioned, I used exactly the same code for 26 other ranges and everything went smoothly.

Here is the first one: Error in getBM(attributes = c("refsnp_id", "allele", "chr_name", "chrom_start", : The query to the BioMart webservice returned an invalid result: biomaRt expected a character string of length 1. Please report this on the support site at http://support.bioconductor.org

Here is the second one: Error in curl::curl_fetch_memory(url, handle = handle) : Timeout was reached: Operation timed out after 600000 milliseconds with 77645 bytes received

I would appreciate any guidance on solving this issue. Thank you very much in advance for your help.

Mike Smith ★ 6.6k
@mike-smith
EMBL Heidelberg

I'm not sure why you get differing error messages, but I think the root cause is that these queries take too long to run and time out. The Ensembl web interface gives a query 5 minutes before it dies, and biomaRt currently allows 10 minutes (the 600000 milliseconds in your second error).

Query time is broadly related to the number of attributes you're returning and the number of values provided to the filters. In this case I think it's the size of the regions that is causing the issue. One approach would be to break the query down into smaller regions, submit each one, and then piece the results back together. Here's an example for your snp_id8:

## create a matrix where each row is a 100kb region
chr <- 8
s1 <- seq(143262170, 144261715, by = 100000)
s2 <- c(s1[-1]-1, 144261715)
regions <- matrix(c(rep(chr, length(s1)), s1, s2), ncol = 3)

## wrapper function to be applied to each row of our matrix  
getBM_values <- function(values) {
  getBM(attributes = c("refsnp_id", "allele","chr_name", "chrom_start", "chrom_end", "chrom_strand"), 
        filters = c("chr_name", "start", "end"), 
        values = list(values[1], values[2], values[3]), 
        mart = snp_mart)
}

## query Ensembl for each region & combine results
res_list <- apply(regions, 1, getBM_values)
snp_id8 <- do.call(rbind, res_list)
> dim(snp_id8 )
[1] 218725      6
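
The same windowing can be wrapped in a small helper so it is reusable for other ranges. This is only a sketch: `make_windows` and `query_windows` are hypothetical names, not biomaRt functions, and a data frame is used instead of a matrix so that a non-numeric chromosome name such as "X" would not coerce the coordinates to character.

```r
## Hypothetical helper (not part of biomaRt): split [start, end] into
## consecutive windows of at most `width` bases.
make_windows <- function(chr, start, end, width = 100000) {
  stopifnot(start <= end, width > 0)
  starts <- seq(start, end, by = width)
  ends <- pmin(starts + width - 1, end)   # clip the last window to `end`
  data.frame(chr = chr, start = starts, end = ends)
}

## Hypothetical wrapper: run one getBM() query per window and stack the results.
## Defined here but not called, because it needs a live connection to Ensembl.
query_windows <- function(windows, mart) {
  res_list <- lapply(seq_len(nrow(windows)), function(i) {
    biomaRt::getBM(
      attributes = c("refsnp_id", "allele", "chr_name",
                     "chrom_start", "chrom_end", "chrom_strand"),
      filters = c("chr_name", "start", "end"),
      values = list(windows$chr[i], windows$start[i], windows$end[i]),
      mart = mart)
  })
  do.call(rbind, res_list)
}

windows <- make_windows(8, 143262170, 144261715)
```

With a live connection, something like `snp_id8 <- query_windows(windows, snp_mart)` should give the same result as the matrix version above.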

Hi Mike - many thanks for your help. It works!!! I had tried creating blocks, but not as a matrix. Your strategy works perfectly. Thank you very much indeed.


Hi Mike - I think there is something wrong with both chromosome 8 and 9. I have retrieved the SNP IDs based on genome coordinates for chromosomes 1, 2, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14, 18, 19, 20, 22, but I can't do it for chromosomes 8 and 9. I also tried your script and mine on another machine, in case my PC was the issue, but that was not successful either.

I am getting this error:

> Error in getBM(attributes = c("refsnp_id", "allele", "chr_name", "chrom_start",  : 
  The query to the BioMart webservice returned an invalid result: biomaRt expected a character string of length 1. 
Please report this on the support site at http://support.bioconductor.org
Called from: getBM(attributes = c("refsnp_id", "allele", "chr_name", 
    "chrom_start", "chrom_end", "chrom_strand"), 
    filters = c("chr_name", "start", "end"), 
    values = list(values[1], values[2], values[3]), mart = snp_mart)
Browse[1]> Q

Also, the debugger pops up and shows this function source:

function (attributes, filters = "", values = "", mart, curl = NULL, 
  checkFilters = TRUE, verbose = FALSE, uniqueRows = TRUE, 
  bmHeader = FALSE, quote = "\"") 
{
  martCheck(mart)
  if (missing(attributes)) 
    stop("Argument 'attributes' must be specified.")
  if (is.list(filters) && !missing(values)) 
    warning("Argument 'values' should not be used when argument 'filters' is a list and will be ignored.")
  if (is.list(filters) && is.null(names(filters))) 
    stop("Argument 'filters' must be a named list when sent as a list.")
  if (!is.list(filters) && filters != "" && missing(values)) 
    stop("Argument 'values' must be specified.")
  if (length(filters) > 0 && length(values) == 0) 
    stop("Values argument contains no data.")
  if (is.list(filters)) {
    values = filters
    filters = names(filters)
  }
  if (class(uniqueRows) != "logical") 
    stop("Argument 'uniqueRows' must be a logical value, so either TRUE or FALSE")
  callHeader <- TRUE
  xmlQuery = paste0("<Query virtualSchemaName='", 
    martVSchema(mart), "' uniqueRows='", as.numeric(uniqueRows), 
    "' count='0' datasetConfigVersion='0.6' header='", 
    as.numeric(callHeader), "' requestid='biomaRt'> <Dataset name='", 
    martDataset(mart), "'>")
  invalid = !(attributes %in% listAttributes(mart, what = "name"))
  if (any(invalid)) 
    stop(paste("Invalid attribute(s):", paste(attributes[invalid], 
      collapse = ", "), "\nPlease use the function 'listAttributes' to get valid attribute names"))
  attributeXML = paste("<Attribute name='", attributes, 
    "'/>", collapse = "", sep = "")
  if (filters[1] != "" && checkFilters) {
    invalid = !(filters %in% listFilters(mart, what = "name"))
    if (any(invalid)) 
      stop(paste("Invalid filters(s):", paste(filters[invalid], 
        collapse = ", "), "\nPlease use the function 'listFilters' to get valid filter names"))
  }
  filterXmlList <- .generateFilterXML(filters, values, mart)
  resultList <- list()
  if (length(filterXmlList) > 1) {
    pb <- progress_bar$new(total = length(filterXmlList), 
      width = options()$width - 10, format = "Batch submitting query [:bar] :percent eta: :eta")
    pb$tick(0)
  }
  for (i in seq_along(filterXmlList)) {
    if (exists("pb")) {
      pb$tick()
    }
    filterXML <- filterXmlList[[i]]
    fullXmlQuery = paste(xmlQuery, attributeXML, filterXML, 
      "</Dataset></Query>", sep = "")
    if (verbose) {
      message(fullXmlQuery)
    }
    sep <- ifelse(grepl(x = martHost(mart), pattern = ".+\\?.+"), 
      "&", "?")
    postRes <- .submitQueryXML(host = paste0(martHost(mart), 
      sep), query = fullXmlQuery)
    if (verbose) {
      writeLines("#################\nResults from server:")
      print(postRes)
    }
    if (!(is.character(postRes) && (length(postRes) == 1L))) 
      stop("The query to the BioMart webservice returned an invalid result: biomaRt expected a character string of length 1. \nPlease report this on the support site at http://support.bioconductor.org")
    if (gsub("\n", "", postRes, fixed = TRUE, useBytes = TRUE) == 
      "") {
      result = as.data.frame(matrix("", ncol = length(attributes), 
        nrow = 0), stringsAsFactors = FALSE)
    }
    else {
      if (length(grep("^Query ERROR", postRes)) > 0L) 
        stop(postRes)
      con = textConnection(postRes)
      result = read.table(con, sep = "\t", header = callHeader, 
        quote = quote, comment.char = "", check.names = FALSE, 
        stringsAsFactors = FALSE)
      if (verbose) {
        writeLines("#################\nParsed results:")
        print(result)
      }
      close(con)
      if (!(is(result, "data.frame") && (ncol(result) == 
        length(attributes)))) {
        print(head(result))
        stop("The query to the BioMart webservice returned an invalid result: the number of columns in the result table does not equal the number of attributes in the query. \nPlease report this on the support site at http://support.bioconductor.org")
      }
    }
    resultList[[i]] <- .setResultColNames(result, mart = mart, 
      attributes = attributes, bmHeader = bmHeader)
  }
  result <- do.call("rbind", resultList)
  return(result)
}

I really appreciate any help in this matter.


Ensembl's BioMart seems to be exceptionally slow for me at the moment - I suspect you're encountering the same problem as before, but even our smaller windows are timing out.
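
If the failures are intermittent rather than consistent, one thing that may help is retrying a failed window a few times before giving up. A minimal sketch, assuming the errors are transient; `with_retry` is a made-up helper, not something biomaRt provides, and the `flaky` function only simulates a query that times out twice:

```r
## Hypothetical helper: call f() up to `max_tries` times, pausing `wait`
## seconds between attempts, and re-raise the last error if all attempts fail.
with_retry <- function(f, max_tries = 3, wait = 0) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_tries) Sys.sleep(wait)
  }
  stop(result)
}

## Stand-in for a query that fails twice and then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("Timeout was reached")
  "ok"
}

out <- with_retry(flaky)
```

For the real queries this could be used as `with_retry(function() getBM_values(regions[1, ]), wait = 30)`, where the 30 second pause is an arbitrary choice.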

Maybe you can try the REST API instead. The following code works similarly to before, but uses the REST API rather than BioMart to retrieve the data, and seems much faster to me.

library(httr)
library(jsonlite)

chr <- 8
s1 <- seq(143262170, 144261715, by = 10000)
s2 <- c(s1[-1]-1, 144261715)
regions <- matrix(c(rep(chr, length(s1)), s1, s2), ncol = 3)

get_snps_via_rest <- function(values) {
  server <- "https://grch37.rest.ensembl.org"
  ext <- paste0("/overlap/region/human/", values[1], ":", values[2], "-", values[3], "?feature=variation;")
  r <- GET(paste(server, ext, sep = ""), content_type("application/json"))
  fromJSON(content(r, as = "text", encoding = "UTF-8"))
}

res_list2 <- apply(regions, 1, get_snps_via_rest)
snp_id8 <- do.call(rbind, res_list2)
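
One thing to watch if you adapt this to other ranges: when numeric coordinates are pasted into a URL, R can render round values in scientific notation (for example 1e+05), which is unlikely to be accepted as a region. The sketch below builds the URL with explicit fixed formatting. `build_overlap_url` is a hypothetical name, and the server and endpoint are simply copied from the code above.

```r
## Hypothetical helper: build the overlap URL with coordinates always written
## as plain digits, never in scientific notation.
build_overlap_url <- function(chr, start, end,
                              server = "https://grch37.rest.ensembl.org") {
  paste0(server, "/overlap/region/human/", chr, ":",
         sprintf("%.0f", start), "-", sprintf("%.0f", end),
         "?feature=variation")
}

url <- build_overlap_url(8, 143000000, 143009999)
```

The result can be passed to `GET()` in place of `paste(server, ext, sep = "")`.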

You don't have quite the same control over the output columns, but I think everything you want should still be present:

>  head(snp_id8)
  alleles   consequence_type source           id strand       end assembly_name feature_type     start clinical_significance seq_region_name
1    C, T intergenic_variant  dbSNP  rs139849327      1 143262170        GRCh37    variation 143262170                  NULL               8
2    G, C intergenic_variant  dbSNP rs1337146651      1 143262177        GRCh37    variation 143262177                  NULL               8
3 G, A, T intergenic_variant  dbSNP  rs549325531      1 143262179        GRCh37    variation 143262179                  NULL               8
4    G, A intergenic_variant  dbSNP rs1222699542      1 143262183        GRCh37    variation 143262183                  NULL               8
5    T, A intergenic_variant  dbSNP  rs896634493      1 143262190        GRCh37    variation 143262190                  NULL               8
6    G, A intergenic_variant  dbSNP  rs187036709      1 143262193        GRCh37    variation 143262193                  NULL               8
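
If you want the REST result to look like the earlier getBM() output, you can pick and rename the columns. A rough sketch, assuming `alleles` comes back as a list column (as the printed output suggests) and that joining the alleles with "/" is the format you want. `tidy_rest_snps` is a made-up name, and the small `example` data frame just mimics two of the rows shown above.

```r
## Hypothetical helper: reshape the REST result into the six columns that the
## original getBM() call requested.
tidy_rest_snps <- function(x) {
  data.frame(
    refsnp_id    = x$id,
    # collapse each allele vector, e.g. c("C", "T") becomes "C/T"
    allele       = vapply(x$alleles, paste, character(1), collapse = "/",
                          USE.NAMES = FALSE),
    chr_name     = x$seq_region_name,
    chrom_start  = x$start,
    chrom_end    = x$end,
    chrom_strand = x$strand,
    stringsAsFactors = FALSE
  )
}

## Mock input with the same column names as the REST output.
example <- data.frame(
  id = c("rs139849327", "rs549325531"),
  seq_region_name = c("8", "8"),
  start = c(143262170, 143262179),
  end = c(143262170, 143262179),
  strand = c(1, 1),
  stringsAsFactors = FALSE
)
example$alleles <- list(c("C", "T"), c("G", "A", "T"))

tidy <- tidy_rest_snps(example)
```

On the real data this would be `tidy_rest_snps(snp_id8)`.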