How to find rsID with biomaRt in R
2
1
Entering edit mode
Federica ▴ 10
@federica-24874
Last seen 3.6 years ago
Italy

Hi,

I've to find rsID from chromosome positions and chromosome names of several SNPs. I've found biomaRt package in R to solve my problem.

I've the following .txt file:

CHR chr_start   chr_end
  8 101592213 101592213
  8 106973048 106973048
  8 108690829 108690829
  8 102569817 102569817
  8 108580746 108580746

I've loaded the biomaRt package and then I've created the Mart object with the following command:

snp_mart <- useMart(biomart="ENSEMBL_MART_SNP", host="grch37.ensembl.org", 
+                     path="/biomart/martservice", dataset="hsapiens_snp")

Then, I created the following function to get rsIDs, alleles, and chromosome positions:

> rescue_rsid <- function(db) {
+   
+   temp <- getBM(attributes = c('refsnp_id', 'allele', 'chrom_start'), 
+                 filters = c('chr_name', 'start', 'end'), 
+                 values = list(db[,which(colnames(db) == "CHR")], db[,which(colnames(db) == "chr_start")], db[,which(colnames(db) == "chr_end")]), 
+                 mart = snp_mart)
+   return(temp) 
+ }

Then, I want to apply the previous function to all my SNPs with the following command:

apply(ds4, 1, rescue_rsid)

where ds4 is the dataframe with my filters, but I found the following results in Rconsole:

> rescue_rsid(ds4)
Error in curl::curl_fetch_memory(url, handle = handle) : 
  Timeout was reached: [grch37.ensembl.org:80] Operation timed out after 300012 milliseconds with 12109 bytes received
Called from: curl::curl_fetch_memory(url, handle = handle)
Browse[1]> rescue_rsid(ds4[1,])
   refsnp_id allele chrom_start
1 rs62513865    C/T   101592213
Browse[1]> 

However, when I run the following command:

> rescue_rsid(ds4[1,])

Rconsole reported the following result:

  refsnp_id allele chrom_start
1 rs62513865    C/T   101592213

So, this would mean that the function searches rsID but I don't understand why it doesn't work for all my SNPs...

Could someone help me? Thank u!

My R session:

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=Italian_Italy.1252  LC_CTYPE=Italian_Italy.1252    LC_MONETARY=Italian_Italy.1252
[4] LC_NUMERIC=C                   LC_TIME=Italian_Italy.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.44.4

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6           pillar_1.4.7         compiler_4.0.3       dbplyr_2.1.0         prettyunits_1.1.1   
 [6] tools_4.0.3          progress_1.2.2       bit_4.0.4            tibble_3.0.6         RSQLite_2.2.3       
[11] memoise_2.0.0        BiocFileCache_1.12.1 lifecycle_1.0.0      pkgconfig_2.0.3      rlang_0.4.10        
[16] DBI_1.1.1            curl_4.3             parallel_4.0.3       fastmap_1.1.0        withr_2.4.1         
[21] stringr_1.4.0        httr_1.4.2           dplyr_1.0.4          xml2_1.3.2           rappdirs_0.3.3      
[26] generics_0.1.0       S4Vectors_0.26.1     vctrs_0.3.6          askpass_1.1          IRanges_2.22.2      
[31] hms_1.0.0            tidyselect_1.1.0     stats4_4.0.3         bit64_4.0.5          glue_1.4.2          
[36] Biobase_2.48.0       R6_2.5.0             AnnotationDbi_1.50.3 XML_3.99-0.5         purrr_0.3.4         
[41] blob_1.2.1           magrittr_2.0.1       ellipsis_0.3.1       BiocGenerics_0.34.0  assertthat_0.2.1    
[46] stringi_1.5.3        openssl_1.4.3        cachem_1.0.3         crayon_1.4.1  
biomaRt • 12k views
ADD COMMENT
2
Entering edit mode
Mike Smith ★ 6.6k
@mike-smith
Last seen 23 hours ago
EMBL Heidelberg

Thanks for the nicely structured question! There's a few different things to address here, which are intertwined, but hopefully we can get your query working. Lets start with the error:

Operation timed out after 300012 milliseconds

This happens because there's a 5 minute time limit on queries to the Ensembl BioMart. You can see it mentioned in the disclaimer at the bottom of the BioMart website, and it's enforced inside the biomaRt code. So seeing this error doesn't necessarily mean your code is wrong; it might just be slow. Equally it could be a busy time and server doesn't respond quickly. Since you're submitting lots of queries inside the apply() it may only be one that's failing and stopping the whole thing.


One thing to note about biomaRt is that it will accept vectors to the values argument, so you don't need to loop (or apply) over all you chromosomal positions. You can do a single call to getBM() and pass the three data.frame columns to the values list.

If you're not using the apply() but actually doing rescue_rsid(ds4) then I think you're already taking advantage of this vectorisation - but that might actually be the cause of the timeout. Generally one large query is faster than lots of small queries, but if the server is being slow it might be hitting the 5 minute limit.

However, for this specific type of queries there's a slightly simpler way if you know there's a filter called chromosomal_position. This takes a colon separated character resenting a region of the form chr:start:end:strand. (You can omit strand here).

I think this should be slightly faster than providing the three arguments separately. Here's an example with your data:

library(biomaRt)

## create example data
ds4 <- data.frame(CHR = c("8", "8", "8", "8", "8"),
                 chr_start = c('101592213', '106973048', '108690829', '102569817', '108580746'),
                 chr_end = c('101592213', '106973048', '108690829', '102569817', '108580746'))

snp_mart <- useEnsembl(biomart="ENSEMBL_MART_SNP", 
                       host="grch37.ensembl.org", 
                       dataset="hsapiens_snp")

## combine the positions in to a single vector
position <- apply(ds4, 1, paste, collapse = ":")
position
#> [1] "8:101592213:101592213" "8:106973048:106973048" "8:108690829:108690829"
#> [4] "8:102569817:102569817" "8:108580746:108580746"

getBM(attributes = c('refsnp_id', 'allele', 'chrom_start'), 
      filters = 'chromosomal_region', 
      values = position, 
      mart = snp_mart)
#>     refsnp_id  allele chrom_start
#> 1  rs62513865     C/T   101592213
#> 2   rs6994300     G/A   102569817
#> 3  rs79643588     G/A   106973048
#> 4 rs138449472 G/A/C/T   108580746
#> 5  rs17396518   T/C/G   108690829

It's important to note that when you submit a vector of values to biomaRt the order they are returned in is not guaranteed to be the same as you submitted them, and any value that didn't get a hit in their database will be missing from the results. So while this is a much more efficient way of querying, you need to do some sanity checking on the results if order or completeness is import to whatever you're doing next.

ADD COMMENT
0
Entering edit mode

Hi Mike,

Thanks for a great explanation of how to do this. I am trying to get rsIDs for a list of SNPs for which I have chrom, start and end position and I used your method. I am a bit confused however, because the results I get have SNPs that weren't in my original query. I am inputting lists of 500 chromosomal regions at a time because it times out otherwise, and the results have around 900 observations each, meaning that hundreds of variants are in the results that aren't in the input.

For example, these are the first 20 variants in my ordered list of chromosomal regions:

> position_head[1:20]
 [1] "1:738223:738223" "1:768448:768448" "1:777232:777232" "1:791853:791853" "1:794332:794332" "1:795222:795222"\
 [7] "1:798801:798801" "1:799499:799499" "1:801536:801536" "1:801661:801661" "1:801680:801680" "1:801858:801858"\
[13] "1:802856:802856" "1:803432:803432" "1:804978:804979" "1:805556:805556" "1:809876:809876" "1:821325:821329"\
[19] "1:824398:824398" "1:828166:828166"

and here are some of the results from the getBM query:

> results_1[11:14,]
refsnp_id                                                      allele chrom_start chrom_end chr_name \
11 rs1639792397                                                    CACCA/CA      801676    801680        1\
12   rs12134490                                                         A/C      801680    801680        1\
13 rs1639795103 TCCTCCCTTTGCACTCGAATCCCGGACATCCTCCTATGTCTCACGTGGTCCTC/TCCTC     **801853    801905        1**\
14   rs17276806                                                         C/T      801858    801858        1

There is a variant with position 1:801853:801905 which was not in my original list. I don't want these extra variants in the results because it's meaning the query takes much longer. Do you know how to make sure that I don't get these results that I didn't request?

My getBM query was:

> results<-getBM(attributes = c('refsnp_id', 'allele','chrom_start','chrom_end','chr_name'), \
>       filters = 'chromosomal_region', \
>       values = position_head, \
>       mart = snp_mart)

Thanks, Anja

ADD REPLY
0
Entering edit mode
@james-w-macdonald-5106
Last seen 2 hours ago
United States

Does this help explain things?

> library(GenomicRanges)
> library(biomaRt)
Warning message:
package 'biomaRt' was built under R version 4.3.2 
> snp_mart <- useEnsembl(biomart="ENSEMBL_MART_SNP", 
                       host="grch37.ensembl.org", 
                       dataset="hsapiens_snp")
## scan in your positions
> vals <- scan("clipboard","c")
## biomaRt call
> z <- getBM(attributes = c('refsnp_id', 'allele', 'chrom_start','chrom_end','chr_name'), 
      filters = 'chromosomal_region', 
      values = vals, mart = snp_mart)
## convert to GRanges objects
> valgr <- GRanges(sapply(strsplit(vals, ":"), function(x) paste0(x[1], ":", x[2], "-", x[3])))
> valgr
GRanges object with 20 ranges and 0 metadata columns:
       seqnames        ranges strand
          <Rle>     <IRanges>  <Rle>
   [1]        1        738223      *
   [2]        1        768448      *
   [3]        1        777232      *
   [4]        1        791853      *
   [5]        1        794332      *
   ...      ...           ...    ...
  [16]        1        805556      *
  [17]        1        809876      *
  [18]        1 821325-821329      *
  [19]        1        824398      *
  [20]        1        828166      *

> resultgr <- GRanges(z[,5], IRanges(z[,3],z[,4]), rsid = z[,1], allele = z[,2])
> resultgr
GRanges object with 31 ranges and 2 metadata columns:
       seqnames        ranges strand
          <Rle>     <IRanges>  <Rle>
   [1]        1        738223      *
   [2]        1        768448      *
   [3]        1        777232      *
   [4]        1        791853      *
   [5]        1        794332      *
   ...      ...           ...    ...
  [27]        1 821326-821332      *
  [28]        1        821328      *
  [29]        1        821329      *
  [30]        1        824398      *
  [31]        1        828166      *
       |        rsid
       | <character>
   [1] | rs138388092
   [2] |  rs12562034
   [3] | rs112618790
   [4] |   rs6684487
   [5] |  rs12127425
   ... .         ...
  [27] |  rs78941097
  [28] | rs796562407
  [29] | rs202018963
  [30] |   rs7538305
  [31] |  rs13303117
                       allele
                  <character>
   [1]                    T/C
   [2]                    G/A
   [3]                    C/T
   [4]                  G/A/T
   [5]                    G/A
   ...                    ...
  [27] CAGTCAG/CAG/CAGTCAGT..
  [28]                  G/C/T
  [29]                  T/A/G
  [30]                  A/C/T
  [31]                  A/G/T
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

## Now get overlaps
> fo <- findOverlaps(valgr, resultgr)
> fo
Hits object with 31 hits and 0 metadata columns:
       queryHits subjectHits
       <integer>   <integer>
   [1]         1           1
   [2]         2           2
   [3]         3           3
   [4]         4           4
   [5]         5           5
   ...       ...         ...
  [27]        18          27
  [28]        18          28
  [29]        18          29
  [30]        19          30
  [31]        20          31
  -------
  queryLength: 20 / subjectLength: 31

## why were duplicates returned?
> rsltgrlst <- splitAsList(resultgr, queryHits(fo))
> rsltgrlst[lengths(rsltgrlst) > 1L]
GRangesList object of length 5:
$`11`
GRanges object with 2 ranges and 2 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]        1 801676-801680      *
  [2]        1        801680      *
      |         rsid      allele
      |  <character> <character>
  [1] | rs1639792397    CACCA/CA
  [2] |   rs12134490         A/C
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

$`12`
GRanges object with 2 ranges and 2 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]        1 801853-801905      *
  [2]        1        801858      *
      |         rsid
      |  <character>
  [1] | rs1639795103
  [2] |   rs17276806
                      allele
                 <character>
  [1] TCCTCCCTTTGCACTCGAAT..
  [2]                    C/T
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

$`15`
GRanges object with 5 ranges and 2 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]        1 804974-804980      *
  [2]        1 804977-804978      *
  [3]        1        804978      *
  [4]        1        804979      *
  [5]        1 804979-804989      *
      |         rsid
      |  <character>
  [1] | rs1639841302
  [2] | rs1190791124
  [3] | rs1639841366
  [4] |  rs114479279
  [5] |  rs202146812
                      allele
                 <character>
  [1]            TAATTAA/TAA
  [2]                 TT/TTT
  [3]                    T/A
  [4]                    A/T
  [5] AAAAAAAAAAA/AAAAAA/A..
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

$`16`
GRanges object with 2 ranges and 2 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]        1 805548-805576      *
  [2]        1        805556      *
      |         rsid
      |  <character>
  [1] | rs1639852315
  [2] |   rs72631880
                      allele
                 <character>
  [1] GGGCCCCGTGATTATATTTG..
  [2]                  T/A/C
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

$`18`
GRanges object with 5 ranges and 2 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]        1 821324-821325      *
  [2]        1 821325-821328      *
  [3]        1 821326-821332      *
  [4]        1        821328      *
  [5]        1        821329      *
      |         rsid
      |  <character>
  [1] | rs1640080539
  [2] | rs1640080568
  [3] |   rs78941097
  [4] |  rs796562407
  [5] |  rs202018963
                      allele
                 <character>
  [1]                   AA/A
  [2]                 ACAG/-
  [3] CAGTCAG/CAG/CAGTCAGT..
  [4]                  G/C/T
  [5]                  T/A/G
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
ADD COMMENT

Login before adding your answer.

Traffic: 739 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6