Question

Annotating outdated SNP identifiers

sandmann.t ▴ 70

@sandmannt-11014

Dear Bioconductors,

I have a long list of dbSNP identifiers (e.g. rs7335199), some of which are outdated and now represented by a new identifier.

For example, the following call to Ensembl's REST API reveals that variant rs7335199 is now referred to as rs3.

curl 'https://rest.ensembl.org/variant_recoder/human/rs7335199?fields=id' -H 'Content-type:application/json'

[{"input":"rs7335199","id":["rs3"]}]

My goal is to retrieve the genome coordinates (GRCh38) and, if available, the current variant identifier for millions of variants. So far, I have tried:

Ensembl's REST service: great for smaller queries, but not for millions of variants
the biomaRt Bioconductor package: works great, but takes a long time to query
the SNPlocs.Hsapiens.dbSNP150.GRCh38 Bioconductor package: contains the coordinates for up-to-date variants (e.g. rs3) but not outdated ones (e.g. rs7335199).

library(SNPlocs.Hsapiens.dbSNP150.GRCh38)
snps <- SNPlocs.Hsapiens.dbSNP150.GRCh38

snpsById(snps, c("rs3", "rs7335199"), ifnotfound = "drop")
GPos object with 1 position and 2 metadata columns:
      seqnames       pos strand |   RefSNP_id alleles_as_ambig
         <Rle> <integer>  <Rle> | <character>      <character>
  [1]       13  31872705      * |         rs3                Y
  -------
  seqinfo: 25 sequences (1 circular) from GRCh38.p7 genome

What is the recommended way to obtain

the current dbSNP identifiers and
their genomic coordinates for a mixed list of current and deprecated variant ids?

Any pointers would be great!

Many thanks,

Thomas

variants variantannotation SNPlocs.Hsapiens.dbSNP150.GRCh38 biomaRt

Posted 7.1 years ago by sandmann.t

The time taken for biomaRt queries tends to scale exponentially with the number of values you're asking for. Perhaps you could use biomaRt only for the ID conversion, and then SNPlocs.Hsapiens.dbSNP150.GRCh38 for the coordinates?
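A rough sketch of that two-step idea, in case it helps. It looks every id up locally first, sends only the ids that SNPlocs does not know to biomaRt, and then fetches coordinates for the resolved ids locally again. The helper names (`map_to_current`, `resolve_and_locate`) are made up for illustration. The filter `snp_synonym_filter` and the attribute `synonym_name` are what I believe the Ensembl variation mart exposes, but treat them as assumptions and confirm with `listFilters(mart)` and `listAttributes(mart)` first.

```r
# Sketch only, not a tested pipeline.
# Assumptions: the "hsapiens_snp" mart offers the filter "snp_synonym_filter"
# and the attribute "synonym_name"; verify with listFilters()/listAttributes().

# Hypothetical pure helper: map each input id to its current id.
# `synonyms` needs columns refsnp_id (current id) and synonym_name (old id).
map_to_current <- function(ids, found_ids, synonyms) {
  via_synonym <- synonyms$refsnp_id[match(ids, synonyms$synonym_name)]
  current <- ifelse(ids %in% found_ids, ids, via_synonym)
  data.frame(input = ids, current = current, stringsAsFactors = FALSE)
}

# Hypothetical wrapper: needs biomaRt, BSgenome and the SNPlocs package
# installed, plus network access for the biomaRt step.
resolve_and_locate <- function(ids) {
  snps <- SNPlocs.Hsapiens.dbSNP150.GRCh38::SNPlocs.Hsapiens.dbSNP150.GRCh38

  # Step 1: local lookup; unknown (possibly deprecated) ids are dropped.
  found <- BSgenome::snpsById(snps, ids, ifnotfound = "drop")
  found_ids <- S4Vectors::mcols(found)$RefSNP_id
  missing <- setdiff(ids, found_ids)

  # Step 2: ask biomaRt only about the ids that were not found locally.
  synonyms <- data.frame(refsnp_id = character(0),
                         synonym_name = character(0),
                         stringsAsFactors = FALSE)
  if (length(missing) > 0) {
    mart <- biomaRt::useEnsembl(biomart = "snps", dataset = "hsapiens_snp")
    synonyms <- biomaRt::getBM(
      attributes = c("refsnp_id", "synonym_name"),
      filters    = "snp_synonym_filter",   # assumed filter name
      values     = missing,
      mart       = mart
    )
  }

  # Step 3: coordinates for all ids that could be resolved.
  mapping <- map_to_current(ids, found_ids, synonyms)
  current <- unique(mapping$current[!is.na(mapping$current)])
  list(mapping   = mapping,
       positions = BSgenome::snpsById(snps, current, ifnotfound = "drop"))
}

# Usage (not run here): resolve_and_locate(c("rs3", "rs7335199"))
```

The point of the split is that the slow remote query only sees the subset of ids that are missing from the local package, which for a mostly up-to-date list should be much smaller than the full input.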

I guess if you have 1 million SNPs, biomaRt will batch that into 2000 separate queries. Even at 1 second per query that's over 30 minutes, so I can see why you want something quicker if this is more than a one-off thing.
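To make the arithmetic explicit, here is a small base-R sketch. The batch size of 500 is not stated above; it is simply what 1 million ids divided by 2000 queries implies, so take it as an assumption. `chunk_ids` is a hypothetical helper showing the same chunking on a toy vector.

```r
# Back-of-the-envelope estimate for the numbers quoted above.
n_ids <- 1e6               # "1 million SNPs"
batch_size <- 500          # assumption: inferred from 1e6 ids / 2000 queries
seconds_per_query <- 1     # the optimistic per-query figure from the post

n_queries <- ceiling(n_ids / batch_size)
total_minutes <- n_queries * seconds_per_query / 60

# Hypothetical helper: split a vector of ids into consecutive batches.
chunk_ids <- function(ids, size) {
  split(ids, ceiling(seq_along(ids) / size))
}
toy_batches <- chunk_ids(paste0("rs", 1:1200), batch_size)

cat(n_queries, "queries, about", round(total_minutes, 1), "minutes\n")
cat(length(toy_batches), "toy batches of sizes", lengths(toy_batches), "\n")
```

With these inputs the estimate comes out at a little over 33 minutes, and that is before any slowdown on the server side.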

Posted 7.1 years ago by Mike Smith