Dear Bioconductors,
I have a long list of dbSNP identifiers (e.g. rs7335199), some of which are outdated and now represented by a new identifier.
For example, the following call to Ensembl's REST API reveals that variant rs7335199 is now referred to as rs3.
curl 'https://rest.ensembl.org/variant_recoder/human/rs7335199?fields=id' -H 'Content-type:application/json'
[{"input":"rs7335199","id":["rs3"]}]
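As a minimal sketch, a response of this shape could be turned into an old-to-new lookup in R. It assumes the jsonlite package is installed and that only the first identifier reported per input is wanted; `id_map` is just an illustrative name.

```r
library(jsonlite)

# Response body copied from the curl call above.
response <- '[{"input":"rs7335199","id":["rs3"]}]'

# simplifyVector = FALSE keeps the nested list structure predictable.
records <- fromJSON(response, simplifyVector = FALSE)

# Current ID for each record; NA if the service reported none.
# Keeping only the first ID per input is an assumption of this sketch.
id_map <- vapply(
  records,
  function(rec) if (length(rec$id) > 0) rec$id[[1]] else NA_character_,
  character(1)
)

# Name each entry by the ID that was submitted.
names(id_map) <- vapply(records, function(rec) rec$input, character(1))

id_map
```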
My goal is to retrieve the genome coordinates (GRCh38) and, if available, the current variant identifier for millions of variants. So far, I have tried
- Ensembl's REST service: great for smaller queries, but not for millions of variants
- the biomaRt Bioconductor package: works great, but takes a long time to query
- the SNPlocs.Hsapiens.dbSNP150.GRCh38 Bioconductor package: contains the coordinates for up-to-date variants (e.g. rs3) but not outdated ones (e.g. rs7335199).
library(SNPlocs.Hsapiens.dbSNP150.GRCh38)
snps <- SNPlocs.Hsapiens.dbSNP150.GRCh38
snpsById(snps, c("rs3", "rs7335199"), ifnotfound = "drop")
GPos object with 1 position and 2 metadata columns:
      seqnames       pos strand |   RefSNP_id alleles_as_ambig
         <Rle> <integer>  <Rle> | <character>      <character>
  [1]       13  31872705      * |         rs3                Y
  -------
  seqinfo: 25 sequences (1 circular) from GRCh38.p7 genome
What is the recommended way to obtain, for a mixed list of current and deprecated variant IDs,
- the current dbSNP identifiers and
- their genomic coordinates?
Any pointers would be great!
Many thanks,
Thomas
Time taken for biomaRt queries tends to scale exponentially with the number of values you're asking for. Perhaps you could use biomaRt only for the ID conversion, and then SNPlocs.Hsapiens.dbSNP150.GRCh38 for the coordinates?
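A rough sketch of that split could look like the following. It assumes that merged IDs are exposed as synonyms in the Ensembl variation mart and that the filter and attribute names used here (`snp_synonym_filter`, `refsnp_id`, `synonym_name`) exist in your Ensembl release; check with `listFilters()` and `listAttributes()` first. The helper names `update_ids()` and `locate_snps()` are made up for illustration.

```r
# Replace outdated IDs with their current ID, given a synonym table with
# columns refsnp_id (current) and synonym_name (outdated). IDs without a
# match are returned unchanged.
update_ids <- function(ids, synonym_table) {
  idx <- match(ids, synonym_table$synonym_name)
  ifelse(is.na(idx), ids, synonym_table$refsnp_id[idx])
}

# Hypothetical wrapper: SNPlocs for coordinates, biomaRt only for the IDs
# that SNPlocs does not recognise. Requires the packages named below.
locate_snps <- function(ids) {
  snps <- SNPlocs.Hsapiens.dbSNP150.GRCh38::SNPlocs.Hsapiens.dbSNP150.GRCh38

  # First pass: IDs unknown to SNPlocs are silently dropped.
  found <- BSgenome::snpsById(snps, ids, ifnotfound = "drop")
  outdated <- setdiff(ids, S4Vectors::mcols(found)$RefSNP_id)
  if (length(outdated) == 0) return(found)

  # Query biomaRt for the dropped IDs only (filter name is an assumption).
  mart <- biomaRt::useEnsembl(biomart = "snps", dataset = "hsapiens_snp")
  synonyms <- biomaRt::getBM(
    attributes = c("refsnp_id", "synonym_name"),
    filters = "snp_synonym_filter",
    values = outdated,
    mart = mart
  )

  # Second pass: coordinates for the updated, de-duplicated ID set.
  current <- unique(update_ids(ids, synonyms))
  BSgenome::snpsById(snps, current, ifnotfound = "drop")
}

# Usage (needs network access and the SNPlocs package):
# gpos <- locate_snps(c("rs3", "rs7335199"))
```

IDs that are still unknown after the second pass (for example ones that were withdrawn rather than merged) would simply be dropped, so it may be worth comparing the input and output ID sets afterwards.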
I guess if you have 1 million SNPs, biomaRt will batch that into 2000 separate queries. Even at 1 second per query that's over 30 mins, so I can see why you want something quicker if this is more than a one-off thing.
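For what it's worth, the back-of-the-envelope numbers work out as below. The batch size of 500 is only what the figures above imply (1 million SNPs in 2000 queries), and 1 second per query is the optimistic guess from the same estimate.

```r
n_snps <- 1e6
batch_size <- 500        # implied by 1 million SNPs -> 2000 queries
seconds_per_query <- 1   # optimistic guess

# Number of separate queries and the total time in minutes.
n_queries <- ceiling(n_snps / batch_size)
minutes <- n_queries * seconds_per_query / 60

# Splitting a vector of IDs into batches of that size (toy IDs for illustration).
ids <- paste0("rs", seq_len(1200))
batches <- split(ids, ceiling(seq_along(ids) / batch_size))

n_queries
minutes
lengths(batches)
```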