Question

Obtaining rsIDs with ENSG and variant ID information

bd2000 ▴ 30

@5d657c1d

United Kingdom

Hi all,

I'm trying to obtain rsIDs for my dataset using biomaRt, and I'm not sure how to go about it. My dataset has over 8,000,000 rows and the following columns:

      phenotype_id   variant_id tss_distance   maf ma_samples ma_count pval_nominal       slope  slope_se
1: ENSG00000112679 6_203909_A_G      -147446 0.065         13       13    0.7262169  0.08764783 0.2494609
2: ENSG00000112679 6_204072_G_T      -147283 0.065         13       13    0.7262169  0.08764783 0.2494609

Is it possible to get rsIDs with just this information? If so, how would I go about it? Thank you in advance!

rsid biomaRt • 1.4k views

updated 17 months ago by Robert Castelo ★ 3.4k • written 17 months ago by bd2000 ▴ 30

score 0 · Answer 1 · 2023-12-01

Robert Castelo ★ 3.4k

@rcastelo

Barcelona/Universitat Pompeu Fabra

First and foremost, find out which version of the human reference genome your dataset was derived from. Assuming it was GRCh38, you can use the annotation package SNPlocs.Hsapiens.dbSNP155.GRCh38 to find the rsIDs, as illustrated in this previous answer in this forum to the same question. To build the input GPos object from a data.frame like yours, you can do the following (others in this forum may suggest more compact solutions):

## this just simulates the first two columns of your input dataset
library(GenomicRanges)  ## provides GPos()
dat <- data.frame(phenotype_id=c("ENSG00000112679", "ENSG00000112679"),
                  variant_id=c("6_203909_A_G", "6_204072_G_T"))
dat
     phenotype_id   variant_id
1 ENSG00000112679 6_203909_A_G
2 ENSG00000112679 6_204072_G_T
## split each variant_id into chromosome, position, ref and alt alleles
my_snps <- strsplit(dat$variant_id, "_")
my_snps <- GPos(seqnames=sapply(my_snps, "[", 1),
                pos=as.integer(sapply(my_snps, "[", 2)))
my_snps
UnstitchedGPos object with 2 positions and 0 metadata columns:
      seqnames       pos strand
         <Rle> <integer>  <Rle>
  [1]        6    203909      *
  [2]        6    204072      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

and then apply the code in the answer linked above.
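
For readers without the linked answer at hand, here is a minimal sketch of what the lookup step might look like. It is not the code from that answer. The helper names parse_variant_id() and add_rsids() are hypothetical. The sketch assumes GRCh38 coordinates, chromosome names written as "6" rather than "chr6", and that the GenomicRanges, BSgenome and SNPlocs packages are installed. It relies on snpsByOverlaps() from BSgenome, which returns the known SNPs at the query positions with their identifiers in the RefSNP_id metadata column.

```r
## Hypothetical helper: split "chrom_pos_ref_alt" identifiers into columns.
## Uses base R only; assumes every identifier has the same number of fields.
parse_variant_id <- function(variant_id) {
  parts <- do.call(rbind, strsplit(variant_id, "_", fixed = TRUE))
  data.frame(chrom = parts[, 1],
             pos = as.integer(parts[, 2]),
             ref = parts[, 3],
             alt = parts[, 4],
             stringsAsFactors = FALSE)
}

## Hypothetical helper: add an rsid column to `dat`.
## `snps` is the SNPlocs object, e.g. SNPlocs.Hsapiens.dbSNP155.GRCh38 after
## library(SNPlocs.Hsapiens.dbSNP155.GRCh38). Assumes GRCh38 coordinates.
add_rsids <- function(dat, snps) {
  v <- parse_variant_id(dat$variant_id)
  my_snps <- GenomicRanges::GPos(seqnames = v$chrom, pos = v$pos)
  ## known SNPs located at the query positions
  hits <- BSgenome::snpsByOverlaps(snps, my_snps)
  ## map each hit back to the rows of `dat` at that position
  ov <- IRanges::findOverlaps(my_snps, hits)
  dat$rsid <- NA_character_
  ## positions with several rsIDs keep the last one; alleles are not compared
  dat$rsid[S4Vectors::queryHits(ov)] <- hits$RefSNP_id[S4Vectors::subjectHits(ov)]
  dat
}
```

After loading the SNPlocs package, this would be called as add_rsids(dat, SNPlocs.Hsapiens.dbSNP155.GRCh38). With 8,000,000 rows it may be worth running the lookup on the unique positions only, or chromosome by chromosome, and joining the result back afterwards.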

17 months ago • Robert Castelo ★ 3.4k

Thank you for your answer. I've tried to download the package but it didn't work:

Error in download.file(url, destfile, method, mode = "wb", ...) : 
  download from 'https://bioconductor.org/packages/3.18/data/annotation/src/contrib/SNPlocs.Hsapiens.dbSNP155.GRCh38_0.99.24.tar.gz' failed
In addition: Warning messages:
1: In download.file(url, destfile, method, mode = "wb", ...) :
  downloaded length 0 != reported length 0
2: In download.file(url, destfile, method, mode = "wb", ...) :
  URL 'https://bioconductor.org/packages/3.18/data/annotation/src/contrib/SNPlocs.Hsapiens.dbSNP155.GRCh38_0.99.24.tar.gz': Timeout of 300 seconds was reached

Is there any other way I can download the package or maybe a different package I can use?

17 months ago • bd2000 ▴ 30

This is a large package, and timeouts are common when downloading and installing large packages. Try setting a longer timeout (the value is in seconds) and installing it again, i.e.:

options(timeout=1200)
BiocManager::install("SNPlocs.Hsapiens.dbSNP155.GRCh38")

17 months ago • Robert Castelo ★ 3.4k