Question

extract rsids of SNPs using their genomic positions

olgadolgova • 0

I need to get the rsIDs of SNPs whose genomic positions are stored in a text file, in columns named "chromosome", "start", and "end". "start" and "end" have equal values, giving the genomic position of the SNP in each row. Which package and functions should I use in my case?


# include your problematic code here with any corresponding output
BiocManager::install("SNPlocs.Hsapiens.dbSNP150.GRCh38")
library(SNPlocs.Hsapiens.dbSNP150.GRCh38)
data <- read.table("10_79854257_rsid.txt", header = TRUE)
gr <- GRanges(seqnames = Rle(data$chromosome),
              ranges = IRanges(start = data$start, end = data$end))
rsids <- findGRanges(gr, columns = "rsid")
data$rsid <- rsids
write.table(data, "10_79854257_with_rsids.txt", sep = "\t", quote = FALSE, row.names = FALSE)

#Error in findGRanges(gr, columns = "rsid") : 
  could not find function "findGRanges"

# please also include the results of running the following in an R session 

sessionInfo( )

R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default


locale:
[1] LC_COLLATE=English_United Kingdom.utf8  LC_CTYPE=English_United Kingdom.utf8   
[3] LC_MONETARY=English_United Kingdom.utf8 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.utf8    

time zone: Europe/Madrid
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.17.0 GenomicFeatures_1.52.1                  
 [3] AnnotationDbi_1.62.2                     VariantAnnotation_1.46.0                
 [5] Rsamtools_2.16.0                         SummarizedExperiment_1.30.2             
 [7] Biobase_2.60.0                           MatrixGenerics_1.12.2                   
 [9] matrixStats_1.0.0                        BSgenome.Hsapiens.UCSC.hg38_1.4.5       
[11] SNPlocs.Hsapiens.dbSNP150.GRCh38_0.99.20 BSgenome_1.68.0                         
[13] rtracklayer_1.60.0                       Biostrings_2.68.1                       
[15] XVector_0.40.0                           GenomicRanges_1.52.0                    
[17] GenomeInfoDb_1.36.1                      IRanges_2.34.1                          
[19] S4Vectors_0.38.1                         BiocGenerics_0.46.0                     

loaded via a namespace (and not attached):
  [1] rstudioapi_0.15.0        magrittr_2.0.3           TH.data_1.1-2           
  [4] rmarkdown_2.23           fs_1.6.2                 BiocIO_1.10.0           
  [7] zlibbioc_1.46.0          vctrs_0.6.3              memoise_2.0.1           
 [10] RCurl_1.98-1.12          base64enc_0.1-3          progress_1.2.2          
 [13] htmltools_0.5.5          S4Arrays_1.0.4           usethis_2.2.2           
 [16] polspline_1.1.22         curl_5.0.1               Formula_1.2-5           
 [19] htmlwidgets_1.6.2        plyr_1.8.8               sandwich_3.0-2          
 [22] zoo_1.8-12               SNPassoc_2.1-1           cachem_1.0.8            
 [25] GenomicAlignments_1.36.0 mime_0.12                lifecycle_1.0.3         
 [28] pkgconfig_2.0.3          Matrix_1.5-4.1           R6_2.5.1                
 [31] fastmap_1.1.1            GenomeInfoDbData_1.2.10  shiny_1.7.4.1           
 [34] digest_0.6.32            colorspace_2.1-0         ps_1.7.5                
 [37] pkgload_1.3.2.1          RSQLite_2.3.1            Hmisc_5.1-0             
 [40] filelock_1.0.2           fansi_1.0.4              httr_1.4.6              
 [43] compiler_4.3.1           remotes_2.4.2            bit64_4.0.5             
 [46] htmlTable_2.4.1          backports_1.4.1          BiocParallel_1.34.2     
 [49] DBI_1.1.3                pkgbuild_1.4.2           biomaRt_2.56.1          
 [52] MASS_7.3-60              quantreg_5.95            rappdirs_0.3.3          
 [55] DelayedArray_0.26.6      sessioninfo_1.2.2        rjson_0.2.21            
 [58] tools_4.3.1              foreign_0.8-84           httpuv_1.6.11           
 [61] nnet_7.3-19              glue_1.6.2               restfulr_0.0.15         
 [64] callr_3.7.3              nlme_3.1-162             promises_1.2.0.1        
 [67] grid_4.3.1               checkmate_2.2.0          cluster_2.1.4           
 [70] generics_0.1.3           gtable_0.3.3             poisbinom_1.0.1         
 [73] tidyr_1.3.0              hms_1.1.3                data.table_1.14.8       
 [76] xml2_1.3.4               utf8_1.2.3               pillar_1.9.0            
 [79] stringr_1.5.0            later_1.3.1              splines_4.3.1           
 [82] dplyr_1.1.2              BiocFileCache_2.8.0      lattice_0.21-8          
 [85] bit_4.0.5                survival_3.5-5           SparseM_1.81            
 [88] tidyselect_1.2.0         rms_6.7-0                miniUI_0.1.1.1          
 [91] knitr_1.43               gridExtra_2.3            xfun_0.39               
 [94] devtools_2.4.5           stringi_1.7.12           yaml_2.3.7              
 [97] evaluate_0.21            codetools_0.2-19         tibble_3.2.1            
[100] BiocManager_1.30.21      cli_3.6.1                rpart_4.1.19            
[103] xtable_1.8-4             munsell_0.5.0            processx_3.8.1          
[106] Rcpp_1.0.10              haplo.stats_1.9.3        dbplyr_2.3.3            
[109] png_0.1-8                XML_3.99-0.14            parallel_4.3.1          
[112] MatrixModels_0.5-1       ellipsis_0.3.2           blob_1.2.4              
[115] ggplot2_3.4.2            prettyunits_1.1.1        arsenal_3.6.3           
[118] profvis_0.3.8            urlchecker_1.0.1         bitops_1.0-7            
[121] mvtnorm_1.2-2            scales_1.2.1             purrr_1.0.1             
[124] crayon_1.5.2             rlang_1.1.1              KEGGREST_1.40.0         
[127] multcomp_1.4-25         

rsids SNPlocs.Hsapiens.dbSNP150.GRCh38

score 0 · Answer 1 · 2023-07-10

See ?snpcount. As an example,

> library(SNPlocs.Hsapiens.dbSNP144.GRCh38)
> snps <- SNPlocs.Hsapiens.dbSNP144.GRCh38
> my_rsids <- c("rs10458597", "rs12565286", "rs7553394")
> snpsById(snps, my_rsids, ifnotfound = "drop")
UnstitchedGPos object with 2 positions and 2 metadata columns:
      seqnames       pos strand |   RefSNP_id alleles_as_ambig
         <Rle> <integer>  <Rle> | <character>      <character>
  [1]        1    629241      * |  rs10458597                Y
  [2]        1    785910      * |  rs12565286                S
  -------
  seqinfo: 25 sequences (1 circular) from GRCh38.p2 genome

score 0 · Answer 2 · 2023-10-25

Sorry for not seeing this earlier.

To map from position to rsid, use snpsByOverlaps(). First create a GRanges or GPos object my_snps that contains the genomic positions of your SNPs, then do:

library(GenomicRanges)

library(SNPlocs.Hsapiens.dbSNP155.GRCh38)
snps <- SNPlocs.Hsapiens.dbSNP155.GRCh38

known_snps <- snpsByOverlaps(snps, my_snps)
hits <- findOverlaps(my_snps, known_snps)

## A sanity check (unlikely to happen):
if (anyDuplicated(queryHits(hits)))
    warning("some SNPs are mapped to more than 1 known SNP")

## Integer vector that maps the SNPs in 'my_snps' to the SNPs in 'known_snps':
mapping <- selectHits(hits, select="first")

mcols(my_snps)$RefSNP_id <- mcols(known_snps)$RefSNP_id[mapping]

For this to work properly, you need to make sure that:

- The SNP positions in my_snps are with respect to reference genome GRCh38.
- my_snps uses the same chromosome naming convention as the SNPlocs object ("1", "2", ... rather than "chr1", "chr2", ...).

For example, with the following SNPs:

my_snps <- GPos(Rle(c("1", "2"), c(3, 2)), pos=c(785910, 900000, 629241, 50, 900047))
my_snps
# UnstitchedGPos object with 5 positions and 0 metadata columns:
#       seqnames       pos strand
#          <Rle> <integer>  <Rle>
#   [1]        1    785910      *
#   [2]        1    900000      *
#   [3]        1    629241      *
#   [4]        2        50      *
#   [5]        2    900047      *
#   -------
#   seqinfo: 2 sequences from an unspecified genome; no seqlengths

after running the code above, my_snps will become:

my_snps
# UnstitchedGPos object with 5 positions and 1 metadata column:
#       seqnames       pos strand |    RefSNP_id
#          <Rle> <integer>  <Rle> |  <character>
#   [1]        1    785910      * |   rs12565286
#   [2]        1    900000      * |         <NA>
#   [3]        1    629241      * |   rs10458597
#   [4]        2        50      * |         <NA>
#   [5]        2    900047      * | rs1199124244
#   -------
#   seqinfo: 2 sequences from an unspecified genome; no seqlengths
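
For the table described in the question, my_snps could be built roughly as follows. This is only a sketch: the small data frame is a made-up stand-in for the result of read.table() on the real file, and it assumes that any mismatch in chromosome names is just a leading "chr" prefix.

```r
library(GenomicRanges)

## Hypothetical stand-in for:
##   data <- read.table("10_79854257_rsid.txt", header = TRUE)
data <- data.frame(
    chromosome = c("chr1", "chr1", "chr2"),
    start      = c(785910, 629241, 900047),
    end        = c(785910, 629241, 900047)
)

## Each row is expected to describe a single-base position.
stopifnot(all(data$start == data$end))

## Assumption: the input may use "chr1"-style names; drop the prefix so the
## names look like the "1", "2", ... used by the SNPlocs object.
chrom <- sub("^chr", "", as.character(data$chromosome))

## One position per row, in the same order as the rows of 'data'.
my_snps <- GPos(seqnames = chrom, pos = as.integer(data$start))
my_snps
```

With my_snps built this way, the mapping code above can be run unchanged. Because my_snps keeps the row order of data, the result can then be copied back with data$rsid <- mcols(my_snps)$RefSNP_id before calling write.table().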

Hope this helps,

H.