Incomplete GWAS Catalog Data from makeCurrentGwascat()
anailis • 0
@user-24555

Note: this post is also on Biostars. The suggestion I got there was that my internet connection was failing, but no error message indicates this and I consistently get 6,427 records, so I am not convinced that is the reason. If it is, does anyone have advice on a fix or alternative that's not "get better internet"?

I want to query the GWAS Catalog using the gwascat package in R. I was surprised to see that makeCurrentGwascat() returns only 6,427 associations when there are many more in the GWAS Catalog. Is this what I am meant to be observing, or is something going wrong here?

> cat1 <- makeCurrentGwascat()
running read.delim on http://www.ebi.ac.uk/gwas/api/search/downloads/alternative...
formatting gwaswloc instance...
NOTE: input data had non-ASCII characters replaced by '*'.
Warning message:
In gwdf2GRanges(tab, extractDate = as.character(Sys.Date())) :
  NAs introduced by coercion
> cat1
gwasloc instance with 6427 records and 38 attributes per record.
Extracted:  2021-01-12 
Genome:  GRCh38 
Excerpt:
GRanges object with 5 ranges and 3 metadata columns:
      seqnames    ranges strand |                 DISEASE/TRAIT        SNPS   P-VALUE
         <Rle> <IRanges>  <Rle> |                   <character> <character> <numeric>
  [1]       22  41151150      * | General risk tolerance (MTAG)  rs75843224     6e-14
  [2]        1 207861610      * | General risk tolerance (MTAG)    rs984983     6e-14
  [3]        2  59787624      * | General risk tolerance (MTAG)   rs6732097     6e-14
  [4]       12 102069362      * | General risk tolerance (MTAG)  rs17437668     9e-14
  [5]        6  26173250      * | General risk tolerance (MTAG)  rs34661691     9e-14
  -------
  seqinfo: 23 sequences from GRCh38 genome

Contrast this with the 2016 snapshot that ships with the package, which has more associations:

data(ebicat38)
ebicat38
gwasloc instance with 22714 records and 36 attributes per record.
Extracted:  2016-01-18 
Genome:  GRCh38 
Excerpt:
GRanges object with 5 ranges and 3 metadata columns:
      seqnames    ranges strand |                  DISEASE/TRAIT        SNPS   P-VALUE
         <Rle> <IRanges>  <Rle> |                    <character> <character> <numeric>
  [1]       11  41798900      * | Post-traumatic stress disorder  rs10768747     5e-06
  [2]       15  34768262      * | Post-traumatic stress disorder  rs12232346     2e-06
  [3]        8  96500749      * | Post-traumatic stress disorder   rs2437772     6e-06
  [4]        9  98221544      * | Post-traumatic stress disorder   rs7866350     1e-06
  [5]       15  54423444      * | Post-traumatic stress disorder  rs73419609     6e-06
  -------
  seqinfo: 23 sequences from GRCh38 genome
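
The comparison can be made explicit by comparing record counts. This is only a rough sketch: `looks_truncated` is a hypothetical helper (not part of gwascat), and it assumes that a fresh download should never contain fewer records than an older snapshot.

```r
# Hypothetical helper (not part of gwascat): flag a download whose record
# count is below that of an older reference snapshot.
looks_truncated <- function(n_new, n_reference) {
  stopifnot(is.numeric(n_new), is.numeric(n_reference))
  n_new < n_reference
}

# Record counts copied from the printed output above.
looks_truncated(6427, 22714)

# With both objects in the session, the counts could be taken directly:
# looks_truncated(length(cat1), length(ebicat38))
```

A result of `TRUE` here only says the new object is smaller than the old one; it does not say why.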

My session info:

> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gwascat_2.18.0                          Homo.sapiens_1.3.1                      TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 org.Hs.eg.db_3.10.0                    
 [5] GO.db_3.10.0                            OrganismDbi_1.28.0                      GenomicFeatures_1.38.2                  GenomicRanges_1.38.0                   
 [9] GenomeInfoDb_1.22.1                     AnnotationDbi_1.48.0                    IRanges_2.20.2                          S4Vectors_0.24.4                       
[13] Biobase_2.46.0                          BiocGenerics_0.32.0                    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5                  lattice_0.20-41             prettyunits_1.1.1           Rsamtools_2.2.3             Biostrings_2.54.0           assertthat_0.2.1           
 [7] digest_0.6.27               asreml_4.1.0.110            BiocFileCache_1.10.2        R6_2.5.0                    RSQLite_2.2.2               httr_1.4.2                 
[13] ggplot2_3.3.3               pillar_1.4.7                zlibbioc_1.32.0             rlang_0.4.10                progress_1.2.2              curl_4.3                   
[19] rstudioapi_0.13             data.table_1.13.6           blob_1.2.1                  Matrix_1.2-18               BiocParallel_1.20.1         stringr_1.4.0              
[25] RCurl_1.98-1.2              bit_4.0.4                   biomaRt_2.42.1              munsell_0.5.0               DelayedArray_0.12.3         compiler_3.6.2             
[31] rtracklayer_1.46.0          pkgconfig_2.0.3             askpass_1.1                 openssl_1.4.3               tidyselect_1.1.0            SummarizedExperiment_1.16.1
[37] tibble_3.0.4                GenomeInfoDbData_1.2.2      matrixStats_0.57.0          XML_3.99-0.3                crayon_1.3.4                dplyr_1.0.2                
[43] dbplyr_2.0.0                GenomicAlignments_1.22.1    bitops_1.0-6                rappdirs_0.3.1              RBGL_1.62.1                 grid_3.6.2                 
[49] gtable_0.3.0                lifecycle_0.2.0             DBI_1.1.0                   magrittr_2.0.1              scales_1.1.1                graph_1.64.0               
[55] stringi_1.5.3               XVector_0.26.0              ellipsis_0.3.1              generics_0.1.0              vctrs_0.3.6                 tools_3.6.2                
[61] bit64_4.0.5                 glue_1.4.2                  purrr_0.3.4                 hms_0.5.3                   colorspace_2.0-0            BiocManager_1.30.10        
[67] memoise_1.1.0

Thanks all.

gwascat r • 1.2k views
@vincent-j-carey-jr-4

Your observation is correct: 6,427 records is not the full catalog. I would advise you to use a current version of R (at least 4.0). This is the result I get with a current installation:

> library(gwascat)
1/70 packages newly attached/loaded, see sessionInfo() for details.
> options(timeout=360)
> cur = makeCurrentGwascat()
trying URL 'http://www.ebi.ac.uk/gwas/api/search/downloads/alternative'
downloaded 142.6 MB

|==================================================================| 100% 142 MB
Warning: 5260 parsing failures.
  row            col               expected                actual                                file
   72 SNP_ID_CURRENT no trailing characters 2162231-C             '/tmp/Rtmpm8XCTA/file413987a497c74'
 4542 SNP_ID_CURRENT no trailing characters 7769879-?             '/tmp/Rtmpm8XCTA/file413987a497c74'
19088 CHR_POS        no trailing characters 24486138 x 29201690   '/tmp/Rtmpm8XCTA/file413987a497c74'
19089 CHR_POS        no trailing characters 138645814 x 118244643 '/tmp/Rtmpm8XCTA/file413987a497c74'
19090 CHR_POS        no trailing characters 118661955 x 170402454 '/tmp/Rtmpm8XCTA/file413987a497c74'
..... .............. ...................... ..................... ...................................
See problems(...) for more details.

formatting gwaswloc instance...
NOTE: input data had non-ASCII characters replaced by '*'.
done.
> cur
gwasloc instance with 216521 records and 38 attributes per record.
Extracted:  2021-01-13 
metadata()$badpos includes records for which no unique locus was given.
Genome:  GRCh38 
Excerpt:
GRanges object with 5 ranges and 3 metadata columns:
      seqnames    ranges strand |          DISEASE/TRAIT        SNPS   P-VALUE
         <Rle> <IRanges>  <Rle> |            <character> <character> <numeric>
  [1]       22  41151150      * | General risk toleran..  rs75843224     6e-14
  [2]        1 207861610      * | General risk toleran..    rs984983     6e-14
  [3]        2  59787624      * | General risk toleran..   rs6732097     6e-14
  [4]       12 102069362      * | General risk toleran..  rs17437668     9e-14
  [5]        6  26173250      * | General risk toleran..  rs34661691     9e-14
  -------
  seqinfo: 24 sequences from GRCh38 genome

I cannot guarantee that the timeout option setting given above will help you, but it is worth a try. My sessionInfo() result, which corresponds to a valid installation of all packages according to BiocManager::valid(), is
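
For context, R's `timeout` option defaults to 60 seconds, which may be short for a download of roughly 142 MB on a slow connection. Whether the timeout is actually the cause of the truncated result is not established here. A minimal sketch, assuming it might be, that raises the limit only for the duration of one call and then restores the previous value; `with_longer_timeout` is a hypothetical helper, not part of gwascat:

```r
# Hypothetical helper: evaluate an expression with a longer download
# timeout, then restore whatever timeout was set before.
with_longer_timeout <- function(expr, seconds = 360) {
  old <- options(timeout = seconds)  # options() returns the previous setting
  on.exit(options(old), add = TRUE)  # restore on exit, even after an error
  force(expr)                        # lazy argument is evaluated here
}

# Shows the timeout in effect inside the call.
with_longer_timeout(getOption("timeout"))

# Intended use (requires gwascat and network access, so not run here):
# cur <- with_longer_timeout(makeCurrentGwascat())
```

Setting `options(timeout=360)` globally, as in the session above, has the same effect for the rest of the session; the wrapper just avoids changing the setting permanently.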

> sessionInfo()
R version 4.0.2 Patched (2020-07-19 r78892)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS (fossa-melisa X20)

Matrix products: default
BLAS:   /home/stvjc/R-4-0-dist/lib/R/lib/libRblas.so
LAPACK: /home/stvjc/R-4-0-dist/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] gwascat_2.22.0 rmarkdown_2.6 

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5                  lattice_0.20-41            
 [3] prettyunits_1.1.1           Rsamtools_2.6.0            
 [5] Biostrings_2.58.0           assertthat_0.2.1           
 [7] digest_0.6.27               BiocFileCache_1.14.0       
 [9] R6_2.5.0                    GenomeInfoDb_1.26.2        
[11] stats4_4.0.2                RSQLite_2.2.2              
[13] evaluate_0.14               httr_1.4.2                 
[15] pillar_1.4.7                zlibbioc_1.36.0            
[17] rlang_0.4.10                GenomicFeatures_1.42.1     
[19] progress_1.2.2              curl_4.3                   
[21] blob_1.2.1                  S4Vectors_0.28.1           
[23] Matrix_1.3-2                startup_0.15.0             
[25] splines_4.0.2               BiocParallel_1.24.1        
[27] readr_1.4.0                 stringr_1.4.0              
[29] RCurl_1.98-1.2              bit_4.0.4                  
[31] biomaRt_2.46.0              DelayedArray_0.16.0        
[33] rtracklayer_1.50.0          compiler_4.0.2             
[35] xfun_0.20                   askpass_1.1                
[37] pkgconfig_2.0.3             BiocGenerics_0.36.0        
[39] htmltools_0.5.1             openssl_1.4.3              
[41] tidyselect_1.1.0            SummarizedExperiment_1.20.0
[43] tibble_3.0.4                GenomeInfoDbData_1.2.4     
[45] IRanges_2.24.1              matrixStats_0.57.0         
[47] XML_3.99-0.5                crayon_1.3.4               
[49] dplyr_1.0.2                 dbplyr_2.0.0               
[51] GenomicAlignments_1.26.0    bitops_1.0-6               
[53] rappdirs_0.3.1              grid_4.0.2                 
[55] lifecycle_0.2.0             DBI_1.1.0                  
[57] magrittr_2.0.1              stringi_1.5.3              
[59] XVector_0.30.0              xml2_1.3.2                 
[61] snpStats_1.40.0             ellipsis_0.3.1             
[63] generics_0.1.0              vctrs_0.3.6                
[65] tools_4.0.2                 bit64_4.0.5                
[67] BSgenome_1.58.0             Biobase_2.50.0             
[69] glue_1.4.2                  purrr_0.3.4                
[71] hms_0.5.3                   MatrixGenerics_1.2.0       
[73] survival_3.2-7              parallel_4.0.2             
[75] AnnotationDbi_1.52.0        BiocManager_1.30.10        
[77] GenomicRanges_1.42.0        memoise_1.1.0              
[79] knitr_1.30                  VariantAnnotation_1.36.0

Updating R and redownloading all of the packages solved my problem. Less of an internet issue, more of a me being lazy issue :) Thanks.
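
For anyone landing here with the same symptom, a rough sketch of how the installation could be checked before re-running the download. `r_is_recent` is a hypothetical helper; the BiocManager calls are left as comments because they need the BiocManager package and network access.

```r
# Hypothetical helper: is the running R at least the given version?
# The 4.0.0 default follows the advice in the answer above.
r_is_recent <- function(min_version = "4.0.0") {
  getRversion() >= min_version
}

r_is_recent()

# With BiocManager installed, the Bioconductor packages can be checked
# and brought up to date (not run here):
# BiocManager::valid()    # TRUE, or a report of out-of-date / too-new packages
# BiocManager::install()  # offers to update outdated packages
```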
