I would like to obtain GWAS Catalog data on hg19 co-ordinates so I use makeCurrentGwascat(genome = "GRCh37")
. Everything appears to go smoothly and the object says that it is on hg19 co-ordinates. But if you compare the co-ordinates to what they should be, you will quickly realise that they are still on hg38 co-ordinates despite the liftover appearing to complete successfully. For example, rs56106601 hg38's co-ordinates are chr9:128008205 and its hg19 co-ordinates are chr9:130770484 (https://www.ncbi.nlm.nih.gov/snp/rs56106601), but attempt@elementMetadata[attempt@elementMetadata$SNPS=="rs56106601","CHR_POS"]
yields 128008205
- hg38 co-ordinates.
> library(gwascat)
gwascat loaded. Use makeCurrentGwascat() to extract current image.
from EBI. The data folder of this package has some legacy extracts.
> library(liftOver)
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:parallel’:
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter,
Find, get, grep, grepl, intersect, is.unsorted, lapply, Map,
mapply, match, mget, order, paste, pmax, pmax.int, pmin,
pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
setdiff, sort, table, tapply, union, unique, unsplit,
which.max, which.min
Loading required package: S4Vectors
Attaching package: ‘S4Vectors’
The following object is masked from ‘package:base’:
expand.grid
Loading required package: IRanges
Attaching package: ‘IRanges’
The following object is masked from ‘package:grDevices’:
windows
Loading required package: GenomeInfoDb
Loading required package: rtracklayer
Loading required package: Homo.sapiens
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages
'citation("pkgname")'.
Loading required package: OrganismDbi
Loading required package: GenomicFeatures
Loading required package: GO.db
Loading required package: org.Hs.eg.db
Loading required package: TxDb.Hsapiens.UCSC.hg19.knownGene
> attempt <- makeCurrentGwascat(genome = "GRCh37", withOnt = F)
trying URL 'http://www.ebi.ac.uk/gwas/api/search/downloads/full'
Content type 'text/tsv' length unknown
ownloaded 125.9 MB
|============================================================| 100% 126 MB
Warning: 5303 parsing failures.
row col expected actual file
669 CHR_POS no trailing characters 27556782;27543283 'C:\Users\XXX\AppData\Local\Temp\RtmpO0PFpO\file3a1c3aa8659b'
670 CHR_POS no trailing characters 27556782;27543283 'C:\Users\XXX\AppData\Local\Temp\RtmpO0PFpO\file3a1c3aa8659b'
851 CHR_POS no trailing characters 34150806;34141638 'C:\Users\XXX\AppData\Local\Temp\RtmpO0PFpO\file3a1c3aa8659b'
852 CHR_POS no trailing characters 34150806;34141638 'C:\Users\XXX\AppData\Local\Temp\RtmpO0PFpO\file3a1c3aa8659b'
853 CHR_POS no trailing characters 34150806;34141638 'C:\Users\XXX\AppData\Local\Temp\RtmpO0PFpO\file3a1c3aa8659b'
... ....... ...................... ................. .........................................................................
See problems(...) for more details.
formatting gwaswloc instance...
NOTE: input data had non-ASCII characters replaced by '*'.
Loading required namespace: AnnotationHub
starting liftover from GRCh38 to GRCh37
snapshotDate(): 2020-10-27
loading from cache
liftover complete.
done.
> attempt
gwasloc instance with 235055 records and 34 attributes per record.
Extracted: Wed Feb 03 16:30:36 2021
metadata()$badpos includes records for which no unique locus was given.
Genome: GRCh37
Excerpt:
GRanges object with 5 ranges and 3 metadata columns:
seqnames ranges strand | DISEASE/TRAIT SNPS P-VALUE
<Rle> <IRanges> <Rle> | <character> <character> <numeric>
[1] 8 19844222 * | HDL cholesterol rs12678919 2e-34
[2] 18 47167214 * | HDL cholesterol rs4939883 7e-15
[3] 11 116648917 * | HDL cholesterol rs964184 1e-12
[4] 9 107664301 * | HDL cholesterol rs1883025 1e-09
[5] 1 230295691 * | HDL cholesterol rs4846914 4e-08
-------
seqinfo: 24 sequences from GRCh37 genome
> attempt@elementMetadata[attempt@elementMetadata$SNPS=="rs56106601","CHR_POS"]
[1] 128008205
Session info etc:
> BiocManager::valid()
* sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] liftOver_1.14.0
[2] Homo.sapiens_1.3.1
[3] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[4] org.Hs.eg.db_3.12.0
[5] GO.db_3.12.1
[6] OrganismDbi_1.32.0
[7] GenomicFeatures_1.42.1
[8] AnnotationDbi_1.52.0
[9] Biobase_2.50.0
[10] rtracklayer_1.49.5
[11] GenomicRanges_1.42.0
[12] GenomeInfoDb_1.26.2
[13] IRanges_2.24.1
[14] S4Vectors_0.28.1
[15] BiocGenerics_0.36.0
[16] gwascat_2.22.0
loaded via a namespace (and not attached):
[1] bitops_1.0-6 matrixStats_0.58.0
[3] bit64_4.0.5 progress_1.2.2
[5] httr_1.4.2 tools_4.0.3
[7] R6_2.5.0 DBI_1.1.1
[9] withr_2.4.1 tidyselect_1.1.0
[11] prettyunits_1.1.1 bit_4.0.4
[13] curl_4.3 compiler_4.0.3
[15] graph_1.68.0 pacman_0.5.1
[17] xml2_1.3.2 DelayedArray_0.16.1
[19] readr_1.4.0 RBGL_1.66.0
[21] askpass_1.1 rappdirs_0.3.3
[23] stringr_1.4.0 digest_0.6.27
[25] Rsamtools_2.6.0 XVector_0.30.0
[27] pkgconfig_2.0.3 htmltools_0.5.1.1
[29] MatrixGenerics_1.2.1 dbplyr_2.0.0
[31] fastmap_1.1.0 BSgenome_1.58.0
[33] rlang_0.4.10 rstudioapi_0.13
[35] RSQLite_2.2.3 shiny_1.6.0
[37] generics_0.1.0 BiocParallel_1.24.1
[39] dplyr_1.0.4 VariantAnnotation_1.36.0
[41] RCurl_1.98-1.2 magrittr_2.0.1
[43] GenomeInfoDbData_1.2.4 Matrix_1.3-2
[45] Rcpp_1.0.6 lifecycle_0.2.0
[47] stringi_1.5.3 yaml_2.2.1
[49] SummarizedExperiment_1.20.0 zlibbioc_1.36.0
[51] BiocFileCache_1.14.0 AnnotationHub_2.22.0
[53] grid_4.0.3 blob_1.2.1
[55] promises_1.1.1 snpStats_1.40.0
[57] crayon_1.4.0 lattice_0.20-41
[59] Biostrings_2.58.0 splines_4.0.3
[61] hms_1.0.0 pillar_1.4.7
[63] biomaRt_2.46.2 XML_3.99-0.5
[65] glue_1.4.2 BiocVersion_3.12.0
[67] BiocManager_1.30.10 vctrs_0.3.6
[69] httpuv_1.5.5 openssl_1.4.3
[71] purrr_0.3.4 assertthat_0.2.1
[73] cachem_1.0.1 mime_0.9
[75] xtable_1.8-4 later_1.1.0.1
[77] survival_3.2-7 tibble_3.0.6
[79] GenomicAlignments_1.26.0 memoise_2.0.0
[81] ellipsis_0.3.1 interactiveDisplayBase_1.28.0
Bioconductor version '3.12'
* 1 packages out-of-date
* 0 packages too new
create a valid installation with
BiocManager::install("rtracklayer", update = TRUE, ask = FALSE)
more details: BiocManager::valid()$too_new, BiocManager::valid()$out_of_date
Warning message:
1 packages out-of-date; 0 packages too new
Note that rtracklayer
is still outdated because I'm a Windows user (see https://support.bioconductor.org/p/p133891/).