liftOver bug, reproducible
1
1
Entering edit mode
@mikhail-dozmorov-23744
Last seen 8 days ago
United States
library(GenomicRanges)
library(rtracklayer)
library(liftOver)
library(R.utils)

liftover using R

# Make a GRanges object from this sample range
gr <- GRanges(seqnames = Rle(c("chr3")),
                 ranges = IRanges(start = 195084601, end = 195195500))
# The chain file ("hg19ToHg38.over.chain") can be found at: 
# http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/
path <- "http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz"
download.file(path, "hg19ToHg38.over.chain.gz")
gunzip("hg19ToHg38.over.chain.gz", skip = TRUE)
chain = import.chain("hg19ToHg38.over.chain")

gr_lift = liftOver(gr, chain)
gr_lift[[1]]

This results in 21 ranges for the single given range. Note the start coordinate of the first range and the end coordinate of the last range: 195363872 and 195474797

liftover using the web client:

Navigate to the web based tool @ https://genome.ucsc.edu/cgi-bin/hgLiftOver and submit the following coordinates (same as in above R chunk) in the submission box:

chr3 195084601 195195500 (OR formatted as chr3:195084601-195195500 also works as input.)

Select "Human" in both the Original and New Genome dropdown menus, and select "Feb. 2009 (GRCh37/hg19)" for the Original Assembly dropdown and "Dec. 2013 (GRCh38/hg38)" for the New Assembly. Click Submit

The web tool returns a downloadable text file that should read:

chr3 195363872 195474797 (Or chr3:195363872-195474797, if it was submitted in the alternate format)

Note that these coordinates match first start coordinate and the last end coordinate from the 21 ranges produced by liftover in R.

liftOver in a terminal

Instructions are here: http://genometoolbox.blogspot.com/2013/11/install-liftover-locally-on-unix.html but also summarized below.

Run liftOver on a Mac by downloading it @ http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.x86_64/ and scrolling down to and downloading the liftOver package. Once downloaded move it to your working directory, and modify the permissions to make executable (i.e. chmod u+x liftOver). Make an input file with the coordinates, (e.g make a file gr.bed containing one line, tab-separated:

chr3    195084601   195195500

To run liftOver, the usage is:

liftOver gr.bed hg19ToHg38.over.chain gr_lift.bed gr_lift.unmapped

This returns the same results as the web tool. The file gr_lift.bed contains

chr3    195363872       195474797
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] knitr_1.31                              R.utils_2.10.1                         
 [3] R.oo_1.24.0                             R.methodsS3_1.8.1                      
 [5] liftOver_1.14.0                         Homo.sapiens_1.3.1                     
 [7] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 org.Hs.eg.db_3.12.0                    
 [9] GO.db_3.12.1                            OrganismDbi_1.32.0                     
[11] GenomicFeatures_1.42.1                  AnnotationDbi_1.52.0                   
[13] Biobase_2.50.0                          gwascat_2.22.0                         
[15] rtracklayer_1.50.0                      GenomicRanges_1.42.0                   
[17] GenomeInfoDb_1.26.2                     IRanges_2.24.1                         
[19] S4Vectors_0.28.1                        BiocGenerics_0.36.0                    

loaded via a namespace (and not attached):
 [1] MatrixGenerics_1.2.1        httr_1.4.2                  splines_4.0.3              
 [4] bit64_4.0.5                 assertthat_0.2.1            askpass_1.1                
 [7] BiocManager_1.30.10         BiocFileCache_1.14.0        RBGL_1.66.0                
[10] blob_1.2.1                  BSgenome_1.58.0             GenomeInfoDbData_1.2.4     
[13] Rsamtools_2.6.0             yaml_2.2.1                  progress_1.2.2             
[16] pillar_1.4.7                RSQLite_2.2.3               lattice_0.20-41            
[19] glue_1.4.2                  digest_0.6.27               XVector_0.30.0             
[22] htmltools_0.5.1.1           Matrix_1.3-2                XML_3.99-0.5               
[25] pkgconfig_2.0.3             biomaRt_2.46.2              zlibbioc_1.36.0            
[28] purrr_0.3.4                 BiocParallel_1.24.1         tibble_3.0.6               
[31] openssl_1.4.3               generics_0.1.0              dbscan_1.1-5               
[34] ellipsis_0.3.1              cachem_1.0.1                SummarizedExperiment_1.20.0
[37] survival_3.2-7              magrittr_2.0.1              crayon_1.4.0               
[40] memoise_2.0.0               evaluate_0.14               xml2_1.3.2                 
[43] graph_1.68.0                tools_4.0.3                 prettyunits_1.1.1          
[46] hms_1.0.0                   lifecycle_0.2.0             matrixStats_0.58.0         
[49] stringr_1.4.0               DelayedArray_0.16.1         snpStats_1.40.0            
[52] Biostrings_2.58.0           compiler_4.0.3              rlang_0.4.10               
[55] grid_4.0.3                  RCurl_1.98-1.2              rstudioapi_0.13            
[58] VariantAnnotation_1.36.0    rappdirs_0.3.3              bitops_1.0-6               
[61] rmarkdown_2.6               DBI_1.1.1                   curl_4.3                   
[64] R6_2.5.0                    GenomicAlignments_1.26.0    dplyr_1.0.4                
[67] fastmap_1.1.0               bit_4.0.4                   readr_1.4.0                
[70] stringi_1.5.3               Rcpp_1.0.6                  vctrs_0.3.6                
[73] dbplyr_2.1.0                tidyselect_1.1.0            xfun_0.20
liftOver rtracklayer • 1.9k views
ADD COMMENT
1
Entering edit mode
@michael-lawrence-3846
Last seen 2.9 years ago
United States

Thanks for the report. This is actually how liftOver() is supposed to work, as it is not a direct port of the liftover command-line tool. It would be interesting if someone could reverse engineer how the liftover tool merges ranges post lift over. Without looking at the code of course, this since that would violate the license. Or we could just come up with our own heuristic.

ADD COMMENT
0
Entering edit mode

More details on this question are available at [https://support.bioconductor.org/p/99306/][1]

One thought is to lift over only start and end coordinates. It can be fragile, I'm not sure how difficult would be to resolve difficult cases.

ADD REPLY
0
Entering edit mode

It seems to be way more complicated than one might think. Here's an issue I ran into trying to liftOver LD blocks from 37 to 38

 > gr <- import("tmp2.bed")
 > gr
 GRanges object with 1 range and 2 metadata columns:
       seqnames        ranges strand |        name     score
          <Rle>     <IRanges>  <Rle> | <character> <numeric>
   [1]     chr1 10583-1892607      * |        <NA>         0
   -------
   seqinfo: 1 sequence from an unspecified genome; no seqlengths
 > chain <- import.chain("hg19ToHg38.over.chain")
 > liftOver(gr, chain)
 GRangesList object of length 1:
 [[1]]
 GRanges object with 35 ranges and 2 metadata columns:
        seqnames          ranges strand |        name     score
           <Rle>       <IRanges>  <Rle> | <character> <numeric>
    [1]     chr1    10583-177376      * |        <NA>         0
    [2]     chr1   257667-297968      * |        <NA>         0
    [3]     chr1  585989-1630687      * |        <NA>         0
    [4]     chr1 1630690-1634405      * |        <NA>         0
    [5]     chr1 1634409-1635542      * |        <NA>         0
    ...      ...             ...    ... .         ...       ...
   [31]     chr1 1715068-1715435      * |        <NA>         0
   [32]     chr1 1715479-1715859      * |        <NA>         0
   [33]     chr1 1715925-1716008      * |        <NA>         0
   [34]     chr1 1716095-1716246      * |        <NA>         0
   [35]    chr19   242824-242864      * |        <NA>         0

 -------
 seqinfo: 2 sequences from an unspecified genome; no seqlengths
 > system("liftOver tmp2.bed hg19ToHg38.over.chain tmp2.lifted.bed umapp")
 Reading liftover chains
 Mapping coordinates
 > import("tmp2.lifted.bed")
 GRanges object with 0 ranges and 0 metadata columns:
    seqnames    ranges strand
       <Rle> <IRanges>  <Rle>
   -------
   seqinfo: no sequences
 > system("liftOver tmp2.bed hg19ToHg38.over.chain tmp2.lifted.bed umapp -multiple")
 Reading liftover chains
 Mapping coordinates
> import("tmp2.lifted.bed")
 GRanges object with 2 ranges and 2 metadata columns:
       seqnames        ranges strand |        name     score
          <Rle>     <IRanges>  <Rle> | <character> <numeric>
   [1]    chr19 242824-242864      * |        <NA>         1
   [2]     chr1 347969-501617      * |        <NA>         2
   -------
   seqinfo: 2 sequences from an unspecified genome; no seqlengths

I am not sure I would argue that UCSC's liftOver does a materially better job in this instance than Michael's version.

ADD REPLY

Login before adding your answer.

Traffic: 958 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6