Question

GRanges::distanceToNearest does not return distance to nearest

0

Entering edit mode

Daniel Cameron ▴ 40

@daniel-cameron-10086

Last seen 3.0 years ago

Australia

I am attempting to find the distance to the closest upstream but distanceToNearest does not appear to return the closest.

A simple reproduction is as follows:

> subject <- GRanges(seqnames="chr1", ranges=IRanges(start=c(1000, 1500, 2000, 2500, 3000, 3500), width=1),strand=c("+", "-", "+", "-", "+", "-"))

> distanceToNearest(GRanges(seqnames="chr1", ranges=IRanges(start=1100, width=1), strand="*"), subject, select="all")

Hits object with 2 hits and 1 metadata column:

      queryHits subjectHits |  distance

      <integer>   <integer> | <integer>

  [1]         1           1 |        99

  [2]         1           2 |       399

  -------

  queryLength: 1 / subjectLength: 6

So far so good. We're 99bp downstream from the from the first subject row, and 399 downstream on the opposite strand from the second subject row. Without the subject parameter, one would expect this to return the 99 as that's less than 399.

> distanceToNearest(GRanges(seqnames="chr1", ranges=IRanges(start=1100, width=1), strand="*"), subject)

Hits object with 1 hit and 1 metadata column:

      queryHits subjectHits |  distance

      <integer>   <integer> | <integer>

  [1]         1           2 |       399

  -------

  queryLength: 1 / subjectLength: 6

Why do I get the 399 result, and not the 99 result? The documentation states that select resolves arbitrarily for nearest(), but that doesn't make sense for distancetoNearest() when there is an actual distance metric returned. Why isn't the hit with the minimum distance returned? I'm confused as to how distancetoNearest() can be useful for queries with strand="*" with its current behaviour. Am I missing something?

> sessionInfo()

R version 3.4.2 (2017-09-28)

Platform: x86_64-w64-mingw32/x64 (64-bit)

Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:

[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      

[5] LC_TIME=English_Australia.1252    


attached base packages:

[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     


other attached packages:

 [1] biomaRt_2.34.0                          org.Hs.eg.db_3.5.0                      bindrcpp_0.2                            TxDb.Hsapiens.UCSC.hg38.knownGene_3.4.0

 [5] GenomicFeatures_1.30.0                  AnnotationDbi_1.40.0                    Biobase_2.38.0                          rtracklayer_1.38.0                     

 [9] GenomicRanges_1.30.0                    GenomeInfoDb_1.14.0                     IRanges_2.12.0                          S4Vectors_0.16.0                       

[13] BiocGenerics_0.24.0                     forcats_0.2.0                           stringr_1.2.0                           dplyr_0.7.4                            

[17] purrr_0.2.4                             readr_1.1.1                             tidyr_0.7.2                             tibble_1.3.4                           

[21] ggplot2_2.2.1                           tidyverse_1.2.1                         openxlsx_4.0.17                        


loaded via a namespace (and not attached):

 [1] nlme_3.1-131               bitops_1.0-6               matrixStats_0.52.2         lubridate_1.7.1            bit64_0.9-7                progress_1.1.2            

 [7] httr_1.3.1                 rprojroot_1.2              tools_3.4.2                backports_1.1.1            R6_2.2.2                   DBI_0.7                   

[13] lazyeval_0.2.1             colorspace_1.3-2           prettyunits_1.0.2          mnormt_1.5-5               RMySQL_0.10.13             bit_1.1-12                

[19] compiler_3.4.2             cli_1.0.0                  rvest_0.3.2                xml2_1.1.1                 DelayedArray_0.4.1         scales_0.5.0              

[25] psych_1.7.8                digest_0.6.12              Rsamtools_1.30.0           foreign_0.8-69             rmarkdown_1.8              XVector_0.18.0            

[31] base64enc_0.1-3            pkgconfig_2.0.1            htmltools_0.3.6            rlang_0.1.4                readxl_1.0.0               rstudioapi_0.7            

[37] RSQLite_2.0                bindr_0.1                  jsonlite_1.5               BiocParallel_1.12.0        RCurl_1.95-4.8             magrittr_1.5              

[43] GenomeInfoDbData_0.99.1    Matrix_1.2-12              Rcpp_0.12.14               munsell_0.4.3              yaml_2.1.14                stringi_1.1.6             

[49] SummarizedExperiment_1.8.0 zlibbioc_1.24.0            plyr_1.8.4                 grid_3.4.2                 blob_1.1.0                 crayon_1.3.4              

[55] lattice_0.20-35            Biostrings_2.46.0          haven_1.1.0                hms_0.4.0                  knitr_1.17                 reshape2_1.4.2            

[61] XML_3.98-1.9               glue_1.2.0                 evaluate_0.10.1            modelr_0.1.1               cellranger_1.1.0           gtable_0.2.0              

[67] assertthat_0.2.0           broom_0.4.3                GenomicAlignments_1.14.1   memoise_1.1.0

granges genomicranges • 4.5k views

ADD COMMENT • link updated 7.3 years ago by Valerie Obenchain ★ 6.8k • written 7.4 years ago by Daniel Cameron ▴ 40

score 0 · Answer 1 · 2017-12-05

This is explained in the help page for distanceToNearest. Do note that under the hood, distanceToNearest calls an internal function .nearest, which uses precede and follow to determine which is closest, and then chooses the distance based on the output from precede and follow. So if we look at the help page, you will see this:

          Orientation and strand for 'precede' and 'follow':
          Orientation is 5' to 3', consistent with the direction of
          translation.  Because positional numbering along a chromosome
          is from left to right and transcription takes place from 5'
          to 3', 'precede' and 'follow' can appear to have `opposite'
          behavior on the '+' and '-' strand. Using positions 5 and 6
          as an example, 5 precedes 6 on the '+' strand but follows 6
          on the '-' strand.

          The table below outlines the orientation when ranges on
          different strands are compared. In general, a feature on '*'
          is considered to belong to both strands. The single exception
          is when both 'x' and 'subject' are '*' in which case both are
          treated as '+'.
          
          
                 x  |  subject  |  orientation
               -----+-----------+----------------
          a)     +  |  +        |  --->
          b)     +  |  -        |  NA
          c)     +  |  *        |  --->
          d)     -  |  +        |  NA
          e)     -  |  -        |  <---
          f)     -  |  *        |  <---
          g)     *  |  +        |  --->
          h)     *  |  -        |  <---
          i)     *  |  *        |  --->  (the only situation where * arbitrarily means +)

Since your 'subject' GRanges is on alternating strands, we have to interpret precede and follow using g) and h) in the above table. For precede, we get 3, because the position in 'subject' that is immediately preceded by chr1:1100-1100 is chr1:2000-2000, which is the third position (the position at chr1-1500-1500, being on the reverse strand is immediately followed by chr1:1100-1100, not preceded). And obviously follow returns 2, because of h).

Since position 2 (399 nt away) is closer than position 3 (899 nt away), distanceToNearest returns 399.

score 0 · Answer 2 · 2017-12-20

Hi,

Thanks for your patience. This is now fixed in GenomicRanges 1.31.3 in devel and 1.30.1 in release. Both should be available via biocLite() tomorrow around 3pm EST.

The bug was hitting the case where a '*' query followed both a '+' and '-' range in subject. Instead of choosing the closest range it was treating them as equals and using the default value of select = "last" to return the answer.

Let me know if you run into other problems.

Valerie