GRanges::distanceToNearest does not return distance to nearest
2
0
Entering edit mode
@daniel-cameron-10086
Last seen 2.6 years ago
Australia

I am attempting to find the distance to the closest upstream but distanceToNearest does not appear to return the closest.

A simple reproduction is as follows:

 

> subject <- GRanges(seqnames="chr1", ranges=IRanges(start=c(1000, 1500, 2000, 2500, 3000, 3500), width=1),strand=c("+", "-", "+", "-", "+", "-"))

> distanceToNearest(GRanges(seqnames="chr1", ranges=IRanges(start=1100, width=1), strand="*"), subject, select="all")

Hits object with 2 hits and 1 metadata column:

      queryHits subjectHits |  distance

      <integer>   <integer> | <integer>

  [1]         1           1 |        99

  [2]         1           2 |       399

  -------

  queryLength: 1 / subjectLength: 6

 

So far so good. We're 99bp downstream from the from the first subject row, and 399 downstream on the opposite strand from the second subject row. Without the subject parameter, one would expect this to return the 99 as that's less than 399.

 

> distanceToNearest(GRanges(seqnames="chr1", ranges=IRanges(start=1100, width=1), strand="*"), subject)

Hits object with 1 hit and 1 metadata column:

      queryHits subjectHits |  distance

      <integer>   <integer> | <integer>

  [1]         1           2 |       399

  -------

  queryLength: 1 / subjectLength: 6

Why do I get the 399 result, and not the 99 result? The documentation states that select resolves arbitrarily for nearest(), but that doesn't make sense for distancetoNearest() when there is an actual distance metric returned. Why isn't the hit with the minimum distance returned? I'm confused as to how distancetoNearest() can be useful for queries with strand="*" with its current behaviour. Am I missing something?

 

 

> sessionInfo()

R version 3.4.2 (2017-09-28)

Platform: x86_64-w64-mingw32/x64 (64-bit)

Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:

[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      

[5] LC_TIME=English_Australia.1252    


attached base packages:

[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     


other attached packages:

 [1] biomaRt_2.34.0                          org.Hs.eg.db_3.5.0                      bindrcpp_0.2                            TxDb.Hsapiens.UCSC.hg38.knownGene_3.4.0

 [5] GenomicFeatures_1.30.0                  AnnotationDbi_1.40.0                    Biobase_2.38.0                          rtracklayer_1.38.0                     

 [9] GenomicRanges_1.30.0                    GenomeInfoDb_1.14.0                     IRanges_2.12.0                          S4Vectors_0.16.0                       

[13] BiocGenerics_0.24.0                     forcats_0.2.0                           stringr_1.2.0                           dplyr_0.7.4                            

[17] purrr_0.2.4                             readr_1.1.1                             tidyr_0.7.2                             tibble_1.3.4                           

[21] ggplot2_2.2.1                           tidyverse_1.2.1                         openxlsx_4.0.17                        


loaded via a namespace (and not attached):

 [1] nlme_3.1-131               bitops_1.0-6               matrixStats_0.52.2         lubridate_1.7.1            bit64_0.9-7                progress_1.1.2            

 [7] httr_1.3.1                 rprojroot_1.2              tools_3.4.2                backports_1.1.1            R6_2.2.2                   DBI_0.7                   

[13] lazyeval_0.2.1             colorspace_1.3-2           prettyunits_1.0.2          mnormt_1.5-5               RMySQL_0.10.13             bit_1.1-12                

[19] compiler_3.4.2             cli_1.0.0                  rvest_0.3.2                xml2_1.1.1                 DelayedArray_0.4.1         scales_0.5.0              

[25] psych_1.7.8                digest_0.6.12              Rsamtools_1.30.0           foreign_0.8-69             rmarkdown_1.8              XVector_0.18.0            

[31] base64enc_0.1-3            pkgconfig_2.0.1            htmltools_0.3.6            rlang_0.1.4                readxl_1.0.0               rstudioapi_0.7            

[37] RSQLite_2.0                bindr_0.1                  jsonlite_1.5               BiocParallel_1.12.0        RCurl_1.95-4.8             magrittr_1.5              

[43] GenomeInfoDbData_0.99.1    Matrix_1.2-12              Rcpp_0.12.14               munsell_0.4.3              yaml_2.1.14                stringi_1.1.6             

[49] SummarizedExperiment_1.8.0 zlibbioc_1.24.0            plyr_1.8.4                 grid_3.4.2                 blob_1.1.0                 crayon_1.3.4              

[55] lattice_0.20-35            Biostrings_2.46.0          haven_1.1.0                hms_0.4.0                  knitr_1.17                 reshape2_1.4.2            

[61] XML_3.98-1.9               glue_1.2.0                 evaluate_0.10.1            modelr_0.1.1               cellranger_1.1.0           gtable_0.2.0              

[67] assertthat_0.2.0           broom_0.4.3                GenomicAlignments_1.14.1   memoise_1.1.0             
granges genomicranges • 4.2k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 19 hours ago
United States

This is explained in the help page for distanceToNearest. Do note that under the hood, distanceToNearest calls an internal function .nearest, which uses precede and follow to determine which is closest, and then chooses the distance based on the output from precede and follow. So if we look at the help page, you will see this:

          Orientation and strand for 'precede' and 'follow':
          Orientation is 5' to 3', consistent with the direction of
          translation.  Because positional numbering along a chromosome
          is from left to right and transcription takes place from 5'
          to 3', 'precede' and 'follow' can appear to have `opposite'
          behavior on the '+' and '-' strand. Using positions 5 and 6
          as an example, 5 precedes 6 on the '+' strand but follows 6
          on the '-' strand.

          The table below outlines the orientation when ranges on
          different strands are compared. In general, a feature on '*'
          is considered to belong to both strands. The single exception
          is when both 'x' and 'subject' are '*' in which case both are
          treated as '+'.
          
          
                 x  |  subject  |  orientation
               -----+-----------+----------------
          a)     +  |  +        |  --->
          b)     +  |  -        |  NA
          c)     +  |  *        |  --->
          d)     -  |  +        |  NA
          e)     -  |  -        |  <---
          f)     -  |  *        |  <---
          g)     *  |  +        |  --->
          h)     *  |  -        |  <---
          i)     *  |  *        |  --->  (the only situation where * arbitrarily means +)

Since your 'subject' GRanges is on alternating strands, we have to interpret precede and follow using g) and h) in the above table. For precede, we get 3, because the position in 'subject' that is immediately preceded by chr1:1100-1100 is chr1:2000-2000, which is the third position (the position at chr1-1500-1500, being on the reverse strand is immediately followed by chr1:1100-1100, not preceded). And obviously follow returns 2, because of h).

Since position 2 (399 nt away) is closer than position 3 (899 nt away), distanceToNearest returns 399.

ADD COMMENT
0
Entering edit mode

I should also note that your intuition as to what should be nearest is only true if you ignore the strand information, which you can do with the ignore.strand argument:

> distanceToNearest(GRanges(seqnames="chr1", ranges=IRanges(start=1100, width=1), strand="*"), subject, ignore.strand = TRUE)
Hits object with 1 hit and 1 metadata column:
      queryHits subjectHits |  distance
      <integer>   <integer> | <integer>
  [1]         1           1 |        99
  -------
  queryLength: 1 / subjectLength: 6
ADD REPLY
0
Entering edit mode

I'm looking for the closest upstream TSS for viral insertion sites. This means my query strand is *, but I do care about transcript orientation (subjects are coordinate and strand of TSS). In the case of overlapping genes, distanceToNearest() is not returning the one with the closest TSS.

ADD REPLY
0
Entering edit mode

But the closest subject ranges are at positions 1 and 2, not 2 and 3 as can be seen by calling nearest():

> nearest(GRanges(seqnames="chr1", ranges=IRanges(start=1100, width=1), strand="*"), subject, select="all")
Hits object with 2 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  [2]         1           2
  -------
  queryLength: 1 / subjectLength: 6

Additionally, the distances of 99, and 399 (not 399 and 899) are returned by distanceToNearest() when select="all" is specified.

The problem I am having isn't with the strandedness of the nearest - the problem is that when there are two possible distances (one of each strand), distanceToNearest() does not return the lesser of the two distances.

ADD REPLY
0
Entering edit mode

Thanks for reporting this bug. We'll post back when it's fixed.

Valerie

ADD REPLY
0
Entering edit mode
@valerie-obenchain-4275
Last seen 2.9 years ago
United States

Hi,

Thanks for your patience. This is now fixed in GenomicRanges 1.31.3 in devel and 1.30.1 in release. Both should be available via biocLite() tomorrow around 3pm EST.

The bug was hitting the case where a '*' query followed both a '+' and '-' range in subject. Instead of choosing the closest range it was treating them as equals and using the default value of select = "last" to return the answer.

Let me know if you run into other problems.

Valerie

ADD COMMENT

Login before adding your answer.

Traffic: 696 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6