IRanges (GRanges): Inconsistent behaviour when overlapping ranges of width 0
1
1
Entering edit mode
balwierz ▴ 40
@balwierz-8583
Last seen 5.4 years ago
United Kingdom

I will describe the problem using GRanges, but it is solely within IRanges. Ranges of width are apparently allowed. You can obtain them for instance calling `promoters(TxDb, upstream=0, downstream=0)`  

 

Specifically, these ranges cannot be found by `distanceToNearest` _if_ they are within the subject, but are easily found if they lie outside.
They also do not appear in `subsetByOverlaps` and `findOverlaps` unless you change minoverlap from 1L (default) to 0L.

IMO minoverlap should be set to 0L by default in functions which use it, and distanceToNearest should be able to handle these ranges properly. Example code below.

Out of curiosity: how are IRanges implemented internally? Are indices following Jim Kent "UCSC" convention, or is it internally handled as as in user functions?

   a = GRanges(c("a:2-2", "a:4-3", "a:12-13", "a:15-14"))
   b = GRanges("a:1-10")
   a
GRanges object with 4 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        a  [ 2,  2]      *
  [2]        a  [ 4,  3]      *
  [3]        a  [12, 13]      *
  [4]        a  [15, 14]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
 
   width(a)
[1] 1 0 2 0

   b
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        a   [1, 10]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
 
   distanceToNearest(a,b)
Hits object with 3 hits and 1 metadata column:
      queryHits subjectHits |  distance
      <integer>   <integer> | <integer>
  [1]         1           1 |         0
  [2]         3           1 |         1
  [3]         4           1 |         4
  -------
  queryLength: 4 / subjectLength: 1
 
   subsetByOverlaps(a,b)
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        a    [2, 2]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
 
   subsetByOverlaps(a,b, minoverlap=0)
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        a    [2, 2]      *
  [2]        a    [4, 3]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
Iranges granges • 1.8k views
ADD COMMENT
0
Entering edit mode

Hi,

Thanks for the report. The current behavior of distanceToNearest() w.r.t. zero-width ranges seems wrong. I'll look into this.

FWIW an IRanges object uses 2 parallel slots to represent the ranges: one slot for the 1-based starts, and one slot for the widths.

H.

ADD REPLY
0
Entering edit mode
@herve-pages-1542
Last seen 5 days ago
Seattle, WA, United States

Hi,

distanceToNearest() (and nearest(), which was also affected) now calls findOverlaps() internally with minoverlap=1 instead of minoverlap=0. With this change, the following code now behaves as expected:

library(GenomicRanges)
a <- GRanges(c("a:2-2", "a:4-3", "a:12-13", "a:15-14"))
b <- GRanges("a:1-10")
distanceToNearest(a,b)
# Hits object with 4 hits and 1 metadata column:
#       queryHits subjectHits |  distance
#       <integer>   <integer> | <integer>
#   [1]         1           1 |         0
#   [2]         2           1 |         0
#   [3]         3           1 |         1
#   [4]         4           1 |         4
#   -------
#   queryLength: 4 / subjectLength: 1

This change also fixes the behavior of nearest() in the following situation:

query <- GRanges("chr1", IRanges(5, 10))
subject <- GRanges("chr1", IRanges(1, 4:5))
nearest(query, subject, select="all")
# Hits object with 2 hits and 0 metadata columns:
#       queryHits subjectHits
#       <integer>   <integer>
#   [1]         1           1
#   [2]         1           2
#   -------
#  queryLength: 1 / subjectLength: 2

The 2 ranges in the subject now are both considered to be nearest to the 5-10 range (instead of only the 2nd one).

The fixes are in IRanges 2.10.3 (release) and 2.11.14 (devel), and in GenomicRanges 1.28.5 (release) and 1.29.13 (devel).

For subsetByOverlaps(), my inclination would be to just change the minoverlap default from 1 to 0 in the interface of the generic and all its methods. With this change zero-width ranges in the 1st argument (i.e. in the object to subset) won't be lost anymore. However it's important to realize that even if the 1st argument doesn't contain zero-width ranges, this change will sometimes alter the results obtained so far because ranges that are adjacent to (but not strictly overlapping with) the 2nd argument will now be kept. For example:

  subsetByOverlaps(IRanges(9:12, 15), IRanges(1, 10))

will return IRanges(9:11, 15) instead of IRanges(9:10, 15).

I'll post again here when it's done.

Thanks for your feedback,

H.

ADD COMMENT
0
Entering edit mode

So I made that change to subsetByOverlaps() but in devel only (in IRanges 2.11.15). Note that it's now consistent with the snpsByOverlaps() function from the BSgenome package which has been using minoverlap=0 by default since the beginning. Other *ByOverlaps() functions (e.g. cdsByOverlaps() in the GenomicFeatures package) would need to be modified to do this too but I'll keep this for another time.

Cheers,

H.

ADD REPLY
0
Entering edit mode

Hi,

After more discussion on this (see here https://github.com/Bioconductor/IRanges/issues/1), I've decided to change the maxgap and minoverlap defaults to -1 and 0, respectively, for all the members of the findOverlaps() family. With this change everybody behaves consistently with respect to zero-width ranges and adjacent ranges. In particular, by default, nobody ignores zero-width ranges or counts adjacent ranges as overlapping. Use minoverlap=1 to ignore zero-width ranges or maxgap=0 to count adjacent ranges as overlapping.

The change is in IRanges 2.11.16, GenomicRanges 1.29.14, GenomicAlignments 1.13.6, and SummarizedExperiment 1.7.8. There are more Bioconductor packages that define methods for members of the findOverlaps() family and they will probably need to be modified too.

See ?findOverlaps (in IRanges >= 2.11.16) for more information about maxgap and the meaning of maxgap=-1.

Cheers,

H.

ADD REPLY

Login before adding your answer.

Traffic: 447 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6