Behaviour of within in findOverlaps function in connection with maxgap parameter
1
0
Entering edit mode
hhoeflin ▴ 10
@hhoeflin-7706
Last seen 8.5 years ago
Switzerland

Hi, 

I am using the findOverlaps function and find some behaviour that I was puzzled about, specifically from the maxgap parameter. According to the IRanges manual

The ‘maxgap’ parameter has special meaning with the special overlap types. For ‘start’, ‘end’, and ‘equal’, it specifies the maximum difference in the starts, ends or both, respectively. For ‘within’, it is the maximum amount by which the query may be wider than the subject.

So - from what I understand it - when I use type="within", with a maxgap of 10, then any "query" that is contained in [subject.start - 10, subject.end + 10] should be returned (or in a different reading, any query where the part outside of subject added up on both sides of the subject is not larger than 10). Do I misunderstand that? Below some code where this is not happening

granges_within_query <- GRanges(seqnames=1, ranges=IRanges(6 - (1:5), 6 + 1:5), strand="*")
granges_within_subject <- GRanges(seqnames=1, ranges=IRanges(3, 9), strand="*")

 

> granges_within_query
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1   [5,  7]      *
  [2]        1   [4,  8]      *
  [3]        1   [3,  9]      *
  [4]        1   [2, 10]      *
  [5]        1   [1, 11]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> granges_within_subject
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1    [3, 9]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

 

Now I run the findOverlaps function 

 

res_within_0 <- GenomicRanges::findOverlaps(granges_within_query, granges_within_subject, type="within", select="all", minoverlap=1, maxgap=0)
res_within_1 <- GenomicRanges::findOverlaps(granges_within_query, granges_within_subject, type="within", select="all", minoverlap=1, maxgap=1)
res_within_2 <- GenomicRanges::findOverlaps(granges_within_query, granges_within_subject, type="within", select="all", minoverlap=1, maxgap=2)

with results

> res_within_0
Hits object with 3 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  [2]         2           1
  [3]         3           1
  -------
  queryLength: 5 / subjectLength: 1
> res_within_1
Hits object with 1 hit and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         3           1
  -------
  queryLength: 5 / subjectLength: 1
> res_within_2
Hits object with 2 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         2           1
  [2]         3           1
  -------
  queryLength: 5 / subjectLength: 1

 

Here, res_within_0 gives the expected result - the first 3 ranges of the query. However, with maxgap=1, I would have expected that the first 3 are returned, with maxgap=2, the first 4 (if we interpret "the query may be wider than the subject" in the manual to mean wider on both sides combined rather than on each side separately ...). However, with maxgap=1, only the 3rd is returned (equal to the subject) and with maxgap=2 also the second (but why not the fourth?).

 

Any help would be appreciated. 


Thanks

Holger

 

 

 

IRanges genomicranges • 1.9k views
ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 2 hours ago
Seattle, WA, United States

Hi Holger,

You found a mistake in our documentation. Query and subject are inverted in the sentence:

        For within, it is the maximum amount by which the query may be wider than the subject.

Should be:

        For within, it is the maximum amount by which the subject may be wider than the query.

When type is within, the query should always be within the subject. If, in addition, a non-zero maxgap is specified, then an additional criteria is applied: the amount by which the subject is wider than the query (i.e. width(subject) - width(query)) should not exceed maxgap.

Ranges 4 and 5 in your granges_within_query object are not within the subject so they don't generate hits. When you use maxgap=1 (in addition to type="within"), you loose the 1st and 2nd hits because the amount by which the subject is wider than the query for these hits is 4 and 2, respectively. When you use maxgap=2 (in addition to type="within"), you get the 2nd hit back because the amount by which the subject is wider than the query for this hit is 2.

It seems that this mistake in our documentation has been around for years (since the very early days of findOverlaps). Excellent catch! I just fixed it in IRanges 2.6.1 (release) and 2.7.9 (devel). Both versions should become available via biocLite() in the next 48 hours or so.

Cheers,

H.

ADD COMMENT

Login before adding your answer.

Traffic: 681 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6