Question

GenomicRanges bug in follow?

1

Entering edit mode

Janet Young ▴ 740

@janet-young-2360

Last seen 5.4 years ago

Fred Hutchinson Cancer Research Center,…

Hi there,

I think I found a bug in GenomicRanges: the follow function is not behaving correctly on a GRanges object - the code below should show you what I mean. I've included sessionInfo output. I get the same thing with the devel packages.

Hope it's easy to fix! Thanks, as always,

Janet

###################

library(GenomicRanges)

myGR_1 <- GRanges(seqnames="chr1", ranges=IRanges(start=c(100,1100,1500), width=1))
myGR_2 <- GRanges(seqnames="chr1", ranges=IRanges(start=c(200,1600), width=1))

## follow and precede give output that looks correct when I first extract the ranges:
follow( ranges(myGR_1), ranges(myGR_2))
# [1] NA 1 1
precede( ranges(myGR_1), ranges(myGR_2))
# [1] 1 2 2

## but if I use them on the GRanges objects themselves, follow gives the wrong output (looks like it's calling precede):
follow( myGR_1, myGR_2)
# [1] 1 2 2
precede( myGR_1, myGR_2)
# [1] 1 2 2

sessionInfo()

R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] GenomicRanges_1.20.8 GenomeInfoDb_1.4.2 IRanges_2.2.7 S4Vectors_0.6.6 BiocGenerics_0.14.0

loaded via a namespace (and not attached):
[1] XVector_0.8.0

#### same issue using the devel version:

sessionInfo()

R version 3.2.2 (2015-08-14)

Platform: x86_64-pc-linux-gnu (64-bit)

Running under: Ubuntu 14.04.2 LTS

locale:

[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C

[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8

[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8

[7] LC_PAPER=en_US.UTF-8 LC_NAME=C

[9] LC_ADDRESS=C LC_TELEPHONE=C

[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:

[1] stats4 parallel stats graphics grDevices utils datasets

[8] methods base

other attached packages:

[1] GenomicRanges_1.21.28 GenomeInfoDb_1.5.15 IRanges_2.3.21

[4] S4Vectors_0.7.18 BiocGenerics_0.15.6

loaded via a namespace (and not attached):

[1] zlibbioc_1.15.0 BiocInstaller_1.19.14 XVector_0.9.4

[4] tools_3.2.2 tcltk_3.2.2

genomicranges • 3.0k views

ADD COMMENT • link 9.6 years ago • updated 9.5 years ago Janet Young ▴ 740

score 0 · Answer 1 · 2015-09-21

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.3 years ago

United States

This seems to cause a lot of confusion. When ignore.strand=FALSE, the "*" strand (indicating the direction) is interpreted as "either", so precede and follow end up returning the same thing. A basic rule of thumb is to always pass ignore.strand=FALSE, unless strand does matter.

ADD COMMENT • link 9.6 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Yes, treatment of '*' strand in this context does seem to confuse many. What would be a better solution? All overlap operations match '*' strand to both '+' and '-' so this behavior is consistent in that regard. The behavior for different strand combinations is thoroughly documented on the man page.

Valerie

ADD REPLY • link 9.6 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

The "*" could be interpreted as meaning "ignore", in which case behavior should conform to that of ranges(gr) (as well as considering seqnames). That would still be consistent with the current findOverlaps() behavior.

ADD REPLY • link 9.6 years ago Michael Lawrence ★ 11k

0

Entering edit mode

I get it now - thank you both. Yes, that is a bit confusing: I had understood "*" as meaning "no strand" rather than "both strands" (in this context). Turns out "ignore.strand=TRUE" is what I should be doing: I just want the next region to the "right" in the genome.

Michael's suggestion to interpret "*" as ignore.strand=TRUE does make intuitive sense to me, but obviously I'm not aware of the wider implications. As it stands it seems that follow, precede and nearest are all equivalent to nearest for * strand objects (is that right?): I think the naive user would simply use "nearest" if that's the behaviour they're looking for. However, I'm having a bit of a hard time thinking about the big picture: my current use case is a dataset of objects that ALL have strand="*" - I'm not sure what a sensible use case would be if the objects were a mixture of +/-/* strands (or even what kind of dataset would have a mix of +/- with * strands).

The documentation: you're right - it turns out it is well-documented, but because I thought my ranges had no strand, I had totally ignored the relevant section of the help page: oops! If I'm not the first one to be confused by this, I wonder if it would help to add something like this in the details section for precede/follow: "Regions with +/-/* strands behave differently in this function: see Orientation and Strand section, below".

ADD REPLY • link 9.6 years ago Janet Young ▴ 740

0

Entering edit mode

Actually, precede() and follow() gave the nearest but not overlapped range in that version.

> (peak1 <- GRanges("chr1", IRanges(c(3, 7), width = 1)))
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1    [3, 3]      *
  [2]     chr1    [7, 7]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> (peak2 <- GRanges("chr1", IRanges(c(1, 3, 5, 7, 9), width = 1)))
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1    [1, 1]      *
  [2]     chr1    [3, 3]      *
  [3]     chr1    [5, 5]      *
  [4]     chr1    [7, 7]      *
  [5]     chr1    [9, 9]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> precede(peak1, peak2, select = "all") # nearest but not overlapped
Hits of length 4
queryLength: 2
subjectLength: 5
  queryHits subjectHits
   <integer>   <integer>
1         1           1
2         1           3
3         2           3
4         2           5
> identical(precede(peak1, peak2, select = "all"),
+           follow(peak1, peak2, select = "all"))
[1] TRUE
> nearest(peak1, peak2, select = "all") # including overlapped
Hits of length 2
queryLength: 2
subjectLength: 5
  queryHits subjectHits
   <integer>   <integer>
1         1           2
2         2           4

ADD REPLY • link 8.3 years ago wcstcyx ▴ 30

0

Entering edit mode

Hi,

I agree that the current interpretation of * is confusing. IMO it should be consistent with what you get when using ignore.strand=TRUE:

follow(myGR_1, myGR_2)
# [1] 1 2 2
follow(myGR_1, myGR_2, ignore.strand=TRUE)
# [1] NA  1  1

The current behavior gives you the nearest range. If people want the nearest, they have nearest() for that:

nearest(myGR_1, myGR_2)
# [1] 1 2 2

H.

ADD REPLY • link 9.6 years ago Hervé Pagès 16k

0

Entering edit mode

I agree; '*' should be equivalent to ignore.strand=TRUE.

ADD REPLY • link 9.6 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Thanks for the feedback.

In GenomicRanges 1.21.29 behavior of '*' strand is consistent with ignore.strand = TRUE.

myGR_1 <- GRanges("chr1", IRanges(start=c(100,1100,1500), width=1))
myGR_2 <- GRanges("chr1", IRanges(start=c(200,1600), width=1))

> precede(myGR_1, myGR_2, ignore.strand=FALSE)
[1] 1 2 2
> precede(myGR_1, myGR_2, ignore.strand=TRUE)
[1] 1 2 2

> follow(myGR_1, myGR_2, ignore.strand=FALSE)
[1] NA  1  1
> follow(myGR_1, myGR_2, ignore.strand=TRUE)
[1] NA  1  1

Valerie

ADD REPLY • link 9.5 years ago Valerie Obenchain ★ 6.8k

score 0 · Answer 2 · 2015-10-29

0

Entering edit mode

Janet Young ▴ 740

@janet-young-2360

Last seen 5.4 years ago

Fred Hutchinson Cancer Research Center,…

Thanks all - much appreciated! That makes more intuitive sense to me now.

ADD COMMENT • link 9.5 years ago Janet Young ▴ 740

0

Entering edit mode

Hi Janet,

Could you please look at C: `precede` does not perform as expected. This change introduces something that is not very meaningful in biology.

ADD REPLY • link 8.3 years ago wcstcyx ▴ 30