GenomicRanges bug in follow?
2
1
Entering edit mode
Janet Young ▴ 740
@janet-young-2360
Last seen 5.1 years ago
Fred Hutchinson Cancer Research Center,…

Hi there,

I think I found a bug in GenomicRanges: the follow function is not behaving correctly on a GRanges object - the code below should show you what I mean.  I've included sessionInfo output.   I get the same thing with the devel packages.

Hope it's easy to fix! Thanks, as always,

Janet

###################

 

library(GenomicRanges)

myGR_1 <- GRanges(seqnames="chr1", ranges=IRanges(start=c(100,1100,1500), width=1))
myGR_2 <- GRanges(seqnames="chr1", ranges=IRanges(start=c(200,1600), width=1))

## follow and precede give output that looks correct when I first extract the ranges:
follow( ranges(myGR_1), ranges(myGR_2))
# [1] NA  1  1
precede( ranges(myGR_1), ranges(myGR_2))
# [1] 1 2 2

## but if I use them on the GRanges objects themselves, follow gives the wrong output (looks like it's calling precede):
follow( myGR_1, myGR_2)
# [1] 1 2 2
precede( myGR_1, myGR_2)
# [1] 1 2 2

sessionInfo()

R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GenomicRanges_1.20.8 GenomeInfoDb_1.4.2   IRanges_2.2.7        S4Vectors_0.6.6      BiocGenerics_0.14.0 

loaded via a namespace (and not attached):
[1] XVector_0.8.0

 

#### same issue using the devel version:

sessionInfo()

R version 3.2.2 (2015-08-14)

Platform: x86_64-pc-linux-gnu (64-bit)

Running under: Ubuntu 14.04.2 LTS

 

locale:

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              

 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    

 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   

 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 

 [9] LC_ADDRESS=C               LC_TELEPHONE=C            

[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

 

attached base packages:

[1] stats4    parallel  stats     graphics  grDevices utils     datasets 

[8] methods   base     

 

other attached packages:

[1] GenomicRanges_1.21.28 GenomeInfoDb_1.5.15   IRanges_2.3.21       

[4] S4Vectors_0.7.18      BiocGenerics_0.15.6  

 

loaded via a namespace (and not attached):

[1] zlibbioc_1.15.0       BiocInstaller_1.19.14 XVector_0.9.4        

[4] tools_3.2.2           tcltk_3.2.2          

 

genomicranges • 2.8k views
ADD COMMENT
0
Entering edit mode
@michael-lawrence-3846
Last seen 3.0 years ago
United States

This seems to cause a lot of confusion. When ignore.strand=FALSE, the "*" strand (indicating the direction) is interpreted as "either", so precede and follow end up returning the same thing. A basic rule of thumb is to always pass ignore.strand=FALSE, unless strand does matter.

ADD COMMENT
0
Entering edit mode

Yes, treatment of '*' strand in this context does seem to confuse many. What would be a better solution? All overlap operations match '*' strand to both '+' and '-' so this behavior is consistent in that regard. The behavior for different strand combinations is thoroughly documented on the man page.

Valerie

ADD REPLY
0
Entering edit mode

The "*" could be interpreted as meaning "ignore", in which case behavior should conform to that of ranges(gr) (as well as considering seqnames). That would still be consistent with the current findOverlaps() behavior.

ADD REPLY
0
Entering edit mode

I get it now - thank you both.  Yes, that is a bit confusing: I had understood "*" as meaning "no strand" rather than "both strands" (in this context).  Turns out "ignore.strand=TRUE" is what I should be doing: I just want the next region to the "right" in the genome.

Michael's suggestion to interpret "*" as ignore.strand=TRUE does make intuitive sense to me, but obviously I'm not aware of the wider implications. As it stands it seems that follow, precede and nearest are all equivalent to nearest for * strand objects (is that right?): I think the naive user would simply use "nearest" if that's the behaviour they're looking for.  However, I'm having a bit of a hard time thinking about the big picture:  my current use case is a dataset of objects that ALL have strand="*" - I'm not sure what a sensible use case would be if the objects were a mixture of +/-/* strands (or even what kind of dataset would have a mix of +/- with * strands). 

The documentation: you're right - it turns out it is well-documented, but because I thought my ranges had no strand, I had totally ignored the relevant section of the help page: oops!  If I'm not the first one to be confused by this, I wonder if it would help to add something like this in the details section for precede/follow: "Regions with +/-/* strands behave differently in this function: see Orientation and Strand section, below".  

ADD REPLY
0
Entering edit mode

Actually, precede() and follow() gave the nearest but not overlapped range in that version.

> (peak1 <- GRanges("chr1", IRanges(c(3, 7), width = 1)))
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1    [3, 3]      *
  [2]     chr1    [7, 7]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> (peak2 <- GRanges("chr1", IRanges(c(1, 3, 5, 7, 9), width = 1)))
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1    [1, 1]      *
  [2]     chr1    [3, 3]      *
  [3]     chr1    [5, 5]      *
  [4]     chr1    [7, 7]      *
  [5]     chr1    [9, 9]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> precede(peak1, peak2, select = "all") # nearest but not overlapped
Hits of length 4
queryLength: 2
subjectLength: 5
  queryHits subjectHits
   <integer>   <integer>
1         1           1
2         1           3
3         2           3
4         2           5
> identical(precede(peak1, peak2, select = "all"),
+           follow(peak1, peak2, select = "all"))
[1] TRUE
> nearest(peak1, peak2, select = "all") # including overlapped
Hits of length 2
queryLength: 2
subjectLength: 5
  queryHits subjectHits
   <integer>   <integer>
1         1           2
2         2           4 
ADD REPLY
0
Entering edit mode

Hi,

I agree that the current interpretation of * is confusing. IMO it should be consistent with what you get when using ignore.strand=TRUE:

follow(myGR_1, myGR_2)
# [1] 1 2 2
follow(myGR_1, myGR_2, ignore.strand=TRUE)
# [1] NA  1  1

The current behavior gives you the nearest range. If people want the nearest, they have nearest() for that:

nearest(myGR_1, myGR_2)
# [1] 1 2 2

H.

 

ADD REPLY
0
Entering edit mode

I agree; '*' should be equivalent to ignore.strand=TRUE.

ADD REPLY
0
Entering edit mode

Thanks for the feedback. 

In GenomicRanges 1.21.29 behavior of '*' strand is consistent with ignore.strand = TRUE. 

myGR_1 <- GRanges("chr1", IRanges(start=c(100,1100,1500), width=1))
myGR_2 <- GRanges("chr1", IRanges(start=c(200,1600), width=1))

> precede(myGR_1, myGR_2, ignore.strand=FALSE)
[1] 1 2 2
> precede(myGR_1, myGR_2, ignore.strand=TRUE)
[1] 1 2 2

> follow(myGR_1, myGR_2, ignore.strand=FALSE)
[1] NA  1  1
> follow(myGR_1, myGR_2, ignore.strand=TRUE)
[1] NA  1  1

Valerie

 

ADD REPLY
0
Entering edit mode
Janet Young ▴ 740
@janet-young-2360
Last seen 5.1 years ago
Fred Hutchinson Cancer Research Center,…

Thanks all - much appreciated!  That makes more intuitive sense to me now.

ADD COMMENT
0
Entering edit mode

Hi Janet,

Could you please look at C: `precede` does not perform as expected. This change introduces something that is not very meaningful in biology.

ADD REPLY

Login before adding your answer.

Traffic: 607 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6