Hi there,
I think I found a bug in GenomicRanges: the follow function is not behaving correctly on a GRanges object - the code below should show you what I mean. I've included sessionInfo output. I get the same thing with the devel packages.
Hope it's easy to fix! Thanks, as always,
Janet
###################
library(GenomicRanges)
myGR_1 <- GRanges(seqnames="chr1", ranges=IRanges(start=c(100,1100,1500), width=1))
myGR_2 <- GRanges(seqnames="chr1", ranges=IRanges(start=c(200,1600), width=1))
## follow and precede give output that looks correct when I first extract the ranges:
follow( ranges(myGR_1), ranges(myGR_2))
# [1] NA 1 1
precede( ranges(myGR_1), ranges(myGR_2))
# [1] 1 2 2
## but if I use them on the GRanges objects themselves, follow gives the wrong output (looks like it's calling precede):
follow( myGR_1, myGR_2)
# [1] 1 2 2
precede( myGR_1, myGR_2)
# [1] 1 2 2
sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] GenomicRanges_1.20.8 GenomeInfoDb_1.4.2 IRanges_2.2.7 S4Vectors_0.6.6 BiocGenerics_0.14.0
loaded via a namespace (and not attached):
[1] XVector_0.8.0
#### same issue using the devel version:
sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] GenomicRanges_1.21.28 GenomeInfoDb_1.5.15 IRanges_2.3.21
[4] S4Vectors_0.7.18 BiocGenerics_0.15.6
loaded via a namespace (and not attached):
[1] zlibbioc_1.15.0 BiocInstaller_1.19.14 XVector_0.9.4
[4] tools_3.2.2 tcltk_3.2.2
Yes, treatment of '*' strand in this context does seem to confuse many. What would be a better solution? All overlap operations match '*' strand to both '+' and '-' so this behavior is consistent in that regard. The behavior for different strand combinations is thoroughly documented on the man page.
Valerie
The "*" could be interpreted as meaning "ignore", in which case behavior should conform to that of ranges(gr) (as well as considering seqnames). That would still be consistent with the current findOverlaps() behavior.
I get it now - thank you both. Yes, that is a bit confusing: I had understood "*" as meaning "no strand" rather than "both strands" (in this context). Turns out "ignore.strand=TRUE" is what I should be doing: I just want the next region to the "right" in the genome.
Michael's suggestion to interpret "*" as ignore.strand=TRUE does make intuitive sense to me, but obviously I'm not aware of the wider implications. As it stands it seems that follow, precede and nearest are all equivalent to nearest for * strand objects (is that right?): I think the naive user would simply use "nearest" if that's the behaviour they're looking for. However, I'm having a bit of a hard time thinking about the big picture: my current use case is a dataset of objects that ALL have strand="*" - I'm not sure what a sensible use case would be if the objects were a mixture of +/-/* strands (or even what kind of dataset would have a mix of +/- with * strands).
The documentation: you're right - it turns out it is well-documented, but because I thought my ranges had no strand, I had totally ignored the relevant section of the help page: oops! If I'm not the first one to be confused by this, I wonder if it would help to add something like this in the details section for precede/follow: "Regions with +/-/* strands behave differently in this function: see Orientation and Strand section, below".
Actually, precede() and follow() gave the nearest but not overlapped range in that version.
Hi,
I agree that the current interpretation of
*
is confusing. IMO it should be consistent with what you get when usingignore.strand=TRUE
:The current behavior gives you the nearest range. If people want the nearest, they have
nearest()
for that:H.
I agree; '*' should be equivalent to ignore.strand=TRUE.
Thanks for the feedback.
In GenomicRanges 1.21.29 behavior of '*' strand is consistent with ignore.strand = TRUE.
Valerie