Hi Dario,
Unfortunately the interpretation of the * level in the strand is a
little bit ambiguous: it can mean (a) "range/feature is on both
strands" but it can also mean (b) "strand is unknown" or "strand
is irrelevant". We used to allow NAs in the strand factor in the
early days of the GenomicRanges package to represent (b) so we
were able to distinguish between (a) and (b). But this didn't prove
to be useful enough to justify the "cost" so we decided to not
allow NAs in the strand anymore.
Another somewhat arbitrary design decision was that all the "set
operations" would treat * as a separate (virtual) strand:
> library(GenomicRanges)
> x <- GRanges("chr1", IRanges(2, 7), strand="+")
> y <- GRanges("chr1", IRanges(5, 10), strand="*")
> union(x, y)
GRanges with 2 ranges and 0 elementMetadata values:
seqnames ranges strand
<rle> <iranges> <rle>
[1] chr1 [2, 7] +
[2] chr1 [5, 10] *
---
seqlengths:
chr1
NA
> intersect(x, y)
GRanges with 0 ranges and 0 elementMetadata values:
seqnames ranges strand
<rle> <iranges> <rle>
---
seqlengths:
chr1
NA
> setdiff(x, y)
GRanges with 1 range and 0 elementMetadata values:
seqnames ranges strand
<rle> <iranges> <rle>
[1] chr1 [2, 7] +
---
seqlengths:
chr1
NA
So, by default (i.e. with ignore.strand=FALSE), setdiff() is just
being
consistent with the other set operations.
Not necessarily the most natural/intuitive behavior but it has the big
advantage to be something that all GenomicRanges methods can agree on,
and to be easy to describe and simple to implement.
To get what you want, you'll need to do setdiff() twice:
> strand(y) <- "+"
> z <- setdiff(x, y)
> strand(y) <- "-"
> z <- setdiff(z, y)
> z
GRanges with 1 range and 0 elementMetadata values:
seqnames ranges strand
<rle> <iranges> <rle>
[1] chr1 [2, 4] +
---
seqlengths:
chr1
NA
As for the weird behavior you get with ignore.strand=TRUE, I agree
it doesn't make much sense and it's not totally clear to me what the
correct behavior should be. Sadly the man page doesn't say anything
about what the intended behavior is (there is only a brief note about
what ignore.strand=TRUE does for parallel set operations), gives no
example, and this case is not covered by the unit tests either.
However, by looking at what union() and gaps() do with
ignore.strand=TRUE, and also at the internals of the GenomicRanges
package, it seems to me that the intention was that:
union(x, y, ignore.strand=TRUE)
intersect(x, y, ignore.strand=TRUE)
setdiff(x, y, ignore.strand=TRUE)
would be equivalent to:
union(`strand<-`(x, "+"), `strand<-`(y, "+"))
intersect(`strand<-`(x, "+"), `strand<-`(y, "+"))
setdiff(`strand<-`(x, "+"), `strand<-`(y, "+"))
So I could fix setdiff() and intersect() to do this (intersect() has
a weird behavior too), and update the documentation. Does that sound
reasonable?
Finally I'd like to mention that I see very little value in supporting
the 'ignore.strand' arg in those 3 methods so another option would be
to just get rid of it.
Cheers,
H.
On 01/15/2012 10:00 PM, Dario Strbenac wrote:
> Hi,
>
> setdiff doesn't work the way I thought it would when one GRanges has
strands and the other does not. I thought that features stranded as *
would be taken as being on both strands. The output is even stranger
if I use ignore.strand = TRUE.
>
>> x
> GRanges with 1 range and 0 elementMetadata values:
> seqnames ranges strand
> <rle> <iranges> <rle>
> [1] chr1 [2, 7] +
> ---
> seqlengths:
> chr1
> NA
>> y
> GRanges with 1 range and 0 elementMetadata values:
> seqnames ranges strand
> <rle> <iranges> <rle>
> [1] chr1 [5, 10] *
> ---
> seqlengths:
> chr1
> NA
>> setdiff(x, y)
> GRanges with 1 range and 0 elementMetadata values:
> seqnames ranges strand
> <rle> <iranges> <rle>
> [1] chr1 [2, 7] +
> ---
> seqlengths:
> chr1
> NA
>> setdiff(x, y, ignore.strand = TRUE)
> GRanges with 2 ranges and 0 elementMetadata values:
> seqnames ranges strand
> <rle> <iranges> <rle>
> [1] chr1 [1, 10] -
> [2] chr1 [1, 10] *
> ---
> seqlengths:
> chr1
> NA
>
> How do I get a result of chr1 [2, 4] + ?
>
> R version 2.14.0 (2011-10-31), GenomicRanges_1.6.4, IRanges_1.12.5
>
> --------------------------------------
> Dario Strbenac
> Research Assistant
> Cancer Epigenetics
> Garvan Institute of Medical Research
> Darlinghurst NSW 2010
> Australia
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
>
https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319