GRanges setdiff With Features on Both Strands

0

Entering edit mode

Dario Strbenac ★ 1.5k

@dario-strbenac-5916

Last seen 4 days ago

Australia

Hi, setdiff doesn't work the way I thought it would when one GRanges has strands and the other does not. I thought that features stranded as * would be taken as being on both strands. The output is even stranger if I use ignore.strand = TRUE. > x GRanges with 1 range and 0 elementMetadata values: seqnames ranges strand <rle> <iranges> <rle> [1] chr1 [2, 7] + --- seqlengths: chr1 NA > y GRanges with 1 range and 0 elementMetadata values: seqnames ranges strand <rle> <iranges> <rle> [1] chr1 [5, 10] * --- seqlengths: chr1 NA > setdiff(x, y) GRanges with 1 range and 0 elementMetadata values: seqnames ranges strand <rle> <iranges> <rle> [1] chr1 [2, 7] + --- seqlengths: chr1 NA > setdiff(x, y, ignore.strand = TRUE) GRanges with 2 ranges and 0 elementMetadata values: seqnames ranges strand <rle> <iranges> <rle> [1] chr1 [1, 10] - [2] chr1 [1, 10] * --- seqlengths: chr1 NA How do I get a result of chr1 [2, 4] + ? R version 2.14.0 (2011-10-31), GenomicRanges_1.6.4, IRanges_1.12.5 -------------------------------------- Dario Strbenac Research Assistant Cancer Epigenetics Garvan Institute of Medical Research Darlinghurst NSW 2010 Australia

• 2.4k views

ADD COMMENT • link updated 13.1 years ago by Hervé Pagès 16k • written 13.1 years ago by Dario Strbenac ★ 1.5k

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.2 years ago

United States

I noticed this one last week. In particular, the ignore.strands=TRUE weirdness. It is due to the use of gaps(), which returns the entire chromosome length for unrepresented strands. Thus, the subsequent union() call claims the entire chromosome. Would be good to have this fixed. Michael On Sun, Jan 15, 2012 at 10:00 PM, Dario Strbenac <d.strbenac@garvan.org.au>wrote: > Hi, > > setdiff doesn't work the way I thought it would when one GRanges has > strands and the other does not. I thought that features stranded as * would > be taken as being on both strands. The output is even stranger if I use > ignore.strand = TRUE. > > > x > GRanges with 1 range and 0 elementMetadata values: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr1 [2, 7] + > --- > seqlengths: > chr1 > NA > > y > GRanges with 1 range and 0 elementMetadata values: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr1 [5, 10] * > --- > seqlengths: > chr1 > NA > > setdiff(x, y) > GRanges with 1 range and 0 elementMetadata values: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr1 [2, 7] + > --- > seqlengths: > chr1 > NA > > setdiff(x, y, ignore.strand = TRUE) > GRanges with 2 ranges and 0 elementMetadata values: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr1 [1, 10] - > [2] chr1 [1, 10] * > --- > seqlengths: > chr1 > NA > > How do I get a result of chr1 [2, 4] + ? > > R version 2.14.0 (2011-10-31), GenomicRanges_1.6.4, IRanges_1.12.5 > > -------------------------------------- > Dario Strbenac > Research Assistant > Cancer Epigenetics > Garvan Institute of Medical Research > Darlinghurst NSW 2010 > Australia > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 13.1 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 4 weeks ago

Seattle, WA, United States

Hi Dario, Unfortunately the interpretation of the * level in the strand is a little bit ambiguous: it can mean (a) "range/feature is on both strands" but it can also mean (b) "strand is unknown" or "strand is irrelevant". We used to allow NAs in the strand factor in the early days of the GenomicRanges package to represent (b) so we were able to distinguish between (a) and (b). But this didn't prove to be useful enough to justify the "cost" so we decided to not allow NAs in the strand anymore. Another somewhat arbitrary design decision was that all the "set operations" would treat * as a separate (virtual) strand: > library(GenomicRanges) > x <- GRanges("chr1", IRanges(2, 7), strand="+") > y <- GRanges("chr1", IRanges(5, 10), strand="*") > union(x, y) GRanges with 2 ranges and 0 elementMetadata values: seqnames ranges strand <rle> <iranges> <rle> [1] chr1 [2, 7] + [2] chr1 [5, 10] * --- seqlengths: chr1 NA > intersect(x, y) GRanges with 0 ranges and 0 elementMetadata values: seqnames ranges strand <rle> <iranges> <rle> --- seqlengths: chr1 NA > setdiff(x, y) GRanges with 1 range and 0 elementMetadata values: seqnames ranges strand <rle> <iranges> <rle> [1] chr1 [2, 7] + --- seqlengths: chr1 NA So, by default (i.e. with ignore.strand=FALSE), setdiff() is just being consistent with the other set operations. Not necessarily the most natural/intuitive behavior but it has the big advantage to be something that all GenomicRanges methods can agree on, and to be easy to describe and simple to implement. To get what you want, you'll need to do setdiff() twice: > strand(y) <- "+" > z <- setdiff(x, y) > strand(y) <- "-" > z <- setdiff(z, y) > z GRanges with 1 range and 0 elementMetadata values: seqnames ranges strand <rle> <iranges> <rle> [1] chr1 [2, 4] + --- seqlengths: chr1 NA As for the weird behavior you get with ignore.strand=TRUE, I agree it doesn't make much sense and it's not totally clear to me what the correct behavior should be. Sadly the man page doesn't say anything about what the intended behavior is (there is only a brief note about what ignore.strand=TRUE does for parallel set operations), gives no example, and this case is not covered by the unit tests either. However, by looking at what union() and gaps() do with ignore.strand=TRUE, and also at the internals of the GenomicRanges package, it seems to me that the intention was that: union(x, y, ignore.strand=TRUE) intersect(x, y, ignore.strand=TRUE) setdiff(x, y, ignore.strand=TRUE) would be equivalent to: union(`strand<-`(x, "+"), `strand<-`(y, "+")) intersect(`strand<-`(x, "+"), `strand<-`(y, "+")) setdiff(`strand<-`(x, "+"), `strand<-`(y, "+")) So I could fix setdiff() and intersect() to do this (intersect() has a weird behavior too), and update the documentation. Does that sound reasonable? Finally I'd like to mention that I see very little value in supporting the 'ignore.strand' arg in those 3 methods so another option would be to just get rid of it. Cheers, H. On 01/15/2012 10:00 PM, Dario Strbenac wrote: > Hi, > > setdiff doesn't work the way I thought it would when one GRanges has strands and the other does not. I thought that features stranded as * would be taken as being on both strands. The output is even stranger if I use ignore.strand = TRUE. > >> x > GRanges with 1 range and 0 elementMetadata values: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr1 [2, 7] + > --- > seqlengths: > chr1 > NA >> y > GRanges with 1 range and 0 elementMetadata values: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr1 [5, 10] * > --- > seqlengths: > chr1 > NA >> setdiff(x, y) > GRanges with 1 range and 0 elementMetadata values: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr1 [2, 7] + > --- > seqlengths: > chr1 > NA >> setdiff(x, y, ignore.strand = TRUE) > GRanges with 2 ranges and 0 elementMetadata values: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr1 [1, 10] - > [2] chr1 [1, 10] * > --- > seqlengths: > chr1 > NA > > How do I get a result of chr1 [2, 4] + ? > > R version 2.14.0 (2011-10-31), GenomicRanges_1.6.4, IRanges_1.12.5 > > -------------------------------------- > Dario Strbenac > Research Assistant > Cancer Epigenetics > Garvan Institute of Medical Research > Darlinghurst NSW 2010 > Australia > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 13.1 years ago Hervé Pagès 16k

0

Entering edit mode

2012/1/22 Hervé Pagès <hpages@fhcrc.org> > Hi Dario, > > Unfortunately the interpretation of the * level in the strand is a > little bit ambiguous: it can mean (a) "range/feature is on both > strands" but it can also mean (b) "strand is unknown" or "strand > is irrelevant". We used to allow NAs in the strand factor in the > early days of the GenomicRanges package to represent (b) so we > were able to distinguish between (a) and (b). But this didn't prove > to be useful enough to justify the "cost" so we decided to not > allow NAs in the strand anymore. > > Another somewhat arbitrary design decision was that all the "set > operations" would treat * as a separate (virtual) strand: > > > library(GenomicRanges) > > x <- GRanges("chr1", IRanges(2, 7), strand="+") > > y <- GRanges("chr1", IRanges(5, 10), strand="*") > > > union(x, y) > > GRanges with 2 ranges and 0 elementMetadata values: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr1 [2, 7] + > [2] chr1 [5, 10] * > --- > seqlengths: > chr1 > NA > > > intersect(x, y) > > GRanges with 0 ranges and 0 elementMetadata values: > seqnames ranges strand > <rle> <iranges> <rle> > --- > seqlengths: > chr1 > NA > > > setdiff(x, y) > GRanges with 1 range and 0 elementMetadata values: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr1 [2, 7] + > --- > seqlengths: > chr1 > NA > > So, by default (i.e. with ignore.strand=FALSE), setdiff() is just being > consistent with the other set operations. > > Not necessarily the most natural/intuitive behavior but it has the big > advantage to be something that all GenomicRanges methods can agree on, > and to be easy to describe and simple to implement. > > To get what you want, you'll need to do setdiff() twice: > > > strand(y) <- "+" > > z <- setdiff(x, y) > > strand(y) <- "-" > > z <- setdiff(z, y) > > > z > GRanges with 1 range and 0 elementMetadata values: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr1 [2, 4] + > --- > seqlengths: > chr1 > NA > > As for the weird behavior you get with ignore.strand=TRUE, I agree > it doesn't make much sense and it's not totally clear to me what the > correct behavior should be. Sadly the man page doesn't say anything > about what the intended behavior is (there is only a brief note about > what ignore.strand=TRUE does for parallel set operations), gives no > example, and this case is not covered by the unit tests either. > > However, by looking at what union() and gaps() do with > ignore.strand=TRUE, and also at the internals of the GenomicRanges > package, it seems to me that the intention was that: > > union(x, y, ignore.strand=TRUE) > intersect(x, y, ignore.strand=TRUE) > setdiff(x, y, ignore.strand=TRUE) > > would be equivalent to: > > union(`strand<-`(x, "+"), `strand<-`(y, "+")) > intersect(`strand<-`(x, "+"), `strand<-`(y, "+")) > setdiff(`strand<-`(x, "+"), `strand<-`(y, "+")) > > So I could fix setdiff() and intersect() to do this (intersect() has > a weird behavior too), and update the documentation. Does that sound > reasonable? > > This would be reasonable to me. > Finally I'd like to mention that I see very little value in supporting > the 'ignore.strand' arg in those 3 methods so another option would be > to just get rid of it. > > I think there is value here. Often, one wants to ignore the strand in these operations (say between reads and transcripts) without destroying the strand information. Creating temporary objects is a nuisance. > Cheers, > H. > > > > On 01/15/2012 10:00 PM, Dario Strbenac wrote: > >> Hi, >> >> setdiff doesn't work the way I thought it would when one GRanges has >> strands and the other does not. I thought that features stranded as * would >> be taken as being on both strands. The output is even stranger if I use >> ignore.strand = TRUE. >> >> x >>> >> GRanges with 1 range and 0 elementMetadata values: >> seqnames ranges strand >> <rle> <iranges> <rle> >> [1] chr1 [2, 7] + >> --- >> seqlengths: >> chr1 >> NA >> >>> y >>> >> GRanges with 1 range and 0 elementMetadata values: >> seqnames ranges strand >> <rle> <iranges> <rle> >> [1] chr1 [5, 10] * >> --- >> seqlengths: >> chr1 >> NA >> >>> setdiff(x, y) >>> >> GRanges with 1 range and 0 elementMetadata values: >> seqnames ranges strand >> <rle> <iranges> <rle> >> [1] chr1 [2, 7] + >> --- >> seqlengths: >> chr1 >> NA >> >>> setdiff(x, y, ignore.strand = TRUE) >>> >> GRanges with 2 ranges and 0 elementMetadata values: >> seqnames ranges strand >> <rle> <iranges> <rle> >> [1] chr1 [1, 10] - >> [2] chr1 [1, 10] * >> --- >> seqlengths: >> chr1 >> NA >> >> How do I get a result of chr1 [2, 4] + ? >> >> R version 2.14.0 (2011-10-31), GenomicRanges_1.6.4, IRanges_1.12.5 >> >> ------------------------------**-------- >> Dario Strbenac >> Research Assistant >> Cancer Epigenetics >> Garvan Institute of Medical Research >> Darlinghurst NSW 2010 >> Australia >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > [[alternative HTML version deleted]]

ADD REPLY • link 13.1 years ago Michael Lawrence ★ 11k

0

Entering edit mode

> > As for the weird behavior you get with ignore.strand=TRUE, I agree > > it doesn't make much sense and it's not totally clear to me what the > > correct behavior should be. Sadly the man page doesn't say anything > > about what the intended behavior is (there is only a brief note about > > what ignore.strand=TRUE does for parallel set operations), gives no > > example, and this case is not covered by the unit tests either. > > > > However, by looking at what union() and gaps() do with > > ignore.strand=TRUE, and also at the internals of the GenomicRanges > > package, it seems to me that the intention was that: > > > > union(x, y, ignore.strand=TRUE) > > intersect(x, y, ignore.strand=TRUE) > > setdiff(x, y, ignore.strand=TRUE) > > > > would be equivalent to: > > > > union(`strand<-`(x, "+"), `strand<-`(y, "+")) > > intersect(`strand<-`(x, "+"), `strand<-`(y, "+")) > > setdiff(`strand<-`(x, "+"), `strand<-`(y, "+")) > > > > So I could fix setdiff() and intersect() to do this (intersect() has > > a weird behavior too), and update the documentation. Does that sound > > reasonable? > > > > > This would be reasonable to me. I think that retaining `ignore.strand` with the proposed semantics of (potentially) altering the strand is inviting further confusion. For example, it would introduce the possibility that the union of x with itself is not all.equal to itself only in the case when the strand of x is not '+'. I think this is a markedly poor outcome since `ignore.strand` is clearly NOT what the operation would be doing. > > Finally I'd like to mention that I see very little value in supporting > > the 'ignore.strand' arg in those 3 methods so another option would be > > to just get rid of it. > > > > > I think there is value here. Often, one wants to ignore the strand in these > operations (say between reads and transcripts) without destroying the > strand information. Creating temporary objects is a nuisance. Perhaps a compromise then..... Since the implementation appears to have been undocumented and the intention unclear, how about replace `ignore.strand` option with a `with.strand=c('+','-','*')` option which sets the GRanges strand prior to performing the operation.

ADD REPLY • link 13.1 years ago Malcolm Cook ★ 1.6k

0

Entering edit mode

On Mon, Jan 23, 2012 at 7:12 AM, Cook, Malcolm <mec@stowers.org> wrote: > > > As for the weird behavior you get with ignore.strand=TRUE, I agree > > > it doesn't make much sense and it's not totally clear to me what the > > > correct behavior should be. Sadly the man page doesn't say anything > > > about what the intended behavior is (there is only a brief note about > > > what ignore.strand=TRUE does for parallel set operations), gives no > > > example, and this case is not covered by the unit tests either. > > > > > > However, by looking at what union() and gaps() do with > > > ignore.strand=TRUE, and also at the internals of the GenomicRanges > > > package, it seems to me that the intention was that: > > > > > > union(x, y, ignore.strand=TRUE) > > > intersect(x, y, ignore.strand=TRUE) > > > setdiff(x, y, ignore.strand=TRUE) > > > > > > would be equivalent to: > > > > > > union(`strand<-`(x, "+"), `strand<-`(y, "+")) > > > intersect(`strand<-`(x, "+"), `strand<-`(y, "+")) > > > setdiff(`strand<-`(x, "+"), `strand<-`(y, "+")) > > > > > > So I could fix setdiff() and intersect() to do this (intersect() has > > > a weird behavior too), and update the documentation. Does that sound > > > reasonable? > > > > > > > > This would be reasonable to me. > > I think that retaining `ignore.strand` with the proposed semantics of > (potentially) altering the strand is inviting further confusion. > > For example, it would introduce the possibility that the union of x with > itself is not all.equal to itself only in the case when the strand of x is > not '+'. > > I think this is a markedly poor outcome since `ignore.strand` is clearly > NOT what the operation would be doing. > > I would expect that a set operation ignoring strand would generate ranges with "*" as the strand. There's no way to reliably preserve the strand. So I don't see how the original strand would matter. > > > Finally I'd like to mention that I see very little value in supporting > > > the 'ignore.strand' arg in those 3 methods so another option would be > > > to just get rid of it. > > > > > > > > I think there is value here. Often, one wants to ignore the strand in > these > > operations (say between reads and transcripts) without destroying the > > strand information. Creating temporary objects is a nuisance. > > Perhaps a compromise then..... > > Since the implementation appears to have been undocumented and the > intention unclear, how about replace `ignore.strand` option with a > `with.strand=c('+','-','*')` option which sets the GRanges strand prior to > performing the operation. > > > I would prefer keeping consistent with the rest of the API and having ignore.strand simply be the equivalent of with.strand="*". I acknowledge that the meaning of "*" is ambiguous, but it's easily preferred over "+" and "-". [[alternative HTML version deleted]]

ADD REPLY • link 13.1 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi Michael, Malcolm, On 01/23/2012 11:14 AM, Michael Lawrence wrote: > > > On Mon, Jan 23, 2012 at 7:12 AM, Cook, Malcolm <mec at="" stowers.org=""> <mailto:mec at="" stowers.org="">> wrote: > > > > As for the weird behavior you get with ignore.strand=TRUE, I agree > > > it doesn't make much sense and it's not totally clear to me > what the > > > correct behavior should be. Sadly the man page doesn't say anything > > > about what the intended behavior is (there is only a brief note > about > > > what ignore.strand=TRUE does for parallel set operations), gives no > > > example, and this case is not covered by the unit tests either. > > > > > > However, by looking at what union() and gaps() do with > > > ignore.strand=TRUE, and also at the internals of the GenomicRanges > > > package, it seems to me that the intention was that: > > > > > > union(x, y, ignore.strand=TRUE) > > > intersect(x, y, ignore.strand=TRUE) > > > setdiff(x, y, ignore.strand=TRUE) > > > > > > would be equivalent to: > > > > > > union(`strand<-`(x, "+"), `strand<-`(y, "+")) > > > intersect(`strand<-`(x, "+"), `strand<-`(y, "+")) > > > setdiff(`strand<-`(x, "+"), `strand<-`(y, "+")) > > > > > > So I could fix setdiff() and intersect() to do this > (intersect() has > > > a weird behavior too), and update the documentation. Does that > sound > > > reasonable? > > > > > > > > This would be reasonable to me. > > I think that retaining `ignore.strand` with the proposed semantics > of (potentially) altering the strand is inviting further confusion. > > For example, it would introduce the possibility that the union of x > with itself is not all.equal to itself only in the case when the > strand of x is not '+'. > > I think this is a markedly poor outcome since `ignore.strand` is > clearly NOT what the operation would be doing. > > > I would expect that a set operation ignoring strand would generate > ranges with "*" as the strand. There's no way to reliably preserve the > strand. So I don't see how the original strand would matter. I think that Malcom's point is that it's an unpleasant surprise that some very fundamental and intuitive mathematical properties are not honored, e.g. the union (or intersection) of x with itself is not itself. I can live with that one though: this happens only if you use ignore.strand=TRUE, which is not the default. And we still have the property that the union (or intersection) of x with itself is itself *modulo* the strand. Probably good enough. Now whether union(x, y, ignore.strand=TRUE) intersect(x, y, ignore.strand=TRUE) setdiff(x, y, ignore.strand=TRUE) should be equivalent to union(`strand<-`(x, "+"), `strand<-`(y, "+")) intersect(`strand<-`(x, "+"), `strand<-`(y, "+")) setdiff(`strand<-`(x, "+"), `strand<-`(y, "+")) or to union(`strand<-`(x, "*"), `strand<-`(y, "*")) intersect(`strand<-`(x, "*"), `strand<-`(y, "*")) setdiff(`strand<-`(x, "*"), `strand<-`(y, "*")) is just a cosmetic question but I agree that the output of the latter will look prettier (and might be slightly less confusing). So I can do this. Does anybody object? > > > > Finally I'd like to mention that I see very little value in > supporting > > > the 'ignore.strand' arg in those 3 methods so another option > would be > > > to just get rid of it. > > > > > > > > I think there is value here. Often, one wants to ignore the > strand in these > > operations (say between reads and transcripts) without destroying the > > strand information. Creating temporary objects is a nuisance. > > Perhaps a compromise then..... > > Since the implementation appears to have been undocumented and the > intention unclear, how about replace `ignore.strand` option with a > `with.strand=c('+','-','*')` option which sets the GRanges strand > prior to performing the operation. > > > > I would prefer keeping consistent with the rest of the API and having > ignore.strand simply be the equivalent of with.strand="*". I > acknowledge that the meaning of "*" is ambiguous, but it's easily > preferred over "+" and "-". > Yes a lot of methods already have the 'ignore.strand' arg so if we're going to keep that feature for union(), intersect() and setdiff(), I'd rather stick to that API too. Cheers, H.

ADD REPLY • link 13.1 years ago Hervé Pagès 16k

0

Entering edit mode

I'm sold. Excellent compromise. Thanks for considering my opinion..... Cheers! ~Malcolm > -----Original Message----- > From: Hervé Pagès [mailto:hpages at fhcrc.org] > Sent: Monday, January 23, 2012 2:33 PM > To: Michael Lawrence > Cc: Cook, Malcolm; D.Strbenac at garvan.org.au; bioconductor at r-project.org > Subject: Re: [BioC] GRanges setdiff With Features on Both Strands > > Hi Michael, Malcolm, > > On 01/23/2012 11:14 AM, Michael Lawrence wrote: > > > > > > On Mon, Jan 23, 2012 at 7:12 AM, Cook, Malcolm <mec at="" stowers.org=""> > <mailto:mec at="" stowers.org="">> wrote: > > > > > > As for the weird behavior you get with ignore.strand=TRUE, I agree > > > > it doesn't make much sense and it's not totally clear to me > > what the > > > > correct behavior should be. Sadly the man page doesn't say anything > > > > about what the intended behavior is (there is only a brief note > > about > > > > what ignore.strand=TRUE does for parallel set operations), gives no > > > > example, and this case is not covered by the unit tests either. > > > > > > > > However, by looking at what union() and gaps() do with > > > > ignore.strand=TRUE, and also at the internals of the GenomicRanges > > > > package, it seems to me that the intention was that: > > > > > > > > union(x, y, ignore.strand=TRUE) > > > > intersect(x, y, ignore.strand=TRUE) > > > > setdiff(x, y, ignore.strand=TRUE) > > > > > > > > would be equivalent to: > > > > > > > > union(`strand<-`(x, "+"), `strand<-`(y, "+")) > > > > intersect(`strand<-`(x, "+"), `strand<-`(y, "+")) > > > > setdiff(`strand<-`(x, "+"), `strand<-`(y, "+")) > > > > > > > > So I could fix setdiff() and intersect() to do this > > (intersect() has > > > > a weird behavior too), and update the documentation. Does that > > sound > > > > reasonable? > > > > > > > > > > > This would be reasonable to me. > > > > I think that retaining `ignore.strand` with the proposed semantics > > of (potentially) altering the strand is inviting further confusion. > > > > For example, it would introduce the possibility that the union of x > > with itself is not all.equal to itself only in the case when the > > strand of x is not '+'. > > > > I think this is a markedly poor outcome since `ignore.strand` is > > clearly NOT what the operation would be doing. > > > > > > I would expect that a set operation ignoring strand would generate > > ranges with "*" as the strand. There's no way to reliably preserve the > > strand. So I don't see how the original strand would matter. > > I think that Malcom's point is that it's an unpleasant surprise that > some very fundamental and intuitive mathematical properties are not > honored, e.g. the union (or intersection) of x with itself is not > itself. I can live with that one though: this happens only if you > use ignore.strand=TRUE, which is not the default. And we still have > the property that the union (or intersection) of x with itself is > itself *modulo* the strand. Probably good enough. > > Now whether > > union(x, y, ignore.strand=TRUE) > intersect(x, y, ignore.strand=TRUE) > setdiff(x, y, ignore.strand=TRUE) > > should be equivalent to > > union(`strand<-`(x, "+"), `strand<-`(y, "+")) > intersect(`strand<-`(x, "+"), `strand<-`(y, "+")) > setdiff(`strand<-`(x, "+"), `strand<-`(y, "+")) > > or to > > union(`strand<-`(x, "*"), `strand<-`(y, "*")) > intersect(`strand<-`(x, "*"), `strand<-`(y, "*")) > setdiff(`strand<-`(x, "*"), `strand<-`(y, "*")) > > is just a cosmetic question but I agree that the output of the latter > will look prettier (and might be slightly less confusing). So I can do > this. Does anybody object? > > > > > > > Finally I'd like to mention that I see very little value in > > supporting > > > > the 'ignore.strand' arg in those 3 methods so another option > > would be > > > > to just get rid of it. > > > > > > > > > > > I think there is value here. Often, one wants to ignore the > > strand in these > > > operations (say between reads and transcripts) without destroying the > > > strand information. Creating temporary objects is a nuisance. > > > > Perhaps a compromise then..... > > > > Since the implementation appears to have been undocumented and the > > intention unclear, how about replace `ignore.strand` option with a > > `with.strand=c('+','-','*')` option which sets the GRanges strand > > prior to performing the operation. > > > > > > > > I would prefer keeping consistent with the rest of the API and having > > ignore.strand simply be the equivalent of with.strand="*". I > > acknowledge that the meaning of "*" is ambiguous, but it's easily > > preferred over "+" and "-". > > > > Yes a lot of methods already have the 'ignore.strand' arg so if we're > going to keep that feature for union(), intersect() and setdiff(), I'd > rather stick to that API too. > > Cheers, > H.

ADD REPLY • link 13.1 years ago Malcolm Cook ★ 1.6k

0

Entering edit mode

>From a biological point of view, I would say the first form. This would happen in a situation like when you have have stranded reads and the other GRanges is a set of repeats locations, which are unstranded blocks in the genome. But the second form is mathematically sound. So, that would be the best form to choose.

ADD REPLY • link 13.1 years ago Dario Strbenac ★ 1.5k

0

Entering edit mode

OK, thanks everybody for their feedback. This is done in GenomicRanges 1.6.6 (release) and 1.7.16 (devel). Summary: when called with 'ignore.strand=TRUE', the "range", "disjoin", "reduce", "union", "intersect" and "setdiff" methods for GRanges or GenomicRanges objects now behave like if they set the strand of the input to "*" before they do any computation. Cheers, H. On 01/23/2012 03:00 PM, Dario Strbenac wrote: >> From a biological point of view, I would say the first form. This would happen in a situation like when you have have stranded reads and the other GRanges is a set of repeats locations, which are unstranded blocks in the genome. But the second form is mathematically sound. So, that would be the best form to choose. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.1 years ago Hervé Pagès 16k

0

Entering edit mode

Hi, 2012/1/23 Hervé Pagès <hpages at="" fhcrc.org="">: [snip] > ?union(`strand<-`(x, "*"), `strand<-`(y, "*")) > ?intersect(`strand<-`(x, "*"), `strand<-`(y, "*")) > ?setdiff(`strand<-`(x, "*"), `strand<-`(y, "*")) > > is just a cosmetic question but I agree that the output of the latter > will look prettier (and might be slightly less confusing). So I can do > this. Does anybody object? Not sure who you're inviting to the tally, but I'm +1 on doing this latter way when ignore.strand=TRUE, ie. setting strand to "*" then doing the set algebra as is suggested in the quoted piece above ... -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 13.1 years ago Steve Lianoglou ★ 13k

Login before adding your answer.