Semantics of GenomicRanges gaps()

0

Entering edit mode

Dan Du ▴ 210

@dan-du-5270

Last seen 13 months ago

Germany

Hi all, Sorry to wake up this sleeping beauty. Just something more to add and consider, there is another inconsistency in the gaps function for GRanges, when there is no seqinfo present. The default gaps will not produce extra ranges for those unused strands, unless one specify the "end" argument, however only setting "start" won't trigger the extra results. This is the current behavior of both release and dev branch, ## rls GenomicRanges_1.13.36 IRanges_1.19.24 ## dev GenomicRanges_1.12.4 IRanges_1.18.3 ## case without seqinfo g<-GRanges(seqnames="1", IRanges(3, 10), strand='*') seqinfo=Seqinfo("1",seqlengths=1000)) g GRanges with 1 range and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] 1 [3, 10] * --- seqlengths: 1 NA ## default, no extra gaps(g) # this is expected, start always defaults to 1L GRanges with 1 range and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] 1 [1, 2] * --- seqlengths: 1 NA ## with start, no extra gaps(g, start=2) GRanges with 1 range and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] 1 [2, 2] * --- seqlengths: 1 NA ## with end, strand comes into play, giving two more gaps(g, end=20) GRanges with 4 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] 1 [ 1, 20] + [2] 1 [ 1, 20] - [3] 1 [ 1, 2] * [4] 1 [11, 20] * --- seqlengths: 1 NA ## with both start and end, same as above gaps(g, start=2, end=20) # GRanges with 4 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] 1 [ 2, 20] + [2] 1 [ 2, 20] - [3] 1 [ 2, 2] * [4] 1 [11, 20] * --- seqlengths: 1 ## reset seqinfo seqinfo(g)<-Seqinfo("1",seqlengths=1000) ## now the extra ranges are back gaps(g, start=2) GRanges with 4 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] 1 [ 2, 1000] + [2] 1 [ 2, 1000] - [3] 1 [ 2, 2] * [4] 1 [11, 1000] * --- seqlengths: 1 1000 Best, Dan On Mon, 2013-06-03 at 05:54 -0700, Michael Lawrence wrote: > Not sure how confusing this would be, but a common case is (like Alex's > input) a set of ranges where the strand is uniform, i.e., all ranges have > strand 'x'. In that case, gaps,GenomicRanges could behave just like > gaps,Ranges and return only ranges on strand 'x'. That would satisfy both > properties. > > A natural extension would be that if all ranges are '+' or '-' (none of > them are '*'), then do not add the '*' range spanning the whole sequence. > Also satisfies both properties. > > I only know of one use-case where all three strand types are mixed: > rna-seq, where the strand has been inferred only from the spliced > alignments. Not if gaps() would even be useful in that case, but if it > were, I think the current gaps behavior would make sense. > > Michael > > > > > On Sun, Jun 2, 2013 at 11:36 PM, Herv Pags <hpages at="" fhcrc.org=""> wrote: > > > Hi Alex, > > > > > > On 05/31/2013 01:46 AM, Alex Gutteridge wrote: > > > >> I just have a quick question/comment about the behaviour of the gaps > >> function in the GenomicRanges package, particularly with how it handles > >> * strand ranges. The current behaviour is as below: > >> > >> range = GRanges(seqnames="1",IRanges(**start=300,end=700), strand="*", > >>> seqinfo=Seqinfo("1",**seqlengths=1000)) > >>> range > >>> > >> GRanges with 1 range and 0 metadata columns: > >> seqnames ranges strand > >> <rle> <iranges> <rle> > >> [1] 1 [300, 700] * > >> --- > >> seqlengths: > >> 1 > >> 1000 > >> > >>> gaps(range) > >>> > >> GRanges with 4 ranges and 0 metadata columns: > >> seqnames ranges strand > >> <rle> <iranges> <rle> > >> [1] 1 [ 1, 1000] + > >> [2] 1 [ 1, 1000] - > >> [3] 1 [ 1, 299] * > >> [4] 1 [701, 1000] * > >> --- > >> seqlengths: > >> 1 > >> 1000 > >> > >> My interpretation of "*" as a strand identifier is that it means the > >> range covers both + and - strands and so intuitively I was expecting the > >> 'gaps' to only cover the 3rd and 4th ranges returned above (not the > >> full-length + and - strand ranges). > >> > > > > It's even worse than that. If there is no range at all on a chromosome, > > gaps(gr) will return 3 ranges covering the full chromosome: > > > > > gr <- GRanges("chr1", > > IRanges(start=300,end=700), > > seqlengths=c(chr1=1000,chr2=**1000)) > > > > > gr > > > > GRanges with 1 range and 0 metadata columns: > > seqnames ranges strand > > <rle> <iranges> <rle> > > [1] chr1 [300, 700] * > > --- > > seqlengths: > > chr1 chr2 > > 1000 1000 > > > > > gaps(gr) > > > > GRanges with 7 ranges and 0 metadata columns: > > seqnames ranges strand > > <rle> <iranges> <rle> > > [1] chr1 [ 1, 1000] + > > [2] chr1 [ 1, 1000] - > > [3] chr1 [ 1, 299] * > > [4] chr1 [701, 1000] * > > [5] chr2 [ 1, 1000] + > > [6] chr2 [ 1, 1000] - > > [7] chr2 [ 1, 1000] * > > --- > > seqlengths: > > chr1 chr2 > > 1000 1000 > > > > > > The semantics here implies to me > >> that the * strand is being thought of as a kind of imaginary third > >> strand and hence doesn't overlap with the + or - strands. This is > >> contrary to the semantics of findOverlaps which does appear to consider > >> a * strand range to overlap with + or - strand ranges: > >> > >> findOverlaps(range,gaps(range)**) > >>> > >> Hits of length 2 > >> queryLength: 1 > >> subjectLength: 4 > >> queryHits subjectHits > >> <integer> <integer> > >> 1 1 1 > >> 2 1 2 > >> > >> To summarise I guess I was expecting findOverlaps(range,gaps(range)**) to > >> never return any overlaps under any circumstance (that I can think of!). > >> > > > > Let's call this property 2. This sounds indeed like a good property > > that it would be nice to have. But an even more important property is > > property 1: gaps() must be its own reverse operation i.e. > > 'gaps(gaps(gr))' must always return 'gr'. > > > > The current behavior of gaps() guarantees property 1. I'm not against > > changing gaps() behavior to also have property 2, as long as property > > 1 is preserved. Suggestions are welcome. > > > > Thanks, > > H. > > > > > > Do people agree the behaviour of gaps() is not quite intuitive in this > >> case? > >> > >> sessionInfo() > >>> > >> R version 2.15.2 (2012-10-26) > >> Platform: x86_64-unknown-linux-gnu (64-bit) > >> > >> locale: > >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > >> [7] LC_PAPER=C LC_NAME=C > >> [9] LC_ADDRESS=C LC_TELEPHONE=C > >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > >> > >> attached base packages: > >> [1] stats graphics grDevices utils datasets methods base > >> > >> other attached packages: > >> [1] GenomicRanges_1.10.7 IRanges_1.16.6 BiocGenerics_0.4.0 > >> > >> loaded via a namespace (and not attached): > >> [1] parallel_2.15.2 stats4_2.15.2 > >> > >> > > -- > > Herv Pags > > > > Program in Computational Biology > > Division of Public Health Sciences > > Fred Hutchinson Cancer Research Center > > 1100 Fairview Ave. N, M1-B514 > > P.O. Box 19024 > > Seattle, WA 98109-1024 > > > > E-mail: hpages at fhcrc.org > > Phone: (206) 667-5791 > > Fax: (206) 667-1319 > > > > > > ______________________________**_________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.="" ethz.ch="" mailman="" listinfo="" bioconductor=""> > > Search the archives: http://news.gmane.org/gmane.** > > science.biology.informatics.**conductor<http: news.gmane.org="" gman="" e.science.biology.informatics.conductor=""> > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

Cancer GenomicRanges trigger Cancer GenomicRanges trigger • 2.4k views

ADD COMMENT • link updated 11.5 years ago by Valerie Obenchain ★ 6.8k • written 11.5 years ago by Dan Du ▴ 210

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 3.1 years ago

United States

Hi Dan, Thanks for catching this. Herve is currently out of town. We'll put this on the TODO and address it when he is back. Valerie On 08/21/2013 12:13 AM, Dan Du wrote: > Hi all, > > Sorry to wake up this sleeping beauty. > > Just something more to add and consider, there is another inconsistency > in the gaps function for GRanges, when there is no seqinfo present. The > default gaps will not produce extra ranges for those unused strands, > unless one specify the "end" argument, however only setting "start" > won't trigger the extra results. This is the current behavior of both > release and dev branch, > > ## rls > GenomicRanges_1.13.36 IRanges_1.19.24 > ## dev > GenomicRanges_1.12.4 IRanges_1.18.3 > > ## case without seqinfo > g<-GRanges(seqnames="1", IRanges(3, 10), strand='*') > seqinfo=Seqinfo("1",seqlengths=1000)) > g > GRanges with 1 range and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > [1] 1 [3, 10] * > --- > seqlengths: > 1 > NA > > ## default, no extra > gaps(g) # this is expected, start always defaults to 1L > GRanges with 1 range and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > [1] 1 [1, 2] * > --- > seqlengths: > 1 > NA > > ## with start, no extra > gaps(g, start=2) > GRanges with 1 range and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > [1] 1 [2, 2] * > --- > seqlengths: > 1 > NA > ## with end, strand comes into play, giving two more > gaps(g, end=20) > GRanges with 4 ranges and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > [1] 1 [ 1, 20] + > [2] 1 [ 1, 20] - > [3] 1 [ 1, 2] * > [4] 1 [11, 20] * > --- > seqlengths: > 1 > NA > > ## with both start and end, same as above > gaps(g, start=2, end=20) # > GRanges with 4 ranges and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > [1] 1 [ 2, 20] + > [2] 1 [ 2, 20] - > [3] 1 [ 2, 2] * > [4] 1 [11, 20] * > --- > seqlengths: > 1 > ## reset seqinfo > seqinfo(g)<-Seqinfo("1",seqlengths=1000) > > ## now the extra ranges are back > gaps(g, start=2) > GRanges with 4 ranges and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > [1] 1 [ 2, 1000] + > [2] 1 [ 2, 1000] - > [3] 1 [ 2, 2] * > [4] 1 [11, 1000] * > --- > seqlengths: > 1 > 1000 > > Best, > Dan > > On Mon, 2013-06-03 at 05:54 -0700, Michael Lawrence wrote: >> Not sure how confusing this would be, but a common case is (like Alex's >> input) a set of ranges where the strand is uniform, i.e., all ranges have >> strand 'x'. In that case, gaps,GenomicRanges could behave just like >> gaps,Ranges and return only ranges on strand 'x'. That would satisfy both >> properties. >> >> A natural extension would be that if all ranges are '+' or '-' (none of >> them are '*'), then do not add the '*' range spanning the whole sequence. >> Also satisfies both properties. >> >> I only know of one use-case where all three strand types are mixed: >> rna-seq, where the strand has been inferred only from the spliced >> alignments. Not if gaps() would even be useful in that case, but if it >> were, I think the current gaps behavior would make sense. >> >> Michael >> >> >> >> >> On Sun, Jun 2, 2013 at 11:36 PM, Herv Pags <hpages at="" fhcrc.org=""> wrote: >> >>> Hi Alex, >>> >>> >>> On 05/31/2013 01:46 AM, Alex Gutteridge wrote: >>> >>>> I just have a quick question/comment about the behaviour of the gaps >>>> function in the GenomicRanges package, particularly with how it handles >>>> * strand ranges. The current behaviour is as below: >>>> >>>> range = GRanges(seqnames="1",IRanges(**start=300,end=700), strand="*", >>>>> seqinfo=Seqinfo("1",**seqlengths=1000)) >>>>> range >>>>> >>>> GRanges with 1 range and 0 metadata columns: >>>> seqnames ranges strand >>>> <rle> <iranges> <rle> >>>> [1] 1 [300, 700] * >>>> --- >>>> seqlengths: >>>> 1 >>>> 1000 >>>> >>>>> gaps(range) >>>>> >>>> GRanges with 4 ranges and 0 metadata columns: >>>> seqnames ranges strand >>>> <rle> <iranges> <rle> >>>> [1] 1 [ 1, 1000] + >>>> [2] 1 [ 1, 1000] - >>>> [3] 1 [ 1, 299] * >>>> [4] 1 [701, 1000] * >>>> --- >>>> seqlengths: >>>> 1 >>>> 1000 >>>> >>>> My interpretation of "*" as a strand identifier is that it means the >>>> range covers both + and - strands and so intuitively I was expecting the >>>> 'gaps' to only cover the 3rd and 4th ranges returned above (not the >>>> full-length + and - strand ranges). >>>> >>> >>> It's even worse than that. If there is no range at all on a chromosome, >>> gaps(gr) will return 3 ranges covering the full chromosome: >>> >>> > gr <- GRanges("chr1", >>> IRanges(start=300,end=700), >>> seqlengths=c(chr1=1000,chr2=**1000)) >>> >>> > gr >>> >>> GRanges with 1 range and 0 metadata columns: >>> seqnames ranges strand >>> <rle> <iranges> <rle> >>> [1] chr1 [300, 700] * >>> --- >>> seqlengths: >>> chr1 chr2 >>> 1000 1000 >>> >>> > gaps(gr) >>> >>> GRanges with 7 ranges and 0 metadata columns: >>> seqnames ranges strand >>> <rle> <iranges> <rle> >>> [1] chr1 [ 1, 1000] + >>> [2] chr1 [ 1, 1000] - >>> [3] chr1 [ 1, 299] * >>> [4] chr1 [701, 1000] * >>> [5] chr2 [ 1, 1000] + >>> [6] chr2 [ 1, 1000] - >>> [7] chr2 [ 1, 1000] * >>> --- >>> seqlengths: >>> chr1 chr2 >>> 1000 1000 >>> >>> >>> The semantics here implies to me >>>> that the * strand is being thought of as a kind of imaginary third >>>> strand and hence doesn't overlap with the + or - strands. This is >>>> contrary to the semantics of findOverlaps which does appear to consider >>>> a * strand range to overlap with + or - strand ranges: >>>> >>>> findOverlaps(range,gaps(range)**) >>>>> >>>> Hits of length 2 >>>> queryLength: 1 >>>> subjectLength: 4 >>>> queryHits subjectHits >>>> <integer> <integer> >>>> 1 1 1 >>>> 2 1 2 >>>> >>>> To summarise I guess I was expecting findOverlaps(range,gaps(range)**) to >>>> never return any overlaps under any circumstance (that I can think of!). >>>> >>> >>> Let's call this property 2. This sounds indeed like a good property >>> that it would be nice to have. But an even more important property is >>> property 1: gaps() must be its own reverse operation i.e. >>> 'gaps(gaps(gr))' must always return 'gr'. >>> >>> The current behavior of gaps() guarantees property 1. I'm not against >>> changing gaps() behavior to also have property 2, as long as property >>> 1 is preserved. Suggestions are welcome. >>> >>> Thanks, >>> H. >>> >>> >>> Do people agree the behaviour of gaps() is not quite intuitive in this >>>> case? >>>> >>>> sessionInfo() >>>>> >>>> R version 2.15.2 (2012-10-26) >>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>> >>>> locale: >>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>> [7] LC_PAPER=C LC_NAME=C >>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>> >>>> attached base packages: >>>> [1] stats graphics grDevices utils datasets methods base >>>> >>>> other attached packages: >>>> [1] GenomicRanges_1.10.7 IRanges_1.16.6 BiocGenerics_0.4.0 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] parallel_2.15.2 stats4_2.15.2 >>>> >>>> >>> -- >>> Herv Pags >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M1-B514 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages at fhcrc.org >>> Phone: (206) 667-5791 >>> Fax: (206) 667-1319 >>> >>> >>> ______________________________**_________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.="" ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> Search the archives: http://news.gmane.org/gmane.** >>> science.biology.informatics.**conductor<http: news.gmane.org="" gman="" e.science.biology.informatics.conductor=""> >>> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD COMMENT • link 11.5 years ago Valerie Obenchain ★ 6.8k

Login before adding your answer.