Entering edit mode
Hi all,
Sorry to wake up this sleeping beauty.
Just something more to add and consider, there is another
inconsistency
in the gaps function for GRanges, when there is no seqinfo present.
The
default gaps will not produce extra ranges for those unused strands,
unless one specify the "end" argument, however only setting "start"
won't trigger the extra results. This is the current behavior of both
release and dev branch,
## rls
GenomicRanges_1.13.36 IRanges_1.19.24
## dev
GenomicRanges_1.12.4 IRanges_1.18.3
## case without seqinfo
g<-GRanges(seqnames="1", IRanges(3, 10), strand='*')
seqinfo=Seqinfo("1",seqlengths=1000))
g
GRanges with 1 range and 0 metadata columns:
seqnames ranges strand
<rle> <iranges> <rle>
[1] 1 [3, 10] *
---
seqlengths:
1
NA
## default, no extra
gaps(g) # this is expected, start always defaults to 1L
GRanges with 1 range and 0 metadata columns:
seqnames ranges strand
<rle> <iranges> <rle>
[1] 1 [1, 2] *
---
seqlengths:
1
NA
## with start, no extra
gaps(g, start=2)
GRanges with 1 range and 0 metadata columns:
seqnames ranges strand
<rle> <iranges> <rle>
[1] 1 [2, 2] *
---
seqlengths:
1
NA
## with end, strand comes into play, giving two more
gaps(g, end=20)
GRanges with 4 ranges and 0 metadata columns:
seqnames ranges strand
<rle> <iranges> <rle>
[1] 1 [ 1, 20] +
[2] 1 [ 1, 20] -
[3] 1 [ 1, 2] *
[4] 1 [11, 20] *
---
seqlengths:
1
NA
## with both start and end, same as above
gaps(g, start=2, end=20) #
GRanges with 4 ranges and 0 metadata columns:
seqnames ranges strand
<rle> <iranges> <rle>
[1] 1 [ 2, 20] +
[2] 1 [ 2, 20] -
[3] 1 [ 2, 2] *
[4] 1 [11, 20] *
---
seqlengths:
1
## reset seqinfo
seqinfo(g)<-Seqinfo("1",seqlengths=1000)
## now the extra ranges are back
gaps(g, start=2)
GRanges with 4 ranges and 0 metadata columns:
seqnames ranges strand
<rle> <iranges> <rle>
[1] 1 [ 2, 1000] +
[2] 1 [ 2, 1000] -
[3] 1 [ 2, 2] *
[4] 1 [11, 1000] *
---
seqlengths:
1
1000
Best,
Dan
On Mon, 2013-06-03 at 05:54 -0700, Michael Lawrence wrote:
> Not sure how confusing this would be, but a common case is (like
Alex's
> input) a set of ranges where the strand is uniform, i.e., all ranges
have
> strand 'x'. In that case, gaps,GenomicRanges could behave just like
> gaps,Ranges and return only ranges on strand 'x'. That would satisfy
both
> properties.
>
> A natural extension would be that if all ranges are '+' or '-' (none
of
> them are '*'), then do not add the '*' range spanning the whole
sequence.
> Also satisfies both properties.
>
> I only know of one use-case where all three strand types are mixed:
> rna-seq, where the strand has been inferred only from the spliced
> alignments. Not if gaps() would even be useful in that case, but if
it
> were, I think the current gaps behavior would make sense.
>
> Michael
>
>
>
>
> On Sun, Jun 2, 2013 at 11:36 PM, Herv Pags <hpages at="" fhcrc.org="">
wrote:
>
> > Hi Alex,
> >
> >
> > On 05/31/2013 01:46 AM, Alex Gutteridge wrote:
> >
> >> I just have a quick question/comment about the behaviour of the
gaps
> >> function in the GenomicRanges package, particularly with how it
handles
> >> * strand ranges. The current behaviour is as below:
> >>
> >> range = GRanges(seqnames="1",IRanges(**start=300,end=700),
strand="*",
> >>> seqinfo=Seqinfo("1",**seqlengths=1000))
> >>> range
> >>>
> >> GRanges with 1 range and 0 metadata columns:
> >> seqnames ranges strand
> >> <rle> <iranges> <rle>
> >> [1] 1 [300, 700] *
> >> ---
> >> seqlengths:
> >> 1
> >> 1000
> >>
> >>> gaps(range)
> >>>
> >> GRanges with 4 ranges and 0 metadata columns:
> >> seqnames ranges strand
> >> <rle> <iranges> <rle>
> >> [1] 1 [ 1, 1000] +
> >> [2] 1 [ 1, 1000] -
> >> [3] 1 [ 1, 299] *
> >> [4] 1 [701, 1000] *
> >> ---
> >> seqlengths:
> >> 1
> >> 1000
> >>
> >> My interpretation of "*" as a strand identifier is that it means
the
> >> range covers both + and - strands and so intuitively I was
expecting the
> >> 'gaps' to only cover the 3rd and 4th ranges returned above (not
the
> >> full-length + and - strand ranges).
> >>
> >
> > It's even worse than that. If there is no range at all on a
chromosome,
> > gaps(gr) will return 3 ranges covering the full chromosome:
> >
> > > gr <- GRanges("chr1",
> > IRanges(start=300,end=700),
> > seqlengths=c(chr1=1000,chr2=**1000))
> >
> > > gr
> >
> > GRanges with 1 range and 0 metadata columns:
> > seqnames ranges strand
> > <rle> <iranges> <rle>
> > [1] chr1 [300, 700] *
> > ---
> > seqlengths:
> > chr1 chr2
> > 1000 1000
> >
> > > gaps(gr)
> >
> > GRanges with 7 ranges and 0 metadata columns:
> > seqnames ranges strand
> > <rle> <iranges> <rle>
> > [1] chr1 [ 1, 1000] +
> > [2] chr1 [ 1, 1000] -
> > [3] chr1 [ 1, 299] *
> > [4] chr1 [701, 1000] *
> > [5] chr2 [ 1, 1000] +
> > [6] chr2 [ 1, 1000] -
> > [7] chr2 [ 1, 1000] *
> > ---
> > seqlengths:
> > chr1 chr2
> > 1000 1000
> >
> >
> > The semantics here implies to me
> >> that the * strand is being thought of as a kind of imaginary
third
> >> strand and hence doesn't overlap with the + or - strands. This is
> >> contrary to the semantics of findOverlaps which does appear to
consider
> >> a * strand range to overlap with + or - strand ranges:
> >>
> >> findOverlaps(range,gaps(range)**)
> >>>
> >> Hits of length 2
> >> queryLength: 1
> >> subjectLength: 4
> >> queryHits subjectHits
> >> <integer> <integer>
> >> 1 1 1
> >> 2 1 2
> >>
> >> To summarise I guess I was expecting
findOverlaps(range,gaps(range)**) to
> >> never return any overlaps under any circumstance (that I can
think of!).
> >>
> >
> > Let's call this property 2. This sounds indeed like a good
property
> > that it would be nice to have. But an even more important property
is
> > property 1: gaps() must be its own reverse operation i.e.
> > 'gaps(gaps(gr))' must always return 'gr'.
> >
> > The current behavior of gaps() guarantees property 1. I'm not
against
> > changing gaps() behavior to also have property 2, as long as
property
> > 1 is preserved. Suggestions are welcome.
> >
> > Thanks,
> > H.
> >
> >
> > Do people agree the behaviour of gaps() is not quite intuitive in
this
> >> case?
> >>
> >> sessionInfo()
> >>>
> >> R version 2.15.2 (2012-10-26)
> >> Platform: x86_64-unknown-linux-gnu (64-bit)
> >>
> >> locale:
> >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> >> [7] LC_PAPER=C LC_NAME=C
> >> [9] LC_ADDRESS=C LC_TELEPHONE=C
> >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >>
> >> attached base packages:
> >> [1] stats graphics grDevices utils datasets methods
base
> >>
> >> other attached packages:
> >> [1] GenomicRanges_1.10.7 IRanges_1.16.6 BiocGenerics_0.4.0
> >>
> >> loaded via a namespace (and not attached):
> >> [1] parallel_2.15.2 stats4_2.15.2
> >>
> >>
> > --
> > Herv Pags
> >
> > Program in Computational Biology
> > Division of Public Health Sciences
> > Fred Hutchinson Cancer Research Center
> > 1100 Fairview Ave. N, M1-B514
> > P.O. Box 19024
> > Seattle, WA 98109-1024
> >
> > E-mail: hpages at fhcrc.org
> > Phone: (206) 667-5791
> > Fax: (206) 667-1319
> >
> >
> > ______________________________**_________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.="" ethz.ch="" mailman="" listinfo="" bioconductor="">
> > Search the archives: http://news.gmane.org/gmane.**
> > science.biology.informatics.**conductor<http: news.gmane.org="" gman="" e.science.biology.informatics.conductor="">
> >
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor