A code kata: evening-out *Ranges ends.
1
0
Entering edit mode
@steve-lianoglou-2771
Last seen 23 months ago
United States
Hi, Preamble: This is a low priority question. It's basically something I'm just asking to see how one might code it more cleanly. One of Martin's responses to a question today suggested that the person use IRanges::slice to perform a specific task: http://thread.gmane.org/gmane.science.biology.informatics.conductor/30 550/focus=30554 I was quite pleasantly surprised from that post since as one step in cleaning some of my short-read input data. I haven't stumbled on that before, and I essentially had written up a small piece of code that does the same thing that the IRanges::slice operation does, but using normal R operations on a vector-converted Rle object. Using IRanges::slice is much nicer, and no doubt more efficient (although my implementation wasn't terribly inefficient ... wonderful. That's what's prompting me to ask this question. I have another small piece of code that further cleans my data, but this one IS terribly inefficient (it has a for loop). It runs over a (large) I/GRanges object (representing short reads) and sets a "cluster" of reads to have the same start/end. I'll show you what I mean with example data and code below. As some background, my data is coming from a high-throughput SAGE-like protocol, where I get one very specific read/tag per transcript. As a result of some variability in enzyme digestion (either "naturally occurring" in the cell, or as a result of the protocol itself), the resulting reads are one or two bp's off of what they should be. Here's some fake data: reads <- GRanges(seqnames='chr1', strand='+', ranges=IRanges(start=c(sample(18:22, 10, replace=TRUE), sample(100:108, 10, replace=TRUE)), width=sample(18:20, 20, replace=TRUE))) You can see there are two "islands" of reads there, at around positions 20-40, and 100-120. What I want to do is to "fix" the reads in each island to have the same start/end position. Here's code I have to do that (forget any "strand" issues here -- those are already taken care of): fence.posts <- reduce(reads) o <- findOverlaps(fence.posts, reads) subjects <- subjectHits(o) query <- queryHits(o) for (idx in unique(query)) { adjust <- subjects[query == idx] start(reads[adjust]) <- start(fence.posts[idx]) end(reads[adjust]) <- end(fence.posts[idx]) } It works, but it's slow. So, I'm curious if there is a more clever/idiomatic way to do this using the IRanges infrastructure while at the same time being more efficient. Thanks, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
Infrastructure Cancer IRanges Infrastructure Cancer IRanges • 1.5k views
ADD COMMENT
0
Entering edit mode
@steve-lianoglou-2771
Last seen 23 months ago
United States
In case anyone else is trying to sharpen their IRanges-fu, here's a possible solution to my own question -- works much faster. Instead of: > ?reads <- GRanges(seqnames='chr1', strand='+', > ? ? ? ? ? ? ? ?ranges=IRanges(start=c(sample(18:22, 10, replace=TRUE), > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? sample(100:108, 10, replace=TRUE)), > ? ? ? ? ? ? ? ? ?width=sample(18:20, 20, replace=TRUE))) > > You can see there are two "islands" of reads there, at around > positions 20-40, and 100-120. > What I want to do is to "fix" the reads in each island to have the > same start/end position. > > Here's code I have to do that (forget any "strand" issues here -- > those are already taken care of): > > ?fence.posts <- reduce(reads) > ?o <- findOverlaps(fence.posts, reads) > ?subjects <- subjectHits(o) > ?query <- queryHits(o) > > ?for (idx in unique(query)) { > ? ?adjust <- subjects[query == idx] > ? ?start(reads[adjust]) <- start(fence.posts[idx]) > ? ?end(reads[adjust]) <- end(fence.posts[idx]) > ?} I'm doing: covr <- coverage(reads)[[1]] fences <- slice(covr, 1) f.repeat <- viewMaxs(fences) repacked <- IRanges(start=rep(start(fences), f.repeat), end=rep(end(fences), f.repeat)) if (length(repacked) != length(reads)) { stop("Length of repacked reads is not the same as the original") reads <- GRanges(seqnames=seqnames(reads), ranges=repacked, strand=strand(reads)) I'm assuming that all the reads come from the same chromosome (so I unlist the result from coverage), and I'm only sending in reads of the same strand to this part of the function. Anyway, running this newer version on 1000 tags takes 0.181 seconds, where the older version took 20 seconds. Thought some might find it interesting, otherwise sorry to spam. -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD COMMENT
0
Entering edit mode
Quoting Steve Lianoglou <mailinglist.honeypot at="" gmail.com="">: > In case anyone else is trying to sharpen their IRanges-fu, here's a > possible solution to my own question -- works much faster. > > Instead of: > >> ?reads <- GRanges(seqnames='chr1', strand='+', >> ? ? ? ? ? ? ? ?ranges=IRanges(start=c(sample(18:22, 10, replace=TRUE), >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? sample(100:108, 10, replace=TRUE)), >> ? ? ? ? ? ? ? ? ?width=sample(18:20, 20, replace=TRUE))) >> >> You can see there are two "islands" of reads there, at around >> positions 20-40, and 100-120. >> What I want to do is to "fix" the reads in each island to have the >> same start/end position. >> >> Here's code I have to do that (forget any "strand" issues here -- >> those are already taken care of): >> >> ?fence.posts <- reduce(reads) >> ?o <- findOverlaps(fence.posts, reads) Just ranges(reads)[subjectHits(o)] <- ranges(fence.posts)[queryHits(o)] ? (I would have done findOverlaps(reads, fence.posts), but...) >> subjects <- subjectHits(o) >> query <- queryHits(o) >> >> ?for (idx in unique(query)) { >> ? ?adjust <- subjects[query == idx] >> ? ?start(reads[adjust]) <- start(fence.posts[idx]) >> ? ?end(reads[adjust]) <- end(fence.posts[idx]) >> ?} > > I'm doing: > > covr <- coverage(reads)[[1]] > fences <- slice(covr, 1) or s = covr > 1 breaks = cumsum(runLength(s))[!runValue(s)] post = cut(start(reads), c(breaks, Inf), labels=FALSE) ranges(reads) = ranges(fence.posts)[post] ? but the method above keeps track of sequences... Martin > f.repeat <- viewMaxs(fences) > repacked <- IRanges(start=rep(start(fences), f.repeat), > end=rep(end(fences), f.repeat)) > if (length(repacked) != length(reads)) { > stop("Length of repacked reads is not the same as the original") > reads <- GRanges(seqnames=seqnames(reads), ranges=repacked, > strand=strand(reads)) > > I'm assuming that all the reads come from the same chromosome (so I > unlist the result from coverage), and I'm only sending in reads of the > same strand to this part of the function. > > Anyway, running this newer version on 1000 tags takes 0.181 seconds, > where the older version took 20 seconds. > > Thought some might find it interesting, otherwise sorry to spam. > > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > ?| Memorial Sloan-Kettering Cancer Center > ?| Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLY
0
Entering edit mode
Steve, Riffing off of Martin's post, I would use the match function to do the mapping: fence.posts <- reduce(reads) ranges(reads) <- ranges(fence.posts)[match(reads, fence.posts)] It is not as fast as the Rle/cut based method by Martin, but it is more readable and, thus, easier to maintain. Cheers, Patrick On 8/24/10 9:17 PM, mtmorgan at fhcrc.org wrote: > Quoting Steve Lianoglou <mailinglist.honeypot at="" gmail.com="">: > >> In case anyone else is trying to sharpen their IRanges-fu, here's a >> possible solution to my own question -- works much faster. >> >> Instead of: >> >>> reads <- GRanges(seqnames='chr1', strand='+', >>> ranges=IRanges(start=c(sample(18:22, 10, replace=TRUE), >>> sample(100:108, 10, replace=TRUE)), >>> width=sample(18:20, 20, replace=TRUE))) >>> >>> You can see there are two "islands" of reads there, at around >>> positions 20-40, and 100-120. >>> What I want to do is to "fix" the reads in each island to have the >>> same start/end position. >>> >>> Here's code I have to do that (forget any "strand" issues here -- >>> those are already taken care of): >>> >>> fence.posts <- reduce(reads) >>> o <- findOverlaps(fence.posts, reads) > > Just > > ranges(reads)[subjectHits(o)] <- ranges(fence.posts)[queryHits(o)] > > ? (I would have done findOverlaps(reads, fence.posts), but...) > >>> subjects <- subjectHits(o) >>> query <- queryHits(o) >>> >>> for (idx in unique(query)) { >>> adjust <- subjects[query == idx] >>> start(reads[adjust]) <- start(fence.posts[idx]) >>> end(reads[adjust]) <- end(fence.posts[idx]) >>> } >> >> I'm doing: >> >> covr <- coverage(reads)[[1]] >> fences <- slice(covr, 1) > > or > > s = covr > 1 > breaks = cumsum(runLength(s))[!runValue(s)] > post = cut(start(reads), c(breaks, Inf), labels=FALSE) > ranges(reads) = ranges(fence.posts)[post] > > ? but the method above keeps track of sequences... > > Martin > >> f.repeat <- viewMaxs(fences) >> repacked <- IRanges(start=rep(start(fences), f.repeat), >> end=rep(end(fences), f.repeat)) >> if (length(repacked) != length(reads)) { >> stop("Length of repacked reads is not the same as the original") >> reads <- GRanges(seqnames=seqnames(reads), ranges=repacked, >> strand=strand(reads)) >> >> I'm assuming that all the reads come from the same chromosome (so I >> unlist the result from coverage), and I'm only sending in reads of the >> same strand to this part of the function. >> >> Anyway, running this newer version on 1000 tags takes 0.181 seconds, >> where the older version took 20 seconds. >> >> Thought some might find it interesting, otherwise sorry to spam. >> >> -steve >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> | Memorial Sloan-Kettering Cancer Center >> | Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
On Wed, Aug 25, 2010 at 12:16 AM, Patrick Aboyoun <paboyoun@fhcrc.org>wrote: > Steve, > Riffing off of Martin's post, I would use the match function to do the > mapping: > > fence.posts <- reduce(reads) > ranges(reads) <- ranges(fence.posts)[match(reads, fence.posts)] > > This is the first one that came to mind and is the cleanest, but a cleaner version of the cut method using base R is findInterval: ranges(reads) <- ranges(fence.posts)[findInterval(start(reads), start(fence.posts))] Will be a bit faster than match (which is just findOverlaps(multiple=FALSE)) It is not as fast as the Rle/cut based method by Martin, but it is more > readable and, thus, easier to maintain. > > > Cheers, > Patrick > > > > On 8/24/10 9:17 PM, mtmorgan@fhcrc.org wrote: > >> Quoting Steve Lianoglou <mailinglist.honeypot@gmail.com>: >> >> In case anyone else is trying to sharpen their IRanges-fu, here's a >>> possible solution to my own question -- works much faster. >>> >>> Instead of: >>> >>> reads <- GRanges(seqnames='chr1', strand='+', >>>> ranges=IRanges(start=c(sample(18:22, 10, replace=TRUE), >>>> sample(100:108, 10, replace=TRUE)), >>>> width=sample(18:20, 20, replace=TRUE))) >>>> >>>> You can see there are two "islands" of reads there, at around >>>> positions 20-40, and 100-120. >>>> What I want to do is to "fix" the reads in each island to have the >>>> same start/end position. >>>> >>>> Here's code I have to do that (forget any "strand" issues here -- >>>> those are already taken care of): >>>> >>>> fence.posts <- reduce(reads) >>>> o <- findOverlaps(fence.posts, reads) >>>> >>> >> Just >> >> ranges(reads)[subjectHits(o)] <- ranges(fence.posts)[queryHits(o)] >> >> ? (I would have done findOverlaps(reads, fence.posts), but...) >> >> subjects <- subjectHits(o) >>>> query <- queryHits(o) >>>> >>>> for (idx in unique(query)) { >>>> adjust <- subjects[query == idx] >>>> start(reads[adjust]) <- start(fence.posts[idx]) >>>> end(reads[adjust]) <- end(fence.posts[idx]) >>>> } >>>> >>> >>> I'm doing: >>> >>> covr <- coverage(reads)[[1]] >>> fences <- slice(covr, 1) >>> >> >> or >> >> s = covr > 1 >> breaks = cumsum(runLength(s))[!runValue(s)] >> post = cut(start(reads), c(breaks, Inf), labels=FALSE) >> ranges(reads) = ranges(fence.posts)[post] >> >> ? but the method above keeps track of sequences... >> >> Martin >> >> f.repeat <- viewMaxs(fences) >>> repacked <- IRanges(start=rep(start(fences), f.repeat), >>> end=rep(end(fences), f.repeat)) >>> if (length(repacked) != length(reads)) { >>> stop("Length of repacked reads is not the same as the original") >>> reads <- GRanges(seqnames=seqnames(reads), ranges=repacked, >>> strand=strand(reads)) >>> >>> I'm assuming that all the reads come from the same chromosome (so I >>> unlist the result from coverage), and I'm only sending in reads of the >>> same strand to this part of the function. >>> >>> Anyway, running this newer version on 1000 tags takes 0.181 seconds, >>> where the older version took 20 seconds. >>> >>> Thought some might find it interesting, otherwise sorry to spam. >>> >>> -steve >>> >>> -- >>> Steve Lianoglou >>> Graduate Student: Computational Systems Biology >>> | Memorial Sloan-Kettering Cancer Center >>> | Weill Medical College of Cornell University >>> Contact Info: http://cbio.mskcc.org/~lianos/contact<http: cbio.ms="" kcc.org="" %7elianos="" contact=""> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Hi Martin and Patrick, Thanks for the responses, it's always great to see how the ninjas would slice and dice the problem ... On Wed, Aug 25, 2010 at 3:16 AM, Patrick Aboyoun <paboyoun at="" fhcrc.org=""> ... > It is not as fast as the Rle/cut based method by Martin, but it is more > readable and, thus, easier to maintain. "Maintain" ... this is an interesting word. I'll have to look it up sometime .. :-) Thanks again, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD REPLY

Login before adding your answer.

Traffic: 824 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6