GRanges apply functions
@lescai-francesco-5078
Denmark
Hi guys,
I’ve seen this issue addressed previously, but I couldn’t tell whether it has been implemented in some way.

I’d like to go through a GRanges object by row, or interval (let’s say variants, or genes), and perform a function (e.g. to annotate with additional metadata).
I can do that with

    for (i in 1:length(variants)){
      #do something with variants[i,] data
    }

but it’s quite slow.
As someone else asked in the past, something like
    apply(variants, 1, myFunction) or
    lapply(variants, myFunction)
would be great.
Is there something like grapply? :)

Any advice?

thanks,
Francesco
@michael-lawrence-3846
United States
lapply(gr, FUN) should work, but it will be slow, because it constructs a new GRanges each time. This could in theory be optimized at some low level, but it's generally best to avoid this type of iteration. Maybe you could share your specific problem and we could help with this.

Michael
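To make the cost of element-wise iteration concrete, here is a minimal sketch on made-up data. The object `variants` and the column name `is_chr1` are assumptions for illustration only; the point is just that an accessor called once on the whole object gives the same answer as calling it on 1000 freshly built length-1 GRanges.

```r
library(GenomicRanges)

# Toy data (hypothetical): 1000 single-base "variants" on two chromosomes.
set.seed(1)
variants <- GRanges(
  seqnames = sample(c("chr1", "chr2"), 1000, replace = TRUE),
  ranges   = IRanges(start = sample.int(1e6, 1000), width = 1)
)

# Element-wise: every variants[i] builds a new length-1 GRanges object.
widths_loop <- vapply(seq_along(variants),
                      function(i) width(variants[i]),
                      integer(1))

# Vectorized: one accessor call over the whole object.
widths_vec <- width(variants)

# Metadata columns can be filled in a single step as well.
mcols(variants)$is_chr1 <- as.character(seqnames(variants)) == "chr1"
```

Wrapping each version in system.time() should show the gap on your own data; the exact ratio will depend on the object and the package versions.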
hi Francesco,

i think that instead of going through the variants one at a time, annotating each with everything you need, and trying to parallelize that iteration, it will be more efficient to annotate one kind of information at a time over all variants, vector-wise. if this vector-wise operation is too big (dealing with thousands, or hundreds of thousands, of variants), then parallelize that vector-wise annotation by splitting the variants by chromosome, or via BiocParallel::bpvec().

this is what i try to do in the VariantFiltering package, although i still have to exploit parallelism for a number of annotations, which is on my TODO list.

cheers,
robert.
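As a rough illustration of annotating one kind of information over all variants at once, the sketch below fills a gene_id column with a single overlap query. The toy `variants` and `genes` objects and the helper name `annotate_gene` are hypothetical, not taken from VariantFiltering; real inputs would come from a VCF and an annotation package.

```r
library(GenomicRanges)

# Hypothetical toy inputs.
variants <- GRanges(c("chr1", "chr1", "chr2"),
                    IRanges(start = c(150, 900, 50), width = 1))
genes <- GRanges(c("chr1", "chr2"),
                 IRanges(start = c(100, 1), end = c(200, 100)),
                 gene_id = c("geneA", "geneB"))

# Vectorized helper: returns one gene_id (or NA) per variant.
# If a variant overlaps several genes, the last hit wins in this sketch.
annotate_gene <- function(v, g) {
  hits <- findOverlaps(v, g)
  out <- rep(NA_character_, length(v))
  out[queryHits(hits)] <- g$gene_id[subjectHits(hits)]
  out
}

# One pass over all variants, no per-variant loop.
mcols(variants)$gene_id <- annotate_gene(variants, genes)
```

Because `annotate_gene` returns exactly one value per input range, it has the shape that chunk-wise helpers such as BiocParallel::bpvec() expect, so it could probably be handed to one of them if a single pass turns out to be too large.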
Tim Triche ★ 4.2k
@tim-triche-3561
United States
Is there some way to use a reference class to iterate over a GRanges-like structure without actually copying it (or at least not copying it more than once)? I do stupid things like this on a fairly regular basis. Come to think of it, computing overlaps of various types could be optimized like this, it seems. I will have to monkey around with this and see how bad of an idea it is.
Michael Lawrence wrote:

It's all the overhead in constructing the object that hurts, of which the copying (of small vectors) is only a small piece. I assume you mean layering some sort of "view" on the GRanges that represents a subset, without actually forming the new object (unless there is an attempt to write to it). There's no need for a reference class to implement that, but the overhead of the view might end up being just as bad, depending on how it is implemented. And such loops would still be much slower than the vectorized alternative.
Tim Triche ★ 4.2k
@tim-triche-3561
United States
Ah, what I usually do is split(GR) and then lapply(split.GR, some.function), which is what I was thinking about. It's probably better for me to use BiocParallel in this situation, although if I didn't HAVE to use it for such a thing -- if I could just point to the pieces and walk over them -- that was where I thought a reference might help.

Thanks,

--t
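For readers who have not seen the split-then-lapply idiom, a small sketch follows. The grouping by chromosome and the placeholder `some_function` are assumptions chosen for illustration; the thread does not say what the real grouping or function was.

```r
library(GenomicRanges)

# Hypothetical toy ranges on two chromosomes.
gr <- GRanges(c("chr1", "chr2", "chr1", "chr2"),
              IRanges(start = c(1, 10, 50, 30), width = 5))

# split() on a GRanges returns a GRangesList, here one element per chromosome.
by_chr <- split(gr, seqnames(gr))

# Placeholder per-group function: count the ranges in each piece.
some_function <- function(piece) length(piece)

# Explicit iteration over the pieces.
counts <- lapply(by_chr, some_function)

# The same numbers without iterating: elementNROWS() works on the whole list.
counts_vec <- elementNROWS(by_chr)
```

For a toy function like this the list-level accessor does the same job; the lapply form only earns its keep when no list-aware method exists for the operation.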
Dear All,

I also frequently use the split-apply idiom on GRanges. A simple example is to 'reduce' exons on a per-gene_id basis (this can easily take ~0.5h for the gencode GTF). Sometimes I use bplapply, but it is still quite slow; it would be great if this could be done faster.

Yours,
Marcin
That particular use case is easy and fast, because reduce() works on GRangesList.

    reduced_exons_by_gene <- reduce(exonsBy(txdb, "gene"))

In general, many of the high-level Lists for these data structures have efficient underlying representations, and there are methods that are smart enough to take advantage of them. Whenever thinking about resorting to explicit iteration, first ask for help, since it's likely someone has already come across the use case and optimized it.

Michael
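If the exons come from a GTF rather than a TxDb, the same idea can be sketched by splitting on a gene_id column and reducing the resulting GRangesList in one call. The toy `exons` object below is an assumption standing in for an imported annotation file.

```r
library(GenomicRanges)

# Hypothetical exons with a gene_id column, standing in for an imported GTF.
exons <- GRanges("chr1",
                 IRanges(start = c(1, 5, 20, 100, 105),
                         end   = c(10, 15, 30, 110, 120)),
                 gene_id = c("g1", "g1", "g1", "g2", "g2"))

# Group exons by gene: the result is a GRangesList.
exons_by_gene <- split(exons, exons$gene_id)

# List-aware method: a single call merges overlapping exons within each gene.
reduced_vec <- reduce(exons_by_gene)

# Explicit iteration, shown only for comparison.
reduced_loop <- lapply(exons_by_gene, reduce)
```

Both versions should give the same ranges per gene; the difference is that the first stays inside the list-aware method instead of calling reduce() once per gene from R.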