IRanges: Request for a "step" argument in runsum

0

Entering edit mode

Arnaud Amzallag ▴ 100

@arnaud-amzallag-4471

Last seen 8.1 years ago

Hello Hervé and Michael, On Jun 29, 2011, at 6:29 PM, Michael Lawrence wrote: > > > 2011/6/28 Hervé Pagès <hpages@fhcrc.org> > Hi Michael, Arnaud, > > > On 11-05-25 08:21 AM, Arnaud Amzallag wrote: > Thank you Michael, > > for some email filters reasons I saw your reply only now. > > I recon that in my case that would have been much smoother if sum() would > call viewSums() by default and I agree that "It's more intuitive to think of > an RleViews like an RleList rather than an IntegerList.". I would support > that change. > > I've recently made some changes in that direction in IRanges devel. > This subtle shift in the nature of an RleViews (from IntegerList > to RleList) has some deep consequences on the hierarchy of classes > in IRanges. The most important one is that the Views class doesn't > extend the IRanges class anymore. Views is now a direct subclass > of List. This means that you cannot directly manipulate a Views object > as if it was an IRanges object but you must extract its ranges (with > ranges()) first. > > > This is pretty unfortunate. We lose the ability to e.g. store the coverage inside RangedData (because a ViewsList is a RangesList). Is it really necessary for what was proposed? > I don't really have a point of view on this. Although I am curious, are you keeping coverage and annotation data in the same object, Michael ? Does it have some advantage ? I am simply constructing the Views with the annotation each time I need it. > > This is a work-in-progress and a few things still need to be polished. > > Also I agree that we should just have "min", "max", "sum", "mean" etc > methods for Views objects. No need for viewMins, viewMaxs, viewSums etc > I'll change this. > Thank you Hervé, this is probably a good idea. Thank you all for the good work ! Arnaud > Cheers, > H. > > > Also it is possible that before I was summing the values of the Rle and did > not notice the difference because my Rle was made of a lot of very short Rle > lengths. > > Arnaud > > On Tue, May 10, 2011 at 8:44 AM, Michael Lawrence<lawrence.michael@gene.com> wrote: > > Good to hear that helped. One might expect sum() to simply call viewSums(), > but the semantics are a bit strange here. The reason sum() works on Views is > that a Views is a Ranges and thus an IntegerList (where each range encodes a > sequence of integers). The weird thing is that the elements of a Views are > not the sequence of integers covered but rather the values in the Rle. That > everything works as you expected is just a coincidence of dispatch. > > For usability we should probably have max(), min(), and sum() just use > viewMaxs, viewMins and viewSums. It's more intuitive to think of an RleViews > like an RleList rather than an IntegerList. > > > On Sun, May 8, 2011 at 1:44 PM, Arnaud Amzallag<arnaud.amzallag@gmail.com> wrote: > > Thank you Michael, the function viewSums was exactly what I needed ! > > 0.014 seconds for viewSums(Views(myrle, ir)) vs 54 seconds for > sum(Views(myrle, ir)) on chr22, one sample. I use this now instead of of > runsum, no problem of memory, and probably even faster. for full the genome > on many samples that will surely help. Maybe I should have read a bit more > about the Views. > > About the result of runsum, I did see a lot of memory usage when I split > the process with mclapply. The result is indeed a Rle. After looking closer, > the resuting Rle has much more runs that the original one. That makes sense, > because runsum is a kind of smoothing function, and the resulting signal has > much more levels than the original one. > > Kind regards, > > Arnaud > > On May 6, 2011, at 10:42 PM, Michael Lawrence wrote: > > > > On Fri, May 6, 2011 at 2:54 PM, Arnaud Amzallag< > arnaud.amzallag@gmail.com> wrote: > > Dear IRanges developers, > > runsum is a very fast and convenient function to compute on Rle > coverages, for instance. However when it is run on several chromosomes and > several samples, it can get very memory intensive. For instance on human > chromosome 1, it outputs a vector of length 250 millions, so for several > full genomes it is quickly billions of numbers in memory. > > > I would have expected the result to be an Rle, which would be fairly > memory efficient. > > > However, often you don't need a single base resolution. I wanted to > suggest, if it is possible, to add a parameter by which one could have the > sliding window to slide by a user defined step, rather than always "step=1", > as it is now. Such that runsum(myRle, k=1e4, step = 1000) would return the > equivalent of a wig file, for each 10 kilobases of the genome, without > saturating the memory of the server. > > I tried with sum(Views(myRle, ir)), it is less memory intensive but it is > much slower. So that amelioration would give the best of both worlds, fast > and memory efficient. > > > Have you tried viewSums(Views(myRle, ir))? > > > kind regards, > > Arnaud Amzallag > Research Fellow > Mass general Cancer Center / Harvard Medical school > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > [[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]

Coverage Annotation Cancer PROcess IRanges genomes Coverage Annotation Cancer PROcess • 1.5k views

ADD COMMENT • link updated 13.8 years ago by Hervé Pagès 16k • written 13.8 years ago by Arnaud Amzallag ▴ 100

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 10 hours ago

Seattle, WA, United States

Hi Michael, On 11-06-29 03:29 PM, Michael Lawrence wrote: > > > 2011/6/28 Hervé Pagès <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > > Hi Michael, Arnaud, > > > On 11-05-25 08:21 AM, Arnaud Amzallag wrote: > > Thank you Michael, > > for some email filters reasons I saw your reply only now. > > I recon that in my case that would have been much smoother if > sum() would > call viewSums() by default and I agree that "It's more intuitive > to think of > an RleViews like an RleList rather than an IntegerList.". I > would support > that change. > > > I've recently made some changes in that direction in IRanges devel. > This subtle shift in the nature of an RleViews (from IntegerList > to RleList) has some deep consequences on the hierarchy of classes > in IRanges. The most important one is that the Views class doesn't > extend the IRanges class anymore. Views is now a direct subclass > of List. This means that you cannot directly manipulate a Views object > as if it was an IRanges object but you must extract its ranges (with > ranges()) first. > > > This is pretty unfortunate. We lose the ability to e.g. store the > coverage inside RangedData (because a ViewsList is a RangesList). Is it > really necessary for what was proposed? Because of the inheritance tree. When Views was a subclass of IRanges we had: RleViews -> Views -> IRanges -> IntegerList -> List but [[ on a Views object would not return an integer vector like it does on an IRanges object (and the reason IRanges is a child of IntegerList is precisely because of what [[ does on an IRanges object). So a Views object should not be seen as an IntegerList object, but just as a List object with an elementType that depends on the particular type of Views object (e.g. elementType="Rle" for RleViews). Then [[ on RleViews can return an Rle without contradiction. H. > > > This is a work-in-progress and a few things still need to be polished. > > Also I agree that we should just have "min", "max", "sum", "mean" etc > methods for Views objects. No need for viewMins, viewMaxs, viewSums etc > I'll change this. > > Cheers, > H. > > > Also it is possible that before I was summing the values of the > Rle and did > not notice the difference because my Rle was made of a lot of > very short Rle > lengths. > > Arnaud > > On Tue, May 10, 2011 at 8:44 AM, Michael > Lawrence<lawrence.michael at="" __gene.com=""> <mailto:lawrence.michael at="" gene.com=""> > > wrote: > > > Good to hear that helped. One might expect sum() to simply > call viewSums(), > but the semantics are a bit strange here. The reason sum() > works on Views is > that a Views is a Ranges and thus an IntegerList (where each > range encodes a > sequence of integers). The weird thing is that the elements > of a Views are > not the sequence of integers covered but rather the values > in the Rle. That > everything works as you expected is just a coincidence of > dispatch. > > For usability we should probably have max(), min(), and > sum() just use > viewMaxs, viewMins and viewSums. It's more intuitive to > think of an RleViews > like an RleList rather than an IntegerList. > > > On Sun, May 8, 2011 at 1:44 PM, Arnaud > Amzallag<arnaud.amzallag at="" __gmail.com=""> <mailto:arnaud.amzallag at="" gmail.com=""> > > wrote: > > > Thank you Michael, the function viewSums was exactly > what I needed ! > > 0.014 seconds for viewSums(Views(myrle, ir)) vs 54 > seconds for > sum(Views(myrle, ir)) on chr22, one sample. I use this > now instead of of > runsum, no problem of memory, and probably even faster. > for full the genome > on many samples that will surely help. Maybe I should > have read a bit more > about the Views. > > About the result of runsum, I did see a lot of memory > usage when I split > the process with mclapply. The result is indeed a Rle. > After looking closer, > the resuting Rle has much more runs that the original > one. That makes sense, > because runsum is a kind of smoothing function, and the > resulting signal has > much more levels than the original one. > > Kind regards, > > Arnaud > > On May 6, 2011, at 10:42 PM, Michael Lawrence wrote: > > > > On Fri, May 6, 2011 at 2:54 PM, Arnaud Amzallag< > arnaud.amzallag at gmail.com > <mailto:arnaud.amzallag at="" gmail.com="">> wrote: > > Dear IRanges developers, > > runsum is a very fast and convenient function to > compute on Rle > coverages, for instance. However when it is run on > several chromosomes and > several samples, it can get very memory intensive. > For instance on human > chromosome 1, it outputs a vector of length 250 > millions, so for several > full genomes it is quickly billions of numbers in > memory. > > > I would have expected the result to be an Rle, which > would be fairly > memory efficient. > > > However, often you don't need a single base > resolution. I wanted to > suggest, if it is possible, to add a parameter by > which one could have the > sliding window to slide by a user defined step, > rather than always "step=1", > as it is now. Such that runsum(myRle, k=1e4, step = > 1000) would return the > equivalent of a wig file, for each 10 kilobases of > the genome, without > saturating the memory of the server. > > I tried with sum(Views(myRle, ir)), it is less > memory intensive but it is > much slower. So that amelioration would give the > best of both worlds, fast > and memory efficient. > > > Have you tried viewSums(Views(myRle, ir))? > > > kind regards, > > Arnaud Amzallag > Research Fellow > Mass general Cancer Center / Harvard Medical school > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > > > > > [[alternative HTML version deleted]] > > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 13.8 years ago Hervé Pagès 16k

0

Entering edit mode

2011/7/8 Hervé Pagès <hpages@fhcrc.org> > Hi Michael, > > On 11-06-29 03:29 PM, Michael Lawrence wrote: > >> >> >> 2011/6/28 Hervé Pagès <hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >> >> >> Hi Michael, Arnaud, >> >> >> On 11-05-25 08:21 AM, Arnaud Amzallag wrote: >> >> Thank you Michael, >> >> for some email filters reasons I saw your reply only now. >> >> I recon that in my case that would have been much smoother if >> sum() would >> call viewSums() by default and I agree that "It's more intuitive >> to think of >> an RleViews like an RleList rather than an IntegerList.". I >> would support >> that change. >> >> >> I've recently made some changes in that direction in IRanges devel. >> This subtle shift in the nature of an RleViews (from IntegerList >> to RleList) has some deep consequences on the hierarchy of classes >> in IRanges. The most important one is that the Views class doesn't >> extend the IRanges class anymore. Views is now a direct subclass >> of List. This means that you cannot directly manipulate a Views object >> as if it was an IRanges object but you must extract its ranges (with >> ranges()) first. >> >> >> This is pretty unfortunate. We lose the ability to e.g. store the >> coverage inside RangedData (because a ViewsList is a RangesList). Is it >> really necessary for what was proposed? >> > > Because of the inheritance tree. When Views was a subclass of IRanges we > had: > > RleViews -> Views -> IRanges -> IntegerList -> List > > but [[ on a Views object would not return an integer vector like it does > on an IRanges object (and the reason IRanges is a child of IntegerList > is precisely because of what [[ does on an IRanges object). > > So a Views object should not be seen as an IntegerList object, but just > as a List object with an elementType that depends on the particular > type of Views object (e.g. elementType="Rle" for RleViews). Then [[ on > RleViews can return an Rle without contradiction. > > What about breaking the IRanges -> IntegerList inheritance? Conceptually, it's nice to think of Ranges as being an IntegerList, but practically it's much nicer having a Views be a Ranges. It's convenient to think of a Views being a set of ranges with a subject vector, and being able to treat it transparently as such was a really nice feature of IRanges. This change will break a lot of code. I could see someone wanting to implement an IntegerList with a Ranges (when dealing with integer sequences), but what sort of practical use cases are there for treating a Ranges as an IntegerList? I admit it's not so nice to modify Ranges to accommodate Views. Can't we just live with the contradiction? At least as far as this thread is concerned, add a smarter sum() on Views was all that was needed. Michael H. > > >> >> This is a work-in-progress and a few things still need to be polished. >> >> Also I agree that we should just have "min", "max", "sum", "mean" etc >> methods for Views objects. No need for viewMins, viewMaxs, viewSums etc >> I'll change this. >> >> Cheers, >> H. >> >> >> Also it is possible that before I was summing the values of the >> Rle and did >> not notice the difference because my Rle was made of a lot of >> very short Rle >> lengths. >> >> Arnaud >> >> On Tue, May 10, 2011 at 8:44 AM, Michael >> Lawrence<lawrence.michael@__ge**ne.com <http:="" gene.com=""> >> <mailto:lawrence.michael@gene.**com <lawrence.michael@gene.com="">> >> >> >> wrote: >> >> >> Good to hear that helped. One might expect sum() to simply >> call viewSums(), >> but the semantics are a bit strange here. The reason sum() >> works on Views is >> that a Views is a Ranges and thus an IntegerList (where each >> range encodes a >> sequence of integers). The weird thing is that the elements >> of a Views are >> not the sequence of integers covered but rather the values >> in the Rle. That >> everything works as you expected is just a coincidence of >> dispatch. >> >> For usability we should probably have max(), min(), and >> sum() just use >> viewMaxs, viewMins and viewSums. It's more intuitive to >> think of an RleViews >> like an RleList rather than an IntegerList. >> >> >> On Sun, May 8, 2011 at 1:44 PM, Arnaud >> Amzallag<arnaud.amzallag@__gma**il.com <http:="" gmail.com=""> >> <mailto:arnaud.amzallag@gmail.**com<arnaud.amzallag@gmail.com> >> > >> >> >> wrote: >> >> >> Thank you Michael, the function viewSums was exactly >> what I needed ! >> >> 0.014 seconds for viewSums(Views(myrle, ir)) vs 54 >> seconds for >> sum(Views(myrle, ir)) on chr22, one sample. I use this >> now instead of of >> runsum, no problem of memory, and probably even faster. >> for full the genome >> on many samples that will surely help. Maybe I should >> have read a bit more >> about the Views. >> >> About the result of runsum, I did see a lot of memory >> usage when I split >> the process with mclapply. The result is indeed a Rle. >> After looking closer, >> the resuting Rle has much more runs that the original >> one. That makes sense, >> because runsum is a kind of smoothing function, and the >> resulting signal has >> much more levels than the original one. >> >> Kind regards, >> >> Arnaud >> >> On May 6, 2011, at 10:42 PM, Michael Lawrence wrote: >> >> >> >> On Fri, May 6, 2011 at 2:54 PM, Arnaud Amzallag< >> arnaud.amzallag@gmail.com >> <mailto:arnaud.amzallag@gmail.**com<arnaud.amzallag@gmail.com>>> >> wrote: >> >> >> Dear IRanges developers, >> >> runsum is a very fast and convenient function to >> compute on Rle >> coverages, for instance. However when it is run on >> several chromosomes and >> several samples, it can get very memory intensive. >> For instance on human >> chromosome 1, it outputs a vector of length 250 >> millions, so for several >> full genomes it is quickly billions of numbers in >> memory. >> >> >> I would have expected the result to be an Rle, which >> would be fairly >> memory efficient. >> >> >> However, often you don't need a single base >> resolution. I wanted to >> suggest, if it is possible, to add a parameter by >> which one could have the >> sliding window to slide by a user defined step, >> rather than always "step=1", >> as it is now. Such that runsum(myRle, k=1e4, step = >> 1000) would return the >> equivalent of a wig file, for each 10 kilobases of >> the genome, without >> saturating the memory of the server. >> >> I tried with sum(Views(myRle, ir)), it is less >> memory intensive but it is >> much slower. So that amelioration would give the >> best of both worlds, fast >> and memory efficient. >> >> >> Have you tried viewSums(Views(myRle, ir))? >> >> >> kind regards, >> >> Arnaud Amzallag >> Research Fellow >> Mass general Cancer Center / Harvard Medical school >> ______________________________**___________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> > >> >> https://stat.ethz.ch/mailman/_**_listinfo/biocon ductor<https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> <https: stat.ethz.ch="" mailman="" **listinfo="" biocond="" uctor<https:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> > >> Search the archives: >> http://news.gmane.org/gmane.__** >> science.biology.informatics.__**conductor<http: news.gmane.org="" gma="" ne.__science.biology.informatics.__conductor=""> >> <http: news.gmane.org="" gmane.**="">> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > >> >> >> >> >> >> >> [[alternative HTML version deleted]] >> >> >> ______________________________**___________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> > >> >> https://stat.ethz.ch/mailman/_**_listinfo/bioconductor<https :="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<https:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> > >> Search the archives: >> http://news.gmane.org/gmane.__**science.biology.informatics.__** >> conductor<http: news.gmane.org="" gmane.__science.biology.informatics="" .__conductor=""> >> <http: news.gmane.org="" gmane.**science.biology.informatics.**="">> conductor<http: news.gmane.org="" gmane.science.biology.informatics.c="" onductor=""> >> > >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]

ADD REPLY • link 13.8 years ago Michael Lawrence ★ 11k

0

Entering edit mode

On 11-07-08 09:58 AM, Michael Lawrence wrote: > > > 2011/7/8 Hervé Pagès <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > > Hi Michael, > > On 11-06-29 03:29 PM, Michael Lawrence wrote: > > > > 2011/6/28 Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">>> > > > Hi Michael, Arnaud, > > > On 11-05-25 08:21 AM, Arnaud Amzallag wrote: > > Thank you Michael, > > for some email filters reasons I saw your reply only now. > > I recon that in my case that would have been much > smoother if > sum() would > call viewSums() by default and I agree that "It's more > intuitive > to think of > an RleViews like an RleList rather than an IntegerList.". I > would support > that change. > > > I've recently made some changes in that direction in IRanges > devel. > This subtle shift in the nature of an RleViews (from IntegerList > to RleList) has some deep consequences on the hierarchy of > classes > in IRanges. The most important one is that the Views class > doesn't > extend the IRanges class anymore. Views is now a direct subclass > of List. This means that you cannot directly manipulate a > Views object > as if it was an IRanges object but you must extract its > ranges (with > ranges()) first. > > > This is pretty unfortunate. We lose the ability to e.g. store the > coverage inside RangedData (because a ViewsList is a > RangesList). Is it > really necessary for what was proposed? > > > Because of the inheritance tree. When Views was a subclass of IRanges we > had: > > RleViews -> Views -> IRanges -> IntegerList -> List > > but [[ on a Views object would not return an integer vector like it does > on an IRanges object (and the reason IRanges is a child of IntegerList > is precisely because of what [[ does on an IRanges object). > > So a Views object should not be seen as an IntegerList object, but just > as a List object with an elementType that depends on the particular > type of Views object (e.g. elementType="Rle" for RleViews). Then [[ on > RleViews can return an Rle without contradiction. > > > What about breaking the IRanges -> IntegerList inheritance? > Conceptually, it's nice to think of Ranges as being an IntegerList, but > practically it's much nicer having a Views be a Ranges. It's convenient > to think of a Views being a set of ranges with a subject vector, and > being able to treat it transparently as such was a really nice feature > of IRanges. This change will break a lot of code. Most of the times we will still be able to treat a Views object as an IRanges object because I re-implemented most of the methods in the IRanges interface to make them work on Views. It's a very light kind of re-implementation that basically delegates to the method for IRanges e.g.: setMethod("width", "Views", function(x) width(ranges(x))) I did this for a small subset of methods in the IRanges interface and so far it didn't seem to break anything (except the chipseq package, but it's easy to fix, will do). I found the price to pay for having a cleaner class hierarchy acceptable. > > I could see someone wanting to implement an IntegerList with a Ranges > (when dealing with integer sequences), but what sort of practical use > cases are there for treating a Ranges as an IntegerList? For the sake of clarity I simplified the inheritance diagram above but IRanges doesn't inherit directly from IntegerList, it does so via the Ranges class: IRanges -> Ranges -> IntegerList -> List Problem is there are other classes like PartitioningByEnd that inherit from Ranges that also benefit from being seen (and treated) as IntegerList objects (because of what [[ does on them). > > I admit it's not so nice to modify Ranges to accommodate Views. Can't we > just live with the contradiction? At least as far as this thread is > concerned, add a smarter sum() on Views was all that was needed. I actually started to modify Views before I saw this thread. Here is the original motivation: hpages at latitude:~$ svn log -r 55988 https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/IRanges ---------------------------------------------------------------------- -- r55988 | hpages at fhcrc.org | 2011-06-04 00:20:43 -0700 (Sat, 04 Jun 2011) | 10 lines Views is now a direct subclass of List and not a direct subclass of IRanges anymore (the ranges of the views are now stored in a new "ranges" slot which is of type IRanges). Before this change we had: RleViews -> Views -> IRanges -> Ranges -> IntegerList -> AtomicList -> List. So any RleViews object (e.g. 'character'-RleViews) was regarded as an IntegerList object which was conceptually wrong (and leaded to some schizophrenic decisions about what the elementType of the various Views objects should be). After this change we just have: RleViews -> Views -> List so the problem is gone. This change should be mostly transparent to the end user. ---------------------------------------------------------------------- -- H. > > Michael > > H. > > > > This is a work-in-progress and a few things still need to be > polished. > > Also I agree that we should just have "min", "max", "sum", > "mean" etc > methods for Views objects. No need for viewMins, viewMaxs, > viewSums etc > I'll change this. > > Cheers, > H. > > > Also it is possible that before I was summing the values > of the > Rle and did > not notice the difference because my Rle was made of a > lot of > very short Rle > lengths. > > Arnaud > > On Tue, May 10, 2011 at 8:44 AM, Michael > Lawrence<lawrence.michael at="" __ge__ne.com="" <http:="" gene.com=""> > <mailto:lawrence.michael at="" gene.__com=""> <mailto:lawrence.michael at="" gene.com="">> > > > wrote: > > > Good to hear that helped. One might expect sum() to > simply > call viewSums(), > but the semantics are a bit strange here. The reason > sum() > works on Views is > that a Views is a Ranges and thus an IntegerList > (where each > range encodes a > sequence of integers). The weird thing is that the > elements > of a Views are > not the sequence of integers covered but rather the > values > in the Rle. That > everything works as you expected is just a > coincidence of > dispatch. > > For usability we should probably have max(), min(), and > sum() just use > viewMaxs, viewMins and viewSums. It's more intuitive to > think of an RleViews > like an RleList rather than an IntegerList. > > > On Sun, May 8, 2011 at 1:44 PM, Arnaud > Amzallag<arnaud.amzallag at="" __gma__il.com=""> <http: gmail.com=""> > <mailto:arnaud.amzallag at="" gmail.__com=""> <mailto:arnaud.amzallag at="" gmail.com="">> > > > wrote: > > > Thank you Michael, the function viewSums was exactly > what I needed ! > > 0.014 seconds for viewSums(Views(myrle, ir)) vs 54 > seconds for > sum(Views(myrle, ir)) on chr22, one sample. I > use this > now instead of of > runsum, no problem of memory, and probably even > faster. > for full the genome > on many samples that will surely help. Maybe I > should > have read a bit more > about the Views. > > About the result of runsum, I did see a lot of > memory > usage when I split > the process with mclapply. The result is indeed > a Rle. > After looking closer, > the resuting Rle has much more runs that the > original > one. That makes sense, > because runsum is a kind of smoothing function, > and the > resulting signal has > much more levels than the original one. > > Kind regards, > > Arnaud > > On May 6, 2011, at 10:42 PM, Michael Lawrence wrote: > > > > On Fri, May 6, 2011 at 2:54 PM, Arnaud Amzallag< > arnaud.amzallag at gmail.com <mailto:arnaud.amzallag at="" gmail.com=""> > <mailto:arnaud.amzallag at="" gmail.__com=""> <mailto:arnaud.amzallag at="" gmail.com="">>> wrote: > > > Dear IRanges developers, > > runsum is a very fast and convenient function to > compute on Rle > coverages, for instance. However when it is > run on > several chromosomes and > several samples, it can get very memory > intensive. > For instance on human > chromosome 1, it outputs a vector of length 250 > millions, so for several > full genomes it is quickly billions of > numbers in > memory. > > > I would have expected the result to be an Rle, which > would be fairly > memory efficient. > > > However, often you don't need a single base > resolution. I wanted to > suggest, if it is possible, to add a > parameter by > which one could have the > sliding window to slide by a user defined step, > rather than always "step=1", > as it is now. Such that runsum(myRle, k=1e4, > step = > 1000) would return the > equivalent of a wig file, for each 10 > kilobases of > the genome, without > saturating the memory of the server. > > I tried with sum(Views(myRle, ir)), it is less > memory intensive but it is > much slower. So that amelioration would give the > best of both worlds, fast > and memory efficient. > > > Have you tried viewSums(Views(myRle, ir))? > > > kind regards, > > Arnaud Amzallag > Research Fellow > Mass general Cancer Center / Harvard Medical > school > > ___________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">> > > https://stat.ethz.ch/mailman/____listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">> > Search the archives: > http://news.gmane.org/gmane.____science.biology.informatics. ____conductor > <http: news.gmane.org="" gmane.__science.biology.informatics._="" _conductor=""> > <http: news.gmane.org="" gmane.__science.biology.informatics.__conductor=""> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">> > > > > > > > [[alternative HTML version deleted]] > > > ___________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">> > > https://stat.ethz.ch/mailman/____listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">> > Search the archives: > http://news.gmane.org/gmane.____science.biology.informatics. ____conductor > <http: news.gmane.org="" gmane.__science.biology.informatics._="" _conductor=""> > <http: news.gmane.org="" gmane.__science.biology.informatics.__conductor=""> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">> > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.8 years ago Hervé Pagès 16k

0

Entering edit mode

2011/7/8 Hervé Pagès <hpages@fhcrc.org> > On 11-07-08 09:58 AM, Michael Lawrence wrote: > >> >> >> 2011/7/8 Hervé Pagès <hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >> >> >> Hi Michael, >> >> On 11-06-29 03:29 PM, Michael Lawrence wrote: >> >> >> >> 2011/6/28 Hervé Pagès <hpages@fhcrc.org>> <mailto:hpages@fhcrc.org> <mailto:hpages@fhcrc.org>> >> <mailto:hpages@fhcrc.org>>> >> >> >> Hi Michael, Arnaud, >> >> >> On 11-05-25 08:21 AM, Arnaud Amzallag wrote: >> >> Thank you Michael, >> >> for some email filters reasons I saw your reply only now. >> >> I recon that in my case that would have been much >> smoother if >> sum() would >> call viewSums() by default and I agree that "It's more >> intuitive >> to think of >> an RleViews like an RleList rather than an IntegerList.". I >> would support >> that change. >> >> >> I've recently made some changes in that direction in IRanges >> devel. >> This subtle shift in the nature of an RleViews (from >> IntegerList >> to RleList) has some deep consequences on the hierarchy of >> classes >> in IRanges. The most important one is that the Views class >> doesn't >> extend the IRanges class anymore. Views is now a direct >> subclass >> of List. This means that you cannot directly manipulate a >> Views object >> as if it was an IRanges object but you must extract its >> ranges (with >> ranges()) first. >> >> >> This is pretty unfortunate. We lose the ability to e.g. store the >> coverage inside RangedData (because a ViewsList is a >> RangesList). Is it >> really necessary for what was proposed? >> >> >> Because of the inheritance tree. When Views was a subclass of IRanges >> we >> had: >> >> RleViews -> Views -> IRanges -> IntegerList -> List >> >> but [[ on a Views object would not return an integer vector like it >> does >> on an IRanges object (and the reason IRanges is a child of IntegerList >> is precisely because of what [[ does on an IRanges object). >> >> So a Views object should not be seen as an IntegerList object, but just >> as a List object with an elementType that depends on the particular >> type of Views object (e.g. elementType="Rle" for RleViews). Then [[ on >> RleViews can return an Rle without contradiction. >> >> >> What about breaking the IRanges -> IntegerList inheritance? >> Conceptually, it's nice to think of Ranges as being an IntegerList, but >> practically it's much nicer having a Views be a Ranges. It's convenient >> to think of a Views being a set of ranges with a subject vector, and >> being able to treat it transparently as such was a really nice feature >> of IRanges. This change will break a lot of code. >> > > Most of the times we will still be able to treat a Views object as an > IRanges object because I re-implemented most of the methods in the > IRanges interface to make them work on Views. It's a very light kind > of re-implementation that basically delegates to the method for > IRanges e.g.: > > setMethod("width", "Views", function(x) width(ranges(x))) > > I did this for a small subset of methods in the IRanges interface and > so far it didn't seem to break anything (except the chipseq package, > but it's easy to fix, will do). > > Not in any of the packages in Bioc, but there's a lot of code out in the wild. Anyway, maybe we could come up with some sort of class union between Views and Ranges, so that data structures could store one or the other. We could also have a List implementation for that union. Then e.g. RangedData could use that for its 'ranges' slot. Btw, I appreciate the work you've done to clean up the hierarchy. Thanks, Michael > I found the price to pay for having a cleaner class hierarchy > acceptable. > > > >> I could see someone wanting to implement an IntegerList with a Ranges >> (when dealing with integer sequences), but what sort of practical use >> cases are there for treating a Ranges as an IntegerList? >> > > For the sake of clarity I simplified the inheritance diagram above > but IRanges doesn't inherit directly from IntegerList, it does so via > the Ranges class: > > IRanges -> Ranges -> IntegerList -> List > > Problem is there are other classes like PartitioningByEnd that inherit > from Ranges that also benefit from being seen (and treated) as > IntegerList objects (because of what [[ does on them). > > > >> I admit it's not so nice to modify Ranges to accommodate Views. Can't we >> just live with the contradiction? At least as far as this thread is >> concerned, add a smarter sum() on Views was all that was needed. >> > > I actually started to modify Views before I saw this thread. Here is the > original motivation: > > hpages@latitude:~$ svn log -r 55988 https://hedgehog.fhcrc.org/** > bioconductor/trunk/madman/**Rpacks/IRanges<https: hedgehog.fhcrc.or="" g="" bioconductor="" trunk="" madman="" rpacks="" iranges=""> > ------------------------------**------------------------------** > ------------ > r55988 | hpages@fhcrc.org | 2011-06-04 00:20:43 -0700 (Sat, 04 Jun 2011) | > 10 lines > > Views is now a direct subclass of List and not a direct subclass of IRanges > anymore (the ranges of the views are now stored in a new "ranges" slot > which is > of type IRanges). Before this change we had: RleViews -> Views -> IRanges > -> > Ranges -> IntegerList -> AtomicList -> List. So any RleViews object (e.g. > 'character'-RleViews) was regarded as an IntegerList object which was > conceptually wrong (and leaded to some schizophrenic decisions about what > the > elementType of the various Views objects should be). After this change we > just > have: RleViews -> Views -> List so the problem is gone. > This change should be mostly transparent to the end user. > > ------------------------------**------------------------------** > ------------ > > H. > > >> Michael >> >> H. >> >> >> >> This is a work-in-progress and a few things still need to be >> polished. >> >> Also I agree that we should just have "min", "max", "sum", >> "mean" etc >> methods for Views objects. No need for viewMins, viewMaxs, >> viewSums etc >> I'll change this. >> >> Cheers, >> H. >> >> >> Also it is possible that before I was summing the values >> of the >> Rle and did >> not notice the difference because my Rle was made of a >> lot of >> very short Rle >> lengths. >> >> Arnaud >> >> On Tue, May 10, 2011 at 8:44 AM, Michael >> Lawrence<lawrence.michael@__ge**__ne.com<http: ge__ne.com="">< >> http://gene.com> >> >> <mailto:lawrence.michael@gene.**__com>> <mailto:lawrence.michael@gene.**com <lawrence.michael@gene.com="">>> >> >> >> wrote: >> >> >> Good to hear that helped. One might expect sum() to >> simply >> call viewSums(), >> but the semantics are a bit strange here. The reason >> sum() >> works on Views is >> that a Views is a Ranges and thus an IntegerList >> (where each >> range encodes a >> sequence of integers). The weird thing is that the >> elements >> of a Views are >> not the sequence of integers covered but rather the >> values >> in the Rle. That >> everything works as you expected is just a >> coincidence of >> dispatch. >> >> For usability we should probably have max(), min(), and >> sum() just use >> viewMaxs, viewMins and viewSums. It's more intuitive to >> think of an RleViews >> like an RleList rather than an IntegerList. >> >> >> On Sun, May 8, 2011 at 1:44 PM, Arnaud >> Amzallag<arnaud.amzallag@__gma**__il.com<http: gma__il.com=""> >> <http: gmail.com=""> >> >> <mailto:arnaud.amzallag@gmail.**__com>> <mailto:arnaud.amzallag@gmail.**com <arnaud.amzallag@gmail.com="">>> >> >> >> wrote: >> >> >> Thank you Michael, the function viewSums was >> exactly >> what I needed ! >> >> 0.014 seconds for viewSums(Views(myrle, ir)) vs 54 >> seconds for >> sum(Views(myrle, ir)) on chr22, one sample. I >> use this >> now instead of of >> runsum, no problem of memory, and probably even >> faster. >> for full the genome >> on many samples that will surely help. Maybe I >> should >> have read a bit more >> about the Views. >> >> About the result of runsum, I did see a lot of >> memory >> usage when I split >> the process with mclapply. The result is indeed >> a Rle. >> After looking closer, >> the resuting Rle has much more runs that the >> original >> one. That makes sense, >> because runsum is a kind of smoothing function, >> and the >> resulting signal has >> much more levels than the original one. >> >> Kind regards, >> >> Arnaud >> >> On May 6, 2011, at 10:42 PM, Michael Lawrence >> wrote: >> >> >> >> On Fri, May 6, 2011 at 2:54 PM, Arnaud Amzallag< >> arnaud.amzallag@gmail.com <mailto:arnaud.amzallag@gmail.**com<arnaud.amzallag@gmail.com> >> > >> <mailto:arnaud.amzallag@gmail.**__com>> <mailto:arnaud.amzallag@gmail.**com <arnaud.amzallag@gmail.com="">>>> >> wrote: >> >> >> Dear IRanges developers, >> >> runsum is a very fast and convenient function >> to >> compute on Rle >> coverages, for instance. However when it is >> run on >> several chromosomes and >> several samples, it can get very memory >> intensive. >> For instance on human >> chromosome 1, it outputs a vector of length 250 >> millions, so for several >> full genomes it is quickly billions of >> numbers in >> memory. >> >> >> I would have expected the result to be an Rle, >> which >> would be fairly >> memory efficient. >> >> >> However, often you don't need a single base >> resolution. I wanted to >> suggest, if it is possible, to add a >> parameter by >> which one could have the >> sliding window to slide by a user defined step, >> rather than always "step=1", >> as it is now. Such that runsum(myRle, k=1e4, >> step = >> 1000) would return the >> equivalent of a wig file, for each 10 >> kilobases of >> the genome, without >> saturating the memory of the server. >> >> I tried with sum(Views(myRle, ir)), it is less >> memory intensive but it is >> much slower. So that amelioration would give >> the >> best of both worlds, fast >> and memory efficient. >> >> >> Have you tried viewSums(Views(myRle, ir))? >> >> >> kind regards, >> >> Arnaud Amzallag >> Research Fellow >> Mass general Cancer Center / Harvard Medical >> school >> >> ______________________________**_____________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> > >> <mailto:bioconductor@r-__**project.org<bioconductor@r-__project.org> >> <mailto:bioconductor@r-**project.org <bioconductor@r-project.org=""> >> >> >> >> https://stat.ethz.ch/mailman/_**___listinfo/bioconductor<htt ps:="" stat.ethz.ch="" mailman="" ____listinfo="" bioconductor=""> >> <https: stat.ethz.ch="" mailman="" **__listinfo="" bioconductor<http="" s:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> > >> <https: stat.ethz.ch="" mailman="" **__listinfo="" bioconductor<http="" s:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<https:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> >> >> Search the archives: >> http://news.gmane.org/gmane.__**__science.biology.informatics.** >> ____conductor<http: news.gmane.org="" gmane.____science.biology.infor="" matics.____conductor=""> >> <http: news.gmane.org="" gmane._**_science.biology.informatics._**="">> _conductor<http: news.gmane.org="" gmane.__science.biology.informatic="" s.__conductor=""> >> > >> <http: news.gmane.org="" gmane._**_science.biology.informatics._**="">> _conductor<http: news.gmane.org="" gmane.__science.biology.informatic="" s.__conductor=""> >> <http: news.gmane.org="" gmane.**science.biology.informatics.**="">> conductor<http: news.gmane.org="" gmane.science.biology.informatics.c="" onductor=""> >> >> >> >> >> >> >> >> >> [[alternative HTML version deleted]] >> >> >> ______________________________**_____________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> > >> <mailto:bioconductor@r-__**project.org<bioconductor@r-__project.org> >> <mailto:bioconductor@r-**project.org <bioconductor@r-project.org=""> >> >> >> >> https://stat.ethz.ch/mailman/_**___listinfo/bioconductor<htt ps:="" stat.ethz.ch="" mailman="" ____listinfo="" bioconductor=""> >> <https: stat.ethz.ch="" mailman="" **__listinfo="" bioconductor<http="" s:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> > >> <https: stat.ethz.ch="" mailman="" **__listinfo="" bioconductor<http="" s:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<https:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> >> >> Search the archives: >> http://news.gmane.org/gmane.__**__science.biology.informatics.** >> ____conductor<http: news.gmane.org="" gmane.____science.biology.infor="" matics.____conductor=""> >> <http: news.gmane.org="" gmane._**_science.biology.informatics._**="">> _conductor<http: news.gmane.org="" gmane.__science.biology.informatic="" s.__conductor=""> >> > >> <http: news.gmane.org="" gmane._**_science.biology.informatics._**="">> _conductor<http: news.gmane.org="" gmane.__science.biology.informatic="" s.__conductor=""> >> <http: news.gmane.org="" gmane.**science.biology.informatics.**="">> conductor<http: news.gmane.org="" gmane.science.biology.informatics.c="" onductor=""> >> >> >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >> >> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >> >> >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >> >> >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]

ADD REPLY • link 13.8 years ago Michael Lawrence ★ 11k

Login before adding your answer.