pooling for parallel hierarchical operations
@michael-lawrence-3846
We often execute nested operations in parallel: for example, first by sample, then by chromosome. Fixed allocation of resources to each level will often result in waste. For example, if one sample finishes quickly, its CPUs are not available to help the other samples along.

Perhaps the most expedient solution is to expand.grid() the hierarchy and create one job for every combination, i.e., flatten the hierarchy. A more ideal solution might be a pool of resources (cores) that is allocated more fluidly. Is there any sort of pooling system for R? I know that the parallel package supports the declaration of resources in cluster objects, but there is no central manager.

This is a general R question, but it's worth discussing in the context of how we can make better use of parallelism in the low-level infrastructure, which would cause these hierarchies to arise. It's also relevant to the discussion of specifying parallelization modes or strategies. Pools themselves could be hierarchical and heterogeneous (hosts, cores). Declaring available resources is fairly straightforward. Deciding how to use them is context-dependent and requires user control.

Michael
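For readers who want to see the flattening approach concretely, here is a minimal sketch, not anyone's actual pipeline. The sample and chromosome names and the `process_one()` helper are hypothetical placeholders, and the two-worker local cluster is an arbitrary choice. It uses `clusterApplyLB()` from the parallel package, which hands the next job to whichever worker becomes free, so no worker sits idle while jobs remain.

```r
library(parallel)

# Hypothetical sample and chromosome names, for illustration only.
samples <- c("sampleA", "sampleB", "sampleC")
chromosomes <- c("chr1", "chr2", "chrX")

# One row per (sample, chromosome) combination: the flattened hierarchy.
jobs <- expand.grid(sample = samples, chromosome = chromosomes,
                    stringsAsFactors = FALSE)

# Placeholder for the real per-sample, per-chromosome work.
process_one <- function(i, jobs) {
  paste(jobs$sample[i], jobs$chromosome[i], sep = ":")
}

# A small local cluster; two workers is an arbitrary choice.
cl <- makeCluster(2)

# Load-balanced apply: each free worker takes the next unprocessed job.
results <- clusterApplyLB(cl, seq_len(nrow(jobs)), process_one, jobs = jobs)
stopCluster(cl)

# Results come back in job order, so they line up with the rows of 'jobs'.
jobs$result <- unlist(results)
```

This only balances load across a single flat list of jobs; it does not give a shared pool that nested levels can draw on, which is the open question in the post.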
Malcolm Cook ★ 1.6k
@malcolm-cook-6293
Michael,

Have you seen http://cran.r-project.org/web/packages/doRedis/index.html ?

If you take a look and come across a description of internals/architecture, please share.

Cheers,

~Malcolm
I hadn't seen that, thanks. It looks like a nice mechanism for passing messages and sharing data between multiple clients. It would be interesting if someone created an R environment with a dynamic object-tables-based backend that shared data via redis. I don't see anything about managing resource pools, though.

Michael
@martin-morgan-1513
Hi Michael,

I don't really have an answer for you, but it sounds like you're looking for a scheduler, with the idea that the 'workers' each have a deque of tasks that they are responsible for, but with some kind of collaboration between workers to balance tasks. I don't think the user should influence (or have to influence) the scheduler; it mostly just does the right thing. I think it would be good to develop scheduler(s) orthogonal to the parallel algorithm (lapply, pvec, map/reduce, etc.).

I've started a BiocParallel package in Bioconductor's svn and on github

https://github.com/Bioconductor/BiocParallel

so that might provide a place to focus this development; I'd encourage use of github and its social coding as the primary means for development at this time.

Martin

--
Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
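To make the deque idea concrete, here is a toy, single-process simulation of one possible balancing policy (work stealing). It is only a sketch under stated assumptions, not how BiocParallel or any existing R package schedules work: `simulate_stealing()` and the task costs are hypothetical, each worker runs tasks from the front of its own queue, and an idle worker takes a task from the back of the longest remaining queue.

```r
# Toy simulation: each worker owns a queue of task costs (arbitrary time units).
# Returns the time at which the last task finishes (the makespan).
simulate_stealing <- function(queues) {
  free_at <- numeric(length(queues))     # time at which each worker is next free
  repeat {
    remaining <- lengths(queues)
    if (all(remaining == 0)) break
    w <- which.min(free_at)              # the worker that becomes free first
    if (remaining[w] == 0) {
      # Own queue is empty: steal one task from the back of the longest queue.
      victim <- which.max(remaining)
      n <- remaining[victim]
      queues[[w]] <- queues[[victim]][n]
      queues[[victim]] <- queues[[victim]][-n]
    }
    # Run the task at the front of the worker's own queue.
    free_at[w] <- free_at[w] + queues[[w]][1]
    queues[[w]] <- queues[[w]][-1]
  }
  max(free_at)
}

# Hypothetical workload: one quick sample, one slow sample, one worker each.
queues <- list(sampleA = c(1, 1), sampleB = c(4, 4, 4, 4))

fixed  <- max(vapply(queues, sum, numeric(1)))  # no sharing between workers: 16
pooled <- simulate_stealing(queues)             # with stealing: 10
```

With fixed allocation the quick sample's worker is idle from time 2 onward; with stealing it picks up two of the slow sample's tasks, and the total time drops from 16 to 10 in this example.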
Thanks for setting this up. I think we might want to look into how other high-level languages have approached these issues. The user will need some high-level control. For example, only the user is going to know how much memory a job will consume. I'm sure there are heuristics and simplifying assumptions/constraints that will go a long way towards autonomy though.

Michael
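As one illustration of the kind of user-supplied hint mentioned above, here is a small sketch that caps the number of concurrent workers by a memory budget. The function name `workers_for()` and all of the numbers are hypothetical; a real scheduler would need to obtain the core count and available memory from the system or from the user.

```r
# Number of jobs that can run at once, given a core count, the total memory
# available, and a user-supplied estimate of memory per job (all in GB).
workers_for <- function(cores, total_mem_gb, mem_per_job_gb) {
  by_memory <- as.integer(floor(total_mem_gb / mem_per_job_gb))
  max(1L, min(as.integer(cores), by_memory))  # always allow at least one job
}

# 16 cores, 64 GB of RAM, jobs estimated at 12 GB each: memory is the limit.
n_workers <- workers_for(cores = 16, total_mem_gb = 64, mem_per_job_gb = 12)  # 5
```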
I've got ten bucks on Michael coming back 2-3 weeks from now with his own bioakkaR library:

http://akka.io

Who's in?

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact