High-performance Bioconductor experiments


A.J. Rossini ▴ 810

@aj-rossini-209


"Michael Benjamin" <msb1129@bellsouth.net> writes:

> Progress update (summarized from my forum for such matters at
> http://www.theschedule.net/forum/gforum.cgi?forum=20&do=forum_view):
>
> Briefly, I created a four-node cluster out of Pentium-III boxes and
> Debian Linux/openMosix. I saw no significant performance boost of
> ReadAffy or expresso using the set of 165 .CEL files from Harvard. None
> of the processes migrated, as they say in the world of high-performance
> computing. R.bin runs in one process, and everything it does seems to
> stay in that process. No real opportunity for parallelization here, at
> least not on openMosix.
>
> I'd like to analyze these chips in a reasonable amount of time, without
> paying Dell $45,000 for 4-Xeon SMP server.
>
> I worry what we'll do with 1,000 .CEL files. The analytical techniques
> work well, but pretty slow even if your amp "goes to 11."
>
> Any thoughts?

Explicitly parallelize the routine. openMosix is nice, but it's still not a production environment with R. That's why Michael Li and I wrote RPVM/RSPRNG as well as worked with Luke Tierney on SNOW.

The tools are there, but someone has to do the programming. That means that you can hire someone with the money you won't spend on software or hardware, or you can wait.

That being said, the 4-way Xeon server isn't going to help with parallelization of a single process, and you'd get the same work done with a remote execution shell (i.e. firing off R BATCH or using Emacs/ESS-Elsewhere on other machines).

The data-sharing/locale problem is an interesting one that will need to be solved. Not sure how we'll go about that. See our tech report for an anecdotal example of how one can naively end up twice as slow on the parallel system (later pathological examples that I've constructed show slowness increasing a bit in the number of processors) due to sending data "over the wire" between machines.
best,
-tony

--
rossini@u.washington.edu http://www.analytics.washington.edu/
Biomedical and Health Informatics, University of Washington
Biostatistics, SCHARP/HVTN, Fred Hutchinson Cancer Research Center
UW (Tu/Th/F): 206-616-7630 FAX=206-543-3461 | Voicemail is unreliable
FHCRC (M/W): 206-667-7025 FAX=206-667-4812 | use Email


ADD COMMENT • link 21.4 years ago A.J. Rossini ▴ 810
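Tony's point about ending up slower on the parallel system can be illustrated with a toy cost model. This is only a sketch under stated assumptions, not the model from the tech report: it assumes the master ships a full copy of the data to each worker in turn, so transfer time grows with the number of workers while compute time shrinks. The function name `run_time` and all numbers are hypothetical.

```r
# Toy cost model (an illustrative assumption, not taken from the tech report).
# compute:  total compute time on one processor, in seconds
# transfer: time to ship the data to ONE worker over the network, in seconds
# p:        number of workers; p = 1 means a local run with no transfer
run_time <- function(compute, transfer, p) {
  if (p == 1) return(compute)
  compute / p + transfer * p  # master sends to each worker one after another
}

# Hypothetical numbers: 100 s of compute, 60 s to ship the data once.
times <- sapply(c(1, 2, 4), run_time, compute = 100, transfer = 60)
names(times) <- c("p=1", "p=2", "p=4")
print(times)
```

With these made-up numbers the two-worker run takes 170 s against 100 s serially, and four workers take 265 s, which is the shape of slowdown described above. Sending only file names and letting each worker read its own data from local or shared disk is one way to shrink the transfer term.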


Warnes, Gregory R ▴ 460

@warnes-gregory-r-43


> From: rossini@blindglobe.net [mailto:rossini@blindglobe.net]
>
> "Michael Benjamin" <msb1129@bellsouth.net> writes:
> ...
> > I'd like to analyze these chips in a reasonable amount of time, without
> > paying Dell $45,000 for 4-Xeon SMP server.
> >
> > I worry what we'll do with 1,000 .CEL files. The analytical techniques
> > work well, but pretty slow even if your amp "goes to 11."
> >
> > Any thoughts?
>
> Explicitly parallelize the routine. OpenMOSIX is nice, but it's still
> not a production environment with R.

I've done some work to parallelize some things here at Pfizer. At the moment, I've concentrated on the step of applying a statistical model to all of the genes, and have code that parallelizes this process using RPVM + SNOW + a custom parallel 'apply' function. I get a speedup that looks perfectly linear for this step.

As for reading in and normalizing the chips, I would suggest using RPVM + SNOW to spread out the reading-in of the CEL files (which in my experience is the most time-consuming step), then combine the results into a single object, which you can then normalize and scale. The normalizing and scaling can, of course, also be split up across processors. At one point I had preliminary code to do this, but that was a year ago and the affy code has changed quite a bit since then.

-G

ADD COMMENT • link 21.4 years ago Warnes, Gregory R ▴ 460
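The per-gene modelling step Greg describes can be sketched roughly as below. This is not his custom parallel 'apply' function, which is not shown in the thread; it is a minimal sketch that assumes the `parallel` package shipped with base R (whose cluster functions derive from snow), a simulated expression matrix, and a hypothetical per-gene model `fit_gene` (a two-sample t-test).

```r
library(parallel)  # ships with base R; its cluster functions derive from snow

set.seed(1)
# Simulated stand-in for an expression matrix: 200 genes x 6 arrays.
expr  <- matrix(rnorm(200 * 6), nrow = 200)
group <- factor(rep(c("control", "treated"), each = 3))

# Hypothetical per-gene model: a two-sample t-test returning the p-value.
fit_gene <- function(y, group) t.test(y ~ group)$p.value

cl <- makeCluster(2)  # two local worker processes
# Rows (genes) are split across the workers; each worker fits its share.
pvals <- parApply(cl, expr, 1, fit_gene, group = group)
stopCluster(cl)

# Serial reference; the parallel result should match it.
serial <- apply(expr, 1, fit_gene, group = group)
```

Because each gene is fitted independently, the work divides cleanly, which is consistent with the near-linear speedup reported for this step. The one-off cost of sending the matrix to the workers is not modelled here.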


This is the most exciting, and cryptic, part of the message...

> which you can then normalize and scale. The normalizing and scaling
> can, of course also be split up across processors.

How?

I was able to use snow to split up the CEL file readings--it's actually not that hard.

    library(snow)
    cl <- makeCluster(2)
    # Read one CEL file on a worker; affy must be loaded there
    Readaffy <- function(x) {
      library(affy)
      ReadAffy(filenames = x)
    }
    Data <- clusterApply(cl, filenames, Readaffy)

I find that pvm is not as easy to use as openMosix, because it doesn't autodiscover (or does it?!). My idea is to make a multinode cluster on the base computer using RPVM, then have openMosix farm out the processes instead of relying on pvm to do that.

In other words, run RPVM on a single pvm node, multi-openMosix node. I'll try that experiment tomorrow.

ADD REPLY • link 21.4 years ago Michael Benjamin ▴ 120
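The read-then-combine pattern discussed above can be sketched without affy. The example below is a stand-in under stated assumptions: it uses the base `parallel` package, small CSV files written to a temporary directory in place of .CEL files, and a hypothetical reader `read_chip` where the thread would use `ReadAffy()`. It also assumes every worker can see the same file system.

```r
library(parallel)

# Stand-in data: write four small files to a temp directory. In the thread
# these would be .CEL files and the reader would be affy's ReadAffy().
demo_dir <- tempfile("cel_demo")
dir.create(demo_dir)
filenames <- file.path(demo_dir, sprintf("chip%d.csv", 1:4))
for (i in seq_along(filenames)) {
  write.csv(data.frame(intensity = i * (1:5)), filenames[i], row.names = FALSE)
}

# Hypothetical reader: returns one vector of intensities per file.
read_chip <- function(path) utils::read.csv(path)$intensity

cl <- makeCluster(2)
# Only file names travel to the workers; each worker reads from disk itself.
cols <- parLapply(cl, filenames, read_chip)
stopCluster(cl)

# Combine the per-file results into a single object (probes x chips).
intensities <- do.call(cbind, cols)
colnames(intensities) <- basename(filenames)
```

The results still have to travel back to the master, so this helps most when reading and parsing dominate the cost. Normalization that needs all chips at once would run on the combined object afterwards.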


This might be irrelevant or already well known, and if so please disregard. But I feel that several different issues are being discussed here.

It is my cursory understanding that although parallelization (pvm) and openMosix can coexist peacefully, parallelization of R might not be a trivial issue. Load balancing, however, can be achieved using LVS (Linux Virtual Server, http://www.linux-vs.org/), so separate R processes could be started on different CPUs, and then the results (the .RData?) combined, which might be along the lines of what Greg suggested. LVS and openMosix also seem to get along fine.

In our cluster, and given the mixed results we have found with the migration of R, we will probably use LVS (instead of relying on openMosix migrating R processes).

For Michael Benjamin's situation, a possible kluge (which I have not tried) would be to use LVS to run several R processes (i.e., as many processes as disjoint subsets of CEL files), and then combine the output.

R.

On Friday 12 December 2003 04:13, Michael Benjamin wrote:
> This is the most exciting, and cryptic, part of the message...
>
> > which you can then normalize and scale. The normalizing and scaling
> > can, of course also be split up across processors.
>
> How?
>
> I was able to use snow to split up the CEL file readings--it's actually
> not that hard.
>
>     library(snow)
>     cl <- makeCluster(2)
>     # Read one CEL file on a worker; affy must be loaded there
>     Readaffy <- function(x) {
>       library(affy)
>       ReadAffy(filenames = x)
>     }
>     Data <- clusterApply(cl, filenames, Readaffy)
>
> I find that pvm is not as easy to use as openMosix, because it doesn't
> autodiscover (or does it?!). My idea is to make a multinode cluster on
> the base computer using RPVM, then have openMosix farm out the processes
> instead of relying on pvm to do that.
>
> In other words, run RPVM on a single pvm node, multi-openMosix node.
> I'll try that experiment tomorrow.

--
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900
http://bioinfo.cnio.es/~rdiaz
PGP KeyID: 0xE89B3462 (http://bioinfo.cnio.es/~rdiaz/0xE89B3462.asc)

ADD REPLY • link 21.4 years ago Ramon Diaz ★ 1.1k
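The "disjoint subsets, separate R processes, then combine" idea can be sketched as follows. This is an untested-in-production illustration under stated assumptions: the file names are hypothetical, the separate R processes are simulated by a loop, the per-job "analysis" is a placeholder, and `saveRDS()`/`readRDS()` are used for the saved results where the post mentions .RData files.

```r
# Split a file list into disjoint subsets, one per R process.
filenames <- sprintf("chip%03d.CEL", 1:10)  # hypothetical file names
n_procs   <- 3
subsets   <- split(filenames, rep_len(seq_len(n_procs), length(filenames)))

# Each subset would be handled by its own R process (for example one batch
# job per subset). Here the jobs are simulated in a loop, and the "analysis"
# is a placeholder that just records which files the job was given.
out_dir <- tempfile("results")
dir.create(out_dir)
for (k in names(subsets)) {
  result <- data.frame(file = subsets[[k]], job = k)  # placeholder result
  saveRDS(result, file.path(out_dir, paste0("job", k, ".rds")))
}

# Combine step: read every per-job result back and stack them.
parts    <- lapply(list.files(out_dir, full.names = TRUE), readRDS)
combined <- do.call(rbind, parts)
```

This only works cleanly for steps that treat each chip independently. Anything that needs all chips together, such as a normalization across the whole set, would still have to run on the combined result.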


A.J. Rossini ▴ 810

@aj-rossini-209


"Michael Benjamin" <msb1129@bellsouth.net> writes:

> In other words, run RPVM on a single pvm node, multi-openMosix node.
> I'll try that experiment tomorrow.

Good luck with guaranteeing migration. I've not been able to do that recently with R.

best,
-tony

ADD COMMENT • link 21.4 years ago A.J. Rossini ▴ 810
