Question

Memory management in R

0

Entering edit mode

mrk.carty ▴ 30

@mrkcarty-7442

Last seen 9.5 years ago

United States

Hi everyone,

I've written a script to compute p-values based on model estimates from a hurdle regression using pscl package. I'm using the predict() function from the stats package to get estimated mean and zero probabilities need to calculate the p-values. the features are stored in big.memory matrix x which is on the order 1e6 observations. I'm running on 3.1.1 on torque scheduler, and my job is getting killed because I'm exceeding the memory limit. How can I reduce the memory footprint without having to increase the physical memory?

func <- function(counts,size,mean,phat)
    mapply(p.zanegbin, q=counts, size=size, munb=mean, pobs=phat)
time <- system.time({
    mu <- predict(fit, newdata = as.data.frame(x[,]),
                  dispersion = fit$theta**(-1))
    phat <- predprob(fit, newdata=as.data.frame(x[,]))[,1]
    
    idx <- chunk(seq_len(nrows),cores)
    pvals <- foreach(i=1:cores,.combine=c) %dopar% {
        func(x[idx[[i]],3], fit$theta, mu[idx[[i]]], phat[idx[[i]]])
    }
    adjusted <- p.adjust(pvals,method='fdr')
})[3]/60

=>> PBS: job killed: pvmem exceeded limit 6442450944

Terminated
-bash-4.1$ show(sprintf('Time required to estimate additional parameters : %3.2f mins',stime))
-bash: syntax error near unexpected token `sprintf'
-bash-4.1$
-bash-4.1$
qsub: job 2943092.mskcc-fe1.local completed
bash-4.1$

stats • 2.5k views

ADD COMMENT • link updated 10.1 years ago by Martin Morgan 25k • written 10.1 years ago by mrk.carty ▴ 30

score 2 · Answer 1 · 2015-03-30

This is not an easy question to answer. Measuring/controlling memory usage with software you have not written can be fairly challenging. I would say that you need to isolate the memory hungry elements of the computations you want to do, and understand why and how they use the memory that they do

http://adv-r.had.co.nz/memory.html

has useful information on measurement. The other thing you need to do is look at the environment and consider whether you are using it in a reasonable way. Are you "doing too much" concurrently? Could a partly serial/partly concurrent approach live safely within the 6GB that you seem to have? Could you do more with "out of memory" approaches? I have found the ff/ffbase packages useful for mixed concurrent/out-of-memory programming. If your mu and phat vectors consume significant memory and are being replicated in shared memory in the %dopar%, perhaps you can manage them out of memory. Finally, there are facilities in the BatchJobs package for getting information on memory consumption of distributed jobs. Using this package might help you to factor the task, distribute it in less hungry subtasks, and reduce the distributed results to the desired summary.

It has crossed my mind that we should form a special interest group concerning concurrent computing in Bioconductor, with a particular concern for documenting resource consumption for key tasks in a systematic way. For example, in the man page for findMateAlignments in GenomicAlignments, Herve Pages provides some estimates that I think could be very useful for many functions that we use.

\subsection{Timing and memory requirement of the pairing algorithm}{

The estimated timings and memory requirements on a modern Linux system are

(those numbers may vary depending on your hardware and OS):

\preformatted{

  nb of alignments |         time | required memory

  -----------------+--------------+----------------

        8 millions |       28 sec |          1.4 GB

       16 millions |       58 sec |          2.8 GB

       32 millions |        2 min |          5.6 GB

       64 millions | 4 min 30 sec |         11.2 GB

}

This is for a \link{GAlignments} object coming from a file with an

"average nb of records per unique QNAME" of 2.04. A value of 2 (which means

the file contains only primary reads) is optimal for the pairing algorithm.

A greater value, say > 3, will significantly degrade its performance.

An easy way to avoid this degradation is to load only primary alignments

by setting the \code{isSecondaryAlignment} flag to \code{FALSE} in

ScanBamParam().  See examples in \code{?\link{readGAlignmentPairs}} for how

to do this.

score 0 · Answer 2 · 2015-03-30

As an add on to Vincent's comment, I think it's important to reiterate the advice to look at BatchJobs to parallelize your code (actually, I guess the "official" answer should be to look at BiocParallel, but I'd use it for the BatchJobs backend, so just start with BatchJobs to keep your life simple for now).

Why BatchJobs?

To be honest it's hard for me to get a handle on what exactly is failing for you, but your use of foreach suggests that your problem is embarrassingly parallel, and you are trying to run all of the parallel steps on the same CPU node and you're running out of memory for your process.

Given that observation, you should be able to "quite easily" get the same result out of a code that is run on multiple nodes (jobs on the cluster), and splitting it up this way, each task will be an isolated run of R and your (apparent) 6GB memory limit for your task will apply to one run instead of all of your parallel bits running through foreach.

Another option is to see if you can request more memory for your current process that's running on your cluster and keep your code as it is. Could you rig up your torque script/request to so that the scheduler gives you a node with a higher memory limit?