I am running an R script that uses the package MethylMix to download and preprocess all the available methylation data sets from TCGA. However, when I try to process the 450K breast cancer methylation data set (size ~13GB), I get a "Cannot allocate vector of size 12.8 GB" error.
I am running R 3.4.0 on 64-bit x86_64-pc-linux-gnu on my school's computing cluster, where each node has the following properties:
- Dual Socket
- Xeon E5-2690 v3 (Haswell) : 12 cores per socket (24 cores/node), 2.6 GHz
- 64 GB DDR4-2133 (8 x 8GB dual rank x8 DIMMs)
- No local disk
- Hyperthreading Enabled - 48 threads (logical CPUs) per node
so it seems as though there should be enough memory for this operation. Since the operating system is Linux, I expected R to be able to use all of the node's available memory (unlike on Windows, where there is a memory limit). Checking the process limits with ulimit returns "unlimited." I am not sure where the problem lies. My script is a loop that iterates over all the cancers available on TCGA, in case that makes a difference.
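In case it helps with diagnosis, something like the following (a sketch only, assuming the nodes expose /proc/meminfo, which they should on Linux) could log how much memory the OS actually reports as available right before the preprocessing step:

# Sketch: read MemAvailable from /proc/meminfo (reported in kB) and convert to GB,
# to log how much memory the OS says is free right before the heavy step.
mem_available_gb <- function() {
  meminfo <- readLines("/proc/meminfo")
  avail   <- grep("^MemAvailable:", meminfo, value = TRUE)
  as.numeric(gsub("[^0-9]", "", avail)) / 1024^2  # kB -> GB
}
cat(sprintf("MemAvailable before preprocessing: %.1f GB\n", mem_available_gb()))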
sessionInfo():
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Matrix products: default
BLAS/LAPACK: /opt/apps/intel/16.0.1.150/compilers_and_libraries_2016.1.150/linux/mkl/lib/intel64_lin/libmkl_intel_lp64.so
attached base packages: stats, graphics, grDevices, utils, datasets, methods, base
loaded via a namespace (and not attached): compiler_3.4.0
Code (inside for() loop):
# Download methylation data.
METdirectories <- tryCatch(
  {
    Download_DNAmethylation(i, paste0(targetDirectory, "/Methylation/"))
  }, warning = function(w) {
    # For warnings, write them to the output file.
    cat(paste("Warning in cancer", i, "when downloading methylation data:", conditionMessage(w)),
        file = logfile, append = TRUE, sep = "\n")
  }, error = function(e) {
    # For errors, write them to the output file and then skip to the next cancer.
    cat(paste("Error in cancer", i, "when downloading methylation data:", conditionMessage(e)),
        file = logfile, append = TRUE, sep = "\n")
    return(NULL)
  }, finally = {
    # If everything went all right, make a note of that in the output file.
    cat(paste("Successfully downloaded methylation data for cancer", i),
        file = logfile, append = TRUE, sep = "\n")
  }
)
if (is.null(METdirectories)) next

# Process methylation data.
METProcessedData <- tryCatch(
  {
    Preprocess_DNAmethylation(i, METdirectories)
  }, warning = function(w) {
    # For warnings, write them to the output file.
    cat(paste("Warning in cancer", i, "when processing methylation data:", conditionMessage(w)),
        file = logfile, append = TRUE, sep = "\n")
  }, error = function(e) {
    # For errors, write them to the output file and then skip to the next cancer.
    cat(paste("Error in cancer", i, "when processing methylation data:", conditionMessage(e)),
        file = logfile, append = TRUE, sep = "\n")
    return(NULL)
  }, finally = {
    # If everything went all right, make a note of that in the output file.
    cat(paste("Successfully processed methylation data for cancer", i),
        file = logfile, append = TRUE, sep = "\n")
  }
)
if (is.null(METProcessedData)) next

# Save methylation processed data.
saveRDS(METProcessedData,
        file = paste0(targetDirectory, "/Methylation/", "MET_", i, "_Processed.rds"))
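One change I am considering (a sketch only, not something MethylMix requires) is to explicitly drop the large object and force a garbage collection at the end of each iteration, so that memory used for one cancer type is released before the next one is processed:

# Sketch: after saving, drop the processed object and force a garbage
# collection so its memory is returned before the next iteration of the loop.
rm(METProcessedData)
invisible(gc())  # gc() also returns a summary matrix of memory use; discarded here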
Out of curiosity I thought I'd run this on a machine with 1 TB of RAM to see how much memory is actually required. My first comment is that I can see why you'd want to do this on a cluster: after 24 hours I've only managed to process about 10% of the breast cancer data set. More pertinently, here is the last line of the processing output, followed by a call to gc() after I stopped the process:

[processing output and gc() table omitted]

You can see that the maximum amount of memory used is ~80 GB, which is why you're running out of room on your 64 GB cluster nodes.
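In case it is useful, here is a minimal illustration (not the actual numbers from this run) of where that peak figure comes from in the gc() summary:

# gc() returns a matrix with one row for cons cells (Ncells) and one for the
# vector heap (Vcells); the last column is the maximum space used so far, in Mb.
# Summing the two rows gives the peak footprint of the R process's heap.
g <- gc()
peak_mb <- sum(g[, ncol(g)])
cat(sprintf("Peak R heap usage so far: %.0f Mb\n", peak_mb))
# gc(reset = TRUE) resets these maxima, which is handy for measuring one step at a time.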