I am running an R script that uses the package MethylMix to download and preprocess all the available methylation data sets from TCGA. However, when I try to process the 450K breast cancer methylation data set (size ~13GB), I get a "Cannot allocate vector of size 12.8 GB" error.
I am running R 3.4.0 on 64-bit x86_64-pc-linux-gnu on my school's computing cluster, where each node has the following properties:
- Dual Socket
- Xeon E5-2690 v3 (Haswell) : 12 cores per socket (24 cores/node), 2.6 GHz
- 64 GB DDR4-2133 (8 x 8GB dual rank x8 DIMMs)
- No local disk
- Hyperthreading Enabled - 48 threads (logical CPUs) per node
so it seems as though there should be enough memory for this operation. Since the operating system is Linux, I expected R to be able to use all of the node's available memory (unlike on Windows, where there is a memory limit). Checking the process limits with ulimit returns "unlimited." I am not sure where the problem lies. My script is a loop that iterates over all the cancers available on TCGA, in case that makes a difference.
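In case it helps with diagnosis, something like the following (a sketch only, assuming the nodes expose /proc/meminfo, which they should on Linux) could log how much memory the OS actually reports as available right before the preprocessing step:

# Sketch: read MemAvailable from /proc/meminfo (reported in kB) and convert to GB,
# to log how much memory the OS says is free right before the heavy step.
mem_available_gb <- function() {
  meminfo <- readLines("/proc/meminfo")
  avail   <- grep("^MemAvailable:", meminfo, value = TRUE)
  as.numeric(gsub("[^0-9]", "", avail)) / 1024^2  # kB -> GB
}
cat(sprintf("MemAvailable before preprocessing: %.1f GB\n", mem_available_gb()))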
sessionInfo():
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Matrix products: default
BLAS/LAPACK: /opt/apps/intel/16.0.1.150/compilers_and_libraries_2016.1.150/linux/mkl/lib/intel64_lin/libmkl_intel_lp64.so
attached base packages: stats, graphics, grDevices, utils, datasets, methods, base
loaded via a namespace (and not attached): compiler_3.4.0
Code (inside for() loop):
# Download methylation data.
METdirectories <- tryCatch(
  {
    Download_DNAmethylation(i, paste0(targetDirectory, "/Methylation/"))
  }, warning = function(w) {
    # For warnings, write them to the output file.
    cat(paste("Warning in cancer", i, "when downloading methylation data:", conditionMessage(w)),
        file = logfile, append = TRUE, sep = "\n")
  }, error = function(e) {
    # For errors, write them to the output file and then skip to the next cancer.
    cat(paste("Error in cancer", i, "when downloading methylation data:", conditionMessage(e)),
        file = logfile, append = TRUE, sep = "\n")
    return(NULL)
  }, finally = {
    # If everything went all right, make a note of that in the output file.
    cat(paste("Successfully downloaded methylation data for cancer", i),
        file = logfile, append = TRUE, sep = "\n")
  }
)
if (is.null(METdirectories)) next

# Process methylation data.
METProcessedData <- tryCatch(
  {
    Preprocess_DNAmethylation(i, METdirectories)
  }, warning = function(w) {
    # For warnings, write them to the output file.
    cat(paste("Warning in cancer", i, "when processing methylation data:", conditionMessage(w)),
        file = logfile, append = TRUE, sep = "\n")
  }, error = function(e) {
    # For errors, write them to the output file and then skip to the next cancer.
    cat(paste("Error in cancer", i, "when processing methylation data:", conditionMessage(e)),
        file = logfile, append = TRUE, sep = "\n")
    return(NULL)
  }, finally = {
    # If everything went all right, make a note of that in the output file.
    cat(paste("Successfully processed methylation data for cancer", i),
        file = logfile, append = TRUE, sep = "\n")
  }
)
if (is.null(METProcessedData)) next

# Save methylation processed data.
saveRDS(METProcessedData,
        file = paste0(targetDirectory, "/Methylation/", "MET_", i, "_Processed.rds"))
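One change I am considering (a sketch only, not something MethylMix requires) is to explicitly drop the large object and force a garbage collection at the end of each iteration, so that memory used for one cancer type is released before the next one is processed:

# Sketch: after saving, drop the processed object and force a garbage
# collection so its memory is returned before the next iteration of the loop.
rm(METProcessedData)
invisible(gc())  # gc() also returns a summary matrix of memory use; discarded here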
Out of curiosity I thought I'd run this on a machine with 1 TB of RAM to see how much memory is actually required. My first comment is that I can see why you'd want to do this on a cluster: after 24 hours I've only managed to process about 10% of the breast cancer data set. More pertinently, here is the last line of the processing output, followed by a call to gc() after I stopped the process:

[processing output and gc() table omitted]

You can see that the maximum amount of memory used is ~80 GB, which is why you're running out of room on your 64 GB cluster nodes.
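In case it is useful, here is a minimal illustration (not the actual numbers from this run) of where that peak figure comes from in the gc() summary:

# gc() returns a matrix with one row for cons cells (Ncells) and one for the
# vector heap (Vcells); the last column is the maximum space used so far, in Mb.
# Summing the two rows gives the peak footprint of the R process's heap.
g <- gc()
peak_mb <- sum(g[, ncol(g)])
cat(sprintf("Peak R heap usage so far: %.0f Mb\n", peak_mb))
# gc(reset = TRUE) resets these maxima, which is handy for measuring one step at a time.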