Main Question
What computational setups do you use for running EPIC array preprocessing pipelines?
Specifically:
- Where do you run your pipeline?
  - e.g. personal PC, local network, HPC with a scheduler like SLURM, etc.
- What configuration do you use?
  - Nodes, cores, CPU, memory per CPU (or equivalent)
- What kinds of projects do you work on?
  - Number of samples, common tasks, significant bottlenecks worth mentioning.
Some background
I'm preprocessing EPIC array data from 1800 samples using RStudio. I'm running the pipeline on Amazon Web Services (AWS) through a service called Ronin, which lets me customize my machine by processor type, number of virtual CPUs, and total memory.
Nevertheless, even with a large compute-optimized machine (32 CPUs with 256 GiB memory), loading the .idat files, creating PC scores, normalizing, etc. is still very slow. Part of this seems to be because some packages (e.g. minfi) don't take advantage of parallelization, so some steps run on only one CPU. I feel like there has to be a better way.
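For context, the workaround I've been experimenting with is splitting the sample sheet into chunks and reading the IDATs in parallel with BiocParallel, then stitching the chunks back together before normalization and PCA. This is only a sketch, not a tested recipe at 1800 samples; the path, chunk size, and worker count are placeholders, and memory is still the limiting factor since each worker holds a full RGChannelSet:

```r
library(minfi)
library(BiocParallel)
library(matrixStats)

## Sample sheet for all arrays (path is a placeholder)
targets <- read.metharray.sheet("/path/to/idat_dir")

## Split the sheet so each worker reads ~100 arrays
chunks <- split(targets, ceiling(seq_len(nrow(targets)) / 100))

## Read chunks in parallel; tune workers to the machine and available memory
param <- MulticoreParam(workers = 8)
rg_list <- bplapply(chunks, function(chunk) {
  read.metharray.exp(targets = chunk)
}, BPPARAM = param)

## RGChannelSet extends SummarizedExperiment, so cbind recombines the chunks
## (minfi::combineArrays() is the alternative if chunks differ in array type)
rgSet <- do.call(cbind, rg_list)

## Noob background/dye-bias correction, then betas
mSet <- preprocessNoob(rgSet)
beta <- getBeta(mSet)

## PC scores on the most variable probes only, rather than all ~850k CpGs
top <- order(rowVars(beta), decreasing = TRUE)[1:20000]
pcs <- prcomp(t(beta[top, ]), center = TRUE, scale. = FALSE)$x
```

This at least keeps the IDAT-reading step busy on more than one CPU, but the single-threaded normalization steps are still the bottleneck.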
I'm looking to improve my pipeline's run times and avoid crashes/hangups, and I thought this community might have thoughts on this. I've looked into the "big" versions of common packages (bigmelon, meffil), but I don't know anyone who has used them.