I'm trying to use Rsamtools and GenomicAlignments to read a 45 GB WES BAM file chromosome by chromosome and save each chromosome to an R (.rds) object, using the following code:
param <- ScanBamParam(flag = scanBamFlag(isDuplicate = FALSE,
                                         isSecondaryAlignment = FALSE,
                                         isUnmappedQuery = FALSE),
                      mapqFilter = 30,
                      which = GRanges(paste0("chr", i), IRanges(1, 3e8)))
galp.file <- file.path(galpdir, paste0(sample, "_chr", i, ".rds"))
galp <- readGAlignmentPairs(bamfile, param = param)
saveRDS(galp, galp.file)
but the script keeps failing with a segmentation fault (I'm running it on a cluster). I previously tried to read the whole file in at once, which required 256 GB of memory, so I decided to split the work and read chromosome by chromosome (hence the code above).
However, this still uses a surprising amount of memory: more than 32 GB just to read in a single chromosome.
I am invoking the script using Rscript <script-name>.R --args <args>
from a bash job submission script.
Is there a better way of doing this that doesn't need quite so much memory?
I need to save all the pairs for a given chromosome to a file (originally for the whole genome, but chromosome-by-chromosome is better and gives me a chance to checkpoint), so if I read in smaller chunks, would concatenating the outputs from each chunk be fine?
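For concreteness, the chunked reading I have in mind is something like the following sketch. This is hypothetical: it assumes that `readGAlignmentPairs()` honours a `yieldSize` set on an open `BamFile`, that `yieldSize` iteration cannot be combined with a range-based `which` (so the filters are kept but `which` is dropped), and that `GAlignmentPairs` chunks can be combined with `c()`. `bamfile` and `galp.file` are the same variables as above.

```r
library(Rsamtools)
library(GenomicAlignments)

## Same filters as above, but without `which`, since (as I understand it)
## yieldSize-based iteration does not combine with range restriction.
param <- ScanBamParam(flag = scanBamFlag(isDuplicate = FALSE,
                                         isSecondaryAlignment = FALSE,
                                         isUnmappedQuery = FALSE),
                      mapqFilter = 30)

bf <- BamFile(bamfile, yieldSize = 1e6)  # ~1M records per chunk
open(bf)
chunks <- list()
repeat {
    chunk <- readGAlignmentPairs(bf, param = param)
    if (length(chunk) == 0L) break       # end of file reached
    chunks[[length(chunks) + 1L]] <- chunk
}
close(bf)

galp <- do.call(c, chunks)  # concatenate the chunks back into one object
saveRDS(galp, galp.file)
```

If concatenating like this is valid, I could also save each chunk to its own .rds file as a checkpoint and combine them at the end instead of holding everything in memory.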