Question

Rsamtools/GenomicAlignments segmentation faults

Graeme Thorn ▴ 10

@graeme-thorn-8013

Barts Cancer Institute, QMUL, London, UK

I'm trying to use Rsamtools and GenomicAlignments to read in a 45 GB WES BAM file chromosome by chromosome and save each chromosome to an R (.rds) object using the following code:

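library(Rsamtools)
library(GenomicAlignments)

# i, sample, galpdir and bamfile are defined earlier in the script
param <- ScanBamParam(flag = scanBamFlag(isDuplicate = FALSE,
                                         isSecondaryAlignment = FALSE,
                                         isUnmappedQuery = FALSE),
                      mapqFilter = 30,
                      # restrict reading to the whole of chromosome i
                      which = GRanges(paste0("chr", i), IRanges(1, 3e8)))
galp.file <- file.path(galpdir, paste0(sample, "_chr", i, ".rds"))

galp <- readGAlignmentPairs(bamfile, param = param)
saveRDS(galp, galp.file)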

but the script keeps failing (I'm running on a cluster) with a segmentation fault. I previously tried to read the whole file in at once, which required 256GB of memory, so I decided to split it and read chromosome by chromosome (hence the above code).

However this is still causing an issue with memory use (it needs more than 32GB of memory just to read in one chromosome).

I am invoking the script using Rscript <script-name>.R --args <args> from a bash job submission script.

Is there a better way of doing the above which doesn't need quite so much memory?

Rsamtools GenomicAlignments • 1.2k views

updated 5.1 years ago by Martin Morgan 25k • written 5.1 years ago by Graeme Thorn ▴ 10

score 0 · Answer 1 · 2020-04-07

EDIT: oops, I got my questions confused, but the answer is almost the same whether working with large VCF or large BAM files. If you are processing the entire BAM file, do it in chunks, along the lines of

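library(GenomicAlignments)

file <- system.file("extdata", "ex1.bam", package="Rsamtools")
bamfile <- BamFile(file, yieldSize = 1000)  # records returned per read
open(bamfile)
repeat {
    bam <- readGAlignments(bamfile)   # next chunk of up to yieldSize records
    if (length(bam) == 0) break       # nothing left to read
    message("tick")                   # per-chunk work goes here
}
close(bamfile)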

with the 'work' being done where message("tick") is. yieldSize can probably be in the hundreds of thousands, and readGAlignments() can be replaced by readGAlignmentPairs().
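As a rough sketch of how the loop above might look for the question's use case, the following reads pairs in chunks and saves each chunk to its own .rds file. It uses the small example BAM shipped with Rsamtools so that it runs as-is. The output directory, the chunk file names, the yieldSize value and the explicit asMates = TRUE are illustrative assumptions, not something taken from the original script.

```r
library(Rsamtools)
library(GenomicAlignments)

# Example BAM shipped with Rsamtools; substitute your own indexed BAM file.
file <- system.file("extdata", "ex1.bam", package = "Rsamtools")

# Same filters as in the question, but without 'which': the file is streamed.
param <- ScanBamParam(flag = scanBamFlag(isDuplicate = FALSE,
                                         isSecondaryAlignment = FALSE,
                                         isUnmappedQuery = FALSE),
                      mapqFilter = 30)

outdir <- tempfile("galp_chunks_")  # hypothetical output directory
dir.create(outdir)

# yieldSize is small to suit the toy file; asMates = TRUE asks Rsamtools
# to pair records as mates while reading.
bamfile <- BamFile(file, yieldSize = 1000, asMates = TRUE)
open(bamfile)
chunk <- 0L
n_pairs <- 0L
repeat {
    galp <- readGAlignmentPairs(bamfile, param = param)
    if (length(galp) == 0L) break
    chunk <- chunk + 1L
    n_pairs <- n_pairs + length(galp)
    # The 'work': here each chunk is simply saved to its own file
    # (the file name pattern is hypothetical).
    saveRDS(galp, file.path(outdir, sprintf("chunk_%04d.rds", chunk)))
}
close(bamfile)
message(chunk, " chunk(s), ", n_pairs, " pairs written to ", outdir)
```

Peak memory should then scale with yieldSize rather than with the size of the chromosome or file. Note that the loop stops at the first empty result, so it is worth checking on real data that a chunk with no complete pairs cannot end it before the end of the file.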

It might pay, if there are many alignments that you will filter out, to pre-filter the files using filterBam(). I'm a little rusty on things, but one problem with reading paired alignments is that one has to retain unpaired reads until their mate(s) are encountered in the file. So I think you can make progress by ensuring that only proper pairs are in the filtered output. If you provide a more completely reproducible example (e.g., indicating a specific file that you're trying to process) I might be able to provide additional help.
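A minimal sketch of the pre-filtering idea, again on the Rsamtools example file: filterBam() writes a new BAM containing only records that pass the ScanBamParam filters, here restricted to properly paired reads plus the filters from the question. The destination path is a temporary file chosen for illustration, and whether isProperPair is the right criterion for a given data set is an assumption to check.

```r
library(Rsamtools)

# Example BAM shipped with Rsamtools; substitute your own indexed BAM file.
file <- system.file("extdata", "ex1.bam", package = "Rsamtools")

# Keep only records flagged as properly paired, plus the question's filters.
param <- ScanBamParam(flag = scanBamFlag(isPaired = TRUE,
                                         isProperPair = TRUE,
                                         isDuplicate = FALSE,
                                         isSecondaryAlignment = FALSE,
                                         isUnmappedQuery = FALSE),
                      mapqFilter = 30)

filtered <- tempfile(fileext = ".bam")  # hypothetical destination file
filterBam(file, filtered, param = param)

# Compare record counts before and after filtering.
n_before <- countBam(file)$records
n_after  <- countBam(filtered)$records
message(n_before, " records before, ", n_after, " after filtering")
```

The filtered file can then be read in chunks with readGAlignmentPairs(). Because mapqFilter is applied per record, a pair can still lose one mate at this stage; dropping mapqFilter here and filtering on mapping quality after pairing is one alternative.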