Rsamtools/GenomicAlignments segmentation faults
1
1
Entering edit mode
Graeme Thorn ▴ 10
@graeme-thorn-8013
Last seen 4.2 years ago
Barts Cancer Institute, QMUL, London, UK

I'm trying to use Rsamtools and GenomicAlignments to read in a 45G WES bam file chromosome by chromosome and save to an R (.rds) object using the following code:

param <- ScanBamParam(flag = scanBamFlag(isDuplicate = FALSE,
                                         isSecondaryAlignment = FALSE,
                                         isUnmappedQuery = FALSE),
                      mapqFilter = 30,
                      which = GRanges(paste0("chr",i)), IRanges(1,3e8))
galp.file <- file.path(galpdir, paste0(sample, "_chr", i,".rds"))

galp <- readGAlignmentPairs(bamfile, param = param)
saveRDS(galp, galp.file)

but the script keeps failing (I'm running on a cluster) with a segmentation fault. I have previously tried to read the whole thing in at once, and this required 256GB of memory to do so, so I decided to split it and read chromosome by chromosome (hence the above code).

However this is still causing an issue with memory use (it needs more than 32GB of memory just to read in one chromosome).

I am invoking the script using Rscript <script-name>.R --args <args> from a bash job submission script.

Is there a better way of doing the above which doesn't need quite so much memory?

Rsamtools GenomicAlignments • 1.1k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 15 hours ago
United States

EDIT oops, got my questions confused, but the answer is almost the same whether working with large VCF or large BAM files. If you are processing the entire BAM file, do it in chunks along the lines of

file <- system.file("extdata", "ex1.bam", package="Rsamtools")
bamfile = BamFile(file, yieldSize = 1000)
open(bamfile)
repeat {
    bam <- readGAlignments(bamfile)
    if (length(bam) == 0) break
    message("tick")
}
close(bamfile)

with the 'work' being done where message("tick") is. yieldSize can probably be in the 100,000's, and readGAlignments() can be replaced by readGAlignmentPairs().

It might pay, if there are many variants that you will filter-out, to pre-filter the files using filterBam(). I'm a little rusty on things, but one problem with reading paired alignments is that one has to retain unpaired reads until their mate(s) are encountered in the file. So I think you can make progress by ensuring that only proper pairs are in the filtered output. If you provide a more completely reproducible example (e.g., indicating a specific file that you're trying to process) I might be able to provide additional help.

ADD COMMENT
0
Entering edit mode

I need to save all the pairs to a file for a specific chromosome (originally for the whole genome, but chr-by-chr would be better and give me an opportunity to checkpoint it), so would catenating the outputs that come from each chunk be fine?

ADD REPLY

Login before adding your answer.

Traffic: 615 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6