Unfortunately BamSampler visits the entire file (so does samtools view -s!), so it will be expensive in terms of time; still, 'hours' seems excessive.
When you indicate yieldSize=10, you're asking R to read 10 records at a time; yieldSize should probably be 1000000 or more, enough to fit comfortably into memory. If that is more than the number of reads you're interested in, just sample the result.
By saying what=scanBamWhat() you are asking for all the data, but it would be more efficient to ask for just the information that is relevant; it may be more useful to start with GenomicAlignments::readGAlignments(), or to import only the fields that you are interested in. EDIT: actually, it seems like the what=scanBamWhat() approach is ~10% faster than readGAlignments().
Using yieldSize=1000000 and readGAlignments() on a ~2Gb file takes about 70s for me, about 2x longer than samtools view -s (and without creating yet another BAM file!).
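For concreteness, here's a minimal sketch of that approach; the file name "example.bam", the 10000-read target, and the particular fields are placeholders for illustration, not part of the original post:

library(Rsamtools)
library(GenomicAlignments)

## open the file so that each read returns up to 1e6 records
bf <- BamFile("example.bam", yieldSize=1000000)
open(bf)
## restrict the fields to what is actually needed, rather than scanBamWhat()
param <- ScanBamParam(what=c("qwidth", "mapq"))
gal <- readGAlignments(bf, param=param)    # first chunk of up to 1e6 records
close(bf)
## if the chunk holds more reads than wanted, sample from it
sampled <- gal[sample(length(gal), min(10000, length(gal)))]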
I do not know of another way to draw a random sample from the file; I think biovizBase uses a trick where it uses the index to draw a representative selection of reads. I'm not sure how it's implemented, but the code is, I think, in the estimateCoverage,BamFile-method.
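For illustration only (this is an assumption about the general idea, not how biovizBase actually implements it), the BAM index does expose per-chromosome mapped-read counts via Rsamtools::idxstatsBam(), which is the kind of information such a representative-sampling scheme could use:

library(Rsamtools)

## read counts from the .bai index without touching the alignments themselves;
## "example.bam" is a placeholder file name
stats <- idxstatsBam("example.bam")
## fraction of mapped reads on each chromosome, one possible basis for deciding
## how many reads a representative sample should draw from each region
stats$prop <- stats$mapped / sum(stats$mapped)
head(stats[, c("seqnames", "mapped", "prop")])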
BamSampler() was written before samtools view -s existed and before paired-end reads were common; it doesn't handle paired-end reads. Here's a different, more general solution. The GenomicFiles::reduceByYield() function can iterate through a file (e.g., a BAM file), using an arbitrary YIELD function to extract data from the file and reducing successive chunks with an arbitrary REDUCE function. Use X=BamFile(), YIELD=readGAlignmentsList, MAP=identity, and the following REDUCE function to sample from successive chunks of aligned read pairs:
REDUCEsampler <-
    function(sampleSize=1000000, verbose=FALSE)
{
    ## running total of records seen so far, carried across calls via the closure
    tot <- 0L
    function(x, y, ...) {
        if (length(x) < sampleSize)
            stop("expected yield of at least sampleSize=", sampleSize)
        if (tot == 0L) {
            ## first time through: seed the reservoir from the first chunk
            tot <<- length(x)
            x <- x[sample(length(x), sampleSize)]
        }
        yld_n <- length(y)
        tot <<- tot + yld_n
        if (verbose)
            message("REDUCEsampler total=", tot)
        ## decide how many records from the current chunk displace reservoir
        ## entries, then swap them in at random positions
        keep <- rbinom(1L, min(sampleSize, yld_n), yld_n / tot)
        i <- sample(sampleSize, keep)
        j <- sample(yld_n, keep)
        x[i] <- y[j]
        x
    }
}
You could draw the sample with
library(Rsamtools)
library(GenomicAlignments)
library(GenomicFiles)

bf <- BamFile("...", yieldSize=1000000)
yield <- function(x)
    readGAlignmentsList(x, param=ScanBamParam(what=c("qwidth", "mapq"),
                                              tag=c("NH", "HI")))
map <- identity
reduceByYield(bf, yield, map, REDUCEsampler(1000, TRUE))
Your suggestion worked well; with
yieldSize=1e+6
and
BamSample <- scanBam(BSampler, param=ScanBamParam(what=c("qwidth", "mapq"), tag=c("NH", "HI")))
it took about 4 minutes. How does BamSampler handle read pairs? According to the samtools documentation, the -s option should maintain read pairs in the output.
I updated my reply with suggestions for sampling from paired-end files.