Is scanBam on a BamSampler supposed to be fast?
Johannes Rainer ★ 2.1k
@johannes-rainer-6987

Dear all,

I'm trying to get random samples from a BAM file using scanBam on a BamSampler instance and was hoping that this would return results quickly. However, I have now been waiting for several hours for the call (see below) to finish.

BSampler <- BamSampler( bamfile, index=paste0( bamfile, ".bai" ), yieldSize=10, obeyQname=TRUE )
system.time(
    BamSample <- scanBam( BSampler, param=ScanBamParam( what=scanBamWhat(), tag=c( "NH", "HI" ) ) )
    )

With bamfile being the file name of the BAM file, which is about 10 GB in size.

Is there a faster way to fetch random entries from a large BAM file?

 

thanks in advance!

cheers, jo

My sessionInfo():

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin14.0.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
[1] Rsamtools_1.18.2     Biostrings_2.34.1    XVector_0.6.0       
[4] GenomicRanges_1.18.4 GenomeInfoDb_1.2.4   IRanges_2.0.1       
[7] S4Vectors_0.4.0      BiocGenerics_0.12.1

loaded via a namespace (and not attached):
[1] bitops_1.0-6    zlibbioc_1.12.0

 

 

@martin-morgan-1513

Unfortunately BamSampler visits the entire file (so does samtools view -s!), so it will be expensive in terms of time; still, 'hours' seems excessive.

With yieldSize=10 you're asking R to read 10 records at a time; yieldSize should probably be 1000000 or more, enough to fit easily into memory. If that is more than the number of reads you're interested in, sample the result.
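
For instance, something along these lines (an untested sketch; the what= fields and the final sample size of 1000 are just placeholders) draws a 1-million-record sample and then subsamples it:

BSampler <- BamSampler(bamfile, index=paste0(bamfile, ".bai"), yieldSize=1000000)
res <- scanBam(BSampler,
               param=ScanBamParam(what=c("qwidth", "mapq"), tag=c("NH", "HI")))[[1]]
## subsample the returned records down to the number actually needed
keep <- sample(length(res$qwidth), 1000)
res[c("qwidth", "mapq")] <- lapply(res[c("qwidth", "mapq")], `[`, keep)
res$tag <- lapply(res$tag, `[`, keep)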

By specifying what=scanBamWhat() you are asking for all the data; it would be more efficient to ask for just the information that is relevant. Maybe it's more useful to start with GenomicAlignments::readGAlignments(), or to import only the fields that you are interested in. EDIT: actually, it seems like the what=scanBamWhat() approach is ~10% faster than readGAlignments().
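
For example (again an untested sketch, with the fields chosen only for illustration), restricting the import to two fields plus the NH/HI tags would look like:

library(GenomicAlignments)
## e.g., on the BSampler above (yieldSize=1000000); tags requested via
## 'tag=' end up as metadata columns of the returned GAlignments
param <- ScanBamParam(what=c("qwidth", "mapq"), tag=c("NH", "HI"))
gal <- readGAlignments(BSampler, param=param)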

Using yieldSize=1000000 and readGAlignments() on a ~2 GB file takes about 70s for me, about 2x longer than samtools view -s (and without creating yet another BAM file!).

I do not know of another way to draw a random sample from the file. I think biovizBase uses a trick where it uses the index to draw a representative selection of reads; I'm not sure how it's implemented, but the code is, I think, in the estimateCoverage,BamFile-method.

BamSampler() was written before samtools view -s and before paired-end reads were common; it doesn't handle paired-end reads. Here's a different, more general solution. The GenomicFiles::reduceByYield() function can iterate through a file (e.g., a BAM file), using an arbitrary function to YIELD data from the file and reducing successive chunks with an arbitrary REDUCE function. Use X=BamFile(), YIELD=readGAlignmentsList, MAP=identity, and the following REDUCE function to sample from successive chunks of aligned read pairs:

REDUCEsampler <-
    function(sampleSize=1000000, verbose=FALSE)
{
    ## running total of records seen so far, shared across calls
    tot <- 0L
    function(x, y, ...) {
        if (length(x) < sampleSize)
            stop("expected yield of at least sampleSize=", sampleSize)

        if (tot == 0L) {
            ## first time through: seed the reservoir with a random
            ## sample of size sampleSize from the first chunk
            tot <<- length(x)
            x <- x[sample(length(x), sampleSize)]
        }
        yld_n <- length(y)
        tot <<- tot + yld_n

        if (verbose)
            message("REDUCEsampler total=", tot)

        ## replace a binomially distributed number of reservoir elements
        ## with elements drawn from the new chunk, so each record seen so
        ## far has (approximately) the same chance of being in the sample
        keep <- rbinom(1L, min(sampleSize, yld_n), yld_n / tot)
        i <- sample(sampleSize, keep)
        j <- sample(yld_n, keep)
        x[i] <- y[j]
        x
    }
}

You could then draw the sample with:

bf <- BamFile("...", yieldSize=1000000)
yield <- function(x)
    readGAlignmentsList(x, param=ScanBamParam(what=c("qwidth", "mapq"),
                                              tag=c("NH", "HI")))
map <- identity
reduceByYield(bf, yield, map, REDUCEsampler(1000, TRUE))

 


Your suggestion worked well; with yieldSize=1e+6 and BamSample <- scanBam(BSampler, param=ScanBamParam(what=c("qwidth", "mapq"), tag=c("NH", "HI"))) it took about 4 minutes.

How does BamSampler handle read pairs? According to the samtools documentation, the -s option should maintain read pairs in the output.


I updated my reply with suggestions for sampling from paired-end files.

Johannes Rainer ★ 2.1k
@johannes-rainer-6987

Actually, looking at the source code of scanBam for BamSampler, it seems the function is indeed reading through the full file, which explains why it takes ages on my BAM file.

So I guess I'll use the

samtools view -s

option to subsample the large BAM file into a smaller one.

 

cheers, jo


Sampling 1000000 records with samtools view -s took about 35s for the same 2 GB file that took 70s using BamSampler; timing for both samtools and BamSampler should be linear in file size.
