Unfortunately BamSampler visits the entire file (so does samtools view -s!), so it will be expensive in terms of time; still, 'hours' seems excessive.
When you indicate yieldSize=10, you're asking R to read 10 records at a time; yieldSize should probably be 1000000 or more, enough to fit comfortably into memory. If that is more than the number of reads you're interested in, just sample the result.
By saying what=scanBamWhat() you are asking for all the data, but it would be more efficient to ask for just the information that is relevant; it may be more useful to start with GenomicAlignments::readGAlignments(), or to import only the fields that you are interested in. EDIT: actually, it seems like the what=scanBamWhat() approach is ~10% faster than readGAlignments().
Using yieldSize=1000000 and readGAlignments() on a ~2Gb file takes about 70s for me, about 2x longer than samtools view -s (and without creating yet another BAM file!).
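For concreteness, here's a minimal sketch of that approach; the file name "example.bam", the 10000-read target, and the particular fields are placeholders for illustration, not part of the original post:

library(Rsamtools)
library(GenomicAlignments)

## open the file so that each read returns up to 1e6 records
bf <- BamFile("example.bam", yieldSize=1000000)
open(bf)
## restrict the fields to what is actually needed, rather than scanBamWhat()
param <- ScanBamParam(what=c("qwidth", "mapq"))
gal <- readGAlignments(bf, param=param)    # first chunk of up to 1e6 records
close(bf)
## if the chunk holds more reads than wanted, sample from it
sampled <- gal[sample(length(gal), min(10000, length(gal)))]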
I do not know of another way to draw a random sample from the file; I think biovizBase uses a trick where it uses the index to draw a representative selection of reads. I'm not sure how it's implemented, but the code is, I think, in the estimateCoverage,BamFile-method.
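For illustration only (this is an assumption about the general idea, not how biovizBase actually implements it), the BAM index does expose per-chromosome mapped-read counts via Rsamtools::idxstatsBam(), which is the kind of information such a representative-sampling scheme could use:

library(Rsamtools)

## read counts from the .bai index without touching the alignments themselves;
## "example.bam" is a placeholder file name
stats <- idxstatsBam("example.bam")
## fraction of mapped reads on each chromosome, one possible basis for deciding
## how many reads a representative sample should draw from each region
stats$prop <- stats$mapped / sum(stats$mapped)
head(stats[, c("seqnames", "mapped", "prop")])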
BamSampler() was written before samtools view -s existed and before paired-end reads were common; it doesn't handle paired-end reads. Here's a different, more general solution. The GenomicFiles::reduceByYield() function can iterate through a file (e.g., a BAM file), using an arbitrary YIELD function to extract data from the file and reducing successive chunks with an arbitrary REDUCE function. Use X=BamFile(), YIELD=readGAlignmentsList, MAP=identity, and the following REDUCE function to sample from successive chunks of aligned read pairs:
REDUCEsampler <-
    function(sampleSize=1000000, verbose=FALSE)
{
    ## running total of records seen so far, carried across calls via the closure
    tot <- 0L
    function(x, y, ...) {
        if (length(x) < sampleSize)
            stop("expected yield of at least sampleSize=", sampleSize)
        if (tot == 0L) {
            ## first time through: seed the reservoir from the first chunk
            tot <<- length(x)
            x <- x[sample(length(x), sampleSize)]
        }
        yld_n <- length(y)
        tot <<- tot + yld_n
        if (verbose)
            message("REDUCEsampler total=", tot)
        ## decide how many records from the current chunk displace reservoir
        ## entries, then swap them in at random positions
        keep <- rbinom(1L, min(sampleSize, yld_n), yld_n / tot)
        i <- sample(sampleSize, keep)
        j <- sample(yld_n, keep)
        x[i] <- y[j]
        x
    }
}
You could draw the sample with
library(Rsamtools)
library(GenomicAlignments)
library(GenomicFiles)

bf <- BamFile("...", yieldSize=1000000)
yield <- function(x)
    readGAlignmentsList(x, param=ScanBamParam(what=c("qwidth", "mapq"),
                                              tag=c("NH", "HI")))
map <- identity
reduceByYield(bf, yield, map, REDUCEsampler(1000, TRUE))
Your suggestion worked well; with
yieldSize=1e+6
and
BamSample <- scanBam(BSampler, param=ScanBamParam(what=c("qwidth", "mapq"), tag=c("NH", "HI")))
it took about 4 minutes. How does BamSampler handle read pairs? According to the samtools documentation, the -s option should maintain read pairs in the output.
I updated my reply with suggestions for sampling from paired-end files.