I am attempting to randomly sample a BAM file as efficiently as possible. I understand this can be done using GenomicFiles::reduceByYield() with its REDUCEsampler(), and I have done so:
library(Rsamtools)          ## BamFile()
library(GenomicAlignments)  ## readGAlignments()
library(GenomicFiles)       ## reduceByYield(), REDUCEsampler()

individual_subsampling <- function(filename, sample_size) {
    bf <- BamFile(filename, yieldSize = 1000000)
    ## read the BAM in chunks of 1e6 records
    yield <- function(x) readGAlignments(x)
    map <- identity
    ## REDUCEsampler() keeps a running reservoir sample of sample_size
    ## reads across all yields (second argument is verbose = FALSE)
    gr <- GRanges(reduceByYield(bf, yield, map, REDUCEsampler(sample_size, FALSE)))
    return(gr)
}
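For reference, I call it along these lines (the file name here is just a placeholder):

## hypothetical example call: reservoir-sample 2 million reads
gr <- individual_subsampling("reads.bam", sample_size = 2 * 10^6)
length(gr)  ## ~2e6 ranges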
My end goal, however, is to obtain countOverlaps() information for only ~1/20th of the sample. That is, if the BAM file consists of 34 million reads, my goal is to randomly sample 2 million of them and count where they overlap with a supplied GRanges object. While I can run reduceByYield() as shown above, it pulls out far more information than I need and in turn takes a decent amount of time. I would much rather supply countOverlaps() as the map function; however, that leaves me unable to apply the REDUCEsampler option. The code below does work, but it has to count all of the reads:
counting_overlaps <- function(filename) {
    bf <- BamFile(filename, yieldSize = 1000000)
    yield <- function(x) readGAlignments(x)
    map <- function(x) {
        ## overlap_regions_generator() is my own helper that just defines my bins
        gr_overlapRegions <- overlap_regions_generator("GR", "ALL")
        gr <- GRanges(x)
        countOverlaps(gr_overlapRegions, gr)
    }
    ## each chunk's map() result is a per-bin count vector; sum them
    reduce <- `+`
    final_data <- reduceByYield(bf, yield, map, reduce)
    return(final_data)
}
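To spell out the reduce step: every map() call returns an integer vector of the same length, one count per bin, so `+` can fold successive chunk results together element-wise, e.g.:

## illustrative only: two chunks' per-bin counts being reduced
chunk1 <- c(3L, 0L, 7L)
chunk2 <- c(1L, 2L, 5L)
`+`(chunk1, chunk2)  ## 4 2 12, the running total across chunks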
To put this in perspective: applied to a BAM file with 34 million reads and sample_size = 1*10^6, individual_subsampling() took an elapsed time of 93 seconds; applied to the same original BAM file, counting_overlaps() took 71 seconds. I would therefore love to apply something like counting_overlaps(), but subsample simultaneously to make the process more efficient. Is there any way to do this?
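Roughly, what I have in mind is something like the sketch below, which thins each chunk inside map before counting. I'm not sure it's the right way to do it, though: the thinning is per chunk rather than an exact global reservoir sample like REDUCEsampler's, and fraction, like the function name, is just my own placeholder.

subsampled_counting_overlaps <- function(filename, fraction = 1/20) {
    bf <- BamFile(filename, yieldSize = 1000000)
    ## keep yield as the plain reader so reduceByYield() still detects end-of-file
    yield <- function(x) readGAlignments(x)
    map <- function(x) {
        ## randomly keep ~fraction of this chunk's reads before counting
        keep <- sample(length(x), size = round(fraction * length(x)))
        gr_overlapRegions <- overlap_regions_generator("GR", "ALL")  ## my bins, as above
        countOverlaps(gr_overlapRegions, GRanges(x[keep]))
    }
    reduceByYield(bf, yield, map, `+`)
}

Would something like this give statistically valid counts, or is there a cleaner way to combine REDUCEsampler-style sampling with a counting map function?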
Any help or pointers would be fantastic. Thanks!