I am attempting to randomly sample a BAM file as efficiently as possible. I understand this can be done using GenomicFiles::reduceByYield() with its REDUCEsampler(), and I have done so:
library(Rsamtools)          ## BamFile()
library(GenomicAlignments)  ## readGAlignments()
library(GenomicFiles)       ## reduceByYield(), REDUCEsampler()

individual_subsampling <- function(filename, sample_size) {
    bf <- BamFile(filename, yieldSize = 1000000)
    ## read the BAM in chunks of 1e6 records
    yield <- function(x) readGAlignments(x)
    map <- identity
    ## REDUCEsampler() keeps a running reservoir sample of sample_size
    ## reads across all yields (second argument is verbose = FALSE)
    gr <- GRanges(reduceByYield(bf, yield, map, REDUCEsampler(sample_size, FALSE)))
    return(gr)
}
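For reference, I call it along these lines (the file name here is just a placeholder):

## hypothetical example call: reservoir-sample 2 million reads
gr <- individual_subsampling("reads.bam", sample_size = 2 * 10^6)
length(gr)  ## ~2e6 ranges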
My end goal, however, is to obtain countOverlaps() information for only ~1/20th of the sample. That is, if the BAM file consists of 34 million reads, my goal is to randomly sample 2 million of them and count where they overlap with a supplied GRanges object. While I can run reduceByYield() as shown above, it pulls out far more information than I need and in turn takes a decent amount of time. I would much rather supply countOverlaps() as the map function; however, that leaves me unable to apply the REDUCEsampler option. The code below does work, but it has to count all of the reads:
counting_overlaps <- function(filename) {
    bf <- BamFile(filename, yieldSize = 1000000)
    yield <- function(x) readGAlignments(x)
    map <- function(x) {
        ## overlap_regions_generator() is my own helper that just defines my bins
        gr_overlapRegions <- overlap_regions_generator("GR", "ALL")
        gr <- GRanges(x)
        countOverlaps(gr_overlapRegions, gr)
    }
    ## each chunk's map() result is a per-bin count vector; sum them
    reduce <- `+`
    final_data <- reduceByYield(bf, yield, map, reduce)
    return(final_data)
}
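To spell out the reduce step: every map() call returns an integer vector of the same length, one count per bin, so `+` can fold successive chunk results together element-wise, e.g.:

## illustrative only: two chunks' per-bin counts being reduced
chunk1 <- c(3L, 0L, 7L)
chunk2 <- c(1L, 2L, 5L)
`+`(chunk1, chunk2)  ## 4 2 12, the running total across chunks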
To put this in perspective: applied to a BAM file with 34 million reads and sample_size = 1*10^6, individual_subsampling() took an elapsed time of 93 seconds; applied to the same original BAM file, counting_overlaps() took 71 seconds. I would therefore love to apply something like counting_overlaps(), but subsample simultaneously to make the process more efficient. Is there any way to do this?
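Roughly, what I have in mind is something like the sketch below, which thins each chunk inside map before counting. I'm not sure it's the right way to do it, though: the thinning is per chunk rather than an exact global reservoir sample like REDUCEsampler's, and fraction, like the function name, is just my own placeholder.

subsampled_counting_overlaps <- function(filename, fraction = 1/20) {
    bf <- BamFile(filename, yieldSize = 1000000)
    ## keep yield as the plain reader so reduceByYield() still detects end-of-file
    yield <- function(x) readGAlignments(x)
    map <- function(x) {
        ## randomly keep ~fraction of this chunk's reads before counting
        keep <- sample(length(x), size = round(fraction * length(x)))
        gr_overlapRegions <- overlap_regions_generator("GR", "ALL")  ## my bins, as above
        countOverlaps(gr_overlapRegions, GRanges(x[keep]))
    }
    reduceByYield(bf, yield, map, `+`)
}

Would something like this give statistically valid counts, or is there a cleaner way to combine REDUCEsampler-style sampling with a counting map function?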
Any help or pointers would be fantastic. Thanks!