Dear list,
I have a question about processing large BAM files (for instance,
reading them in via readGAlignments or computing the coverage).
I know about the option of iterative processing, as shown in the
example below.
library(GenomicAlignments)   # also attaches Rsamtools for BamFile()

## Open the file so that each read pulls in at most 2e6 records
mybam <- open(BamFile(bamfile, yieldSize = 2000000))
gAln  <- GAlignments()
while (length(chunk <- readGAlignments(mybam))) {
    gAln <- c(gAln, chunk)   # append the current chunk
}
close(mybam)
Obviously, the efficiency of iterating depends on (i) the file size of
the BAM file and (ii) the available memory.
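For reference, both quantities are easy to query up front; a minimal
sketch (countBam comes from Rsamtools, and bamfile is the same path as
above):

library(Rsamtools)

bam.bytes   <- file.info(bamfile)$size     # compressed size on disk, in bytes
bam.records <- countBam(bamfile)$records   # total number of alignment records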
Can I somehow pinpoint (e.g. from the file size, the number of
alignments, or the memory requirements) when it is more efficient
(i.e. faster, with feasible memory requirements) to process the BAM
file in one batch rather than iteratively?
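To make the question concrete, the kind of heuristic I have in mind
would extrapolate the in-memory cost of one chunk to the whole file;
in the sketch below, the 50% safety margin and the available-memory
figure are arbitrary placeholders of mine:

library(GenomicAlignments)

## Read a single chunk and extrapolate its footprint to the whole file
mybam <- open(BamFile(bamfile, yieldSize = 1000000))
chunk <- readGAlignments(mybam)
close(mybam)

bytes.per.record <- as.numeric(object.size(chunk)) / length(chunk)
est.total.bytes  <- bytes.per.record * countBam(bamfile)$records

## Placeholder decision rule: one batch only if the estimate fits comfortably
mem.available <- 8 * 1024^3                  # e.g. 8 GB; placeholder value
one.batch.ok  <- est.total.bytes < 0.5 * mem.available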
Best,
Stefanie