Varied performance of readGappedAlignmentPairs function
1
0
Entering edit mode
Pan Du ▴ 10
@pan-du-5978
Last seen 10.1 years ago
Hi Herve I am using readGappedAlignmentPairs to retrieve the paired read information from BAM files. I found the performance of the function changed a lot from sample to sample, and it seems not related with the size of BAM files. Also the increase of running time is not linear with the increase of transcript number, from which we want to retrieve information. Michael suggests it may be related with how readGappedAlignmentPairs function responds to weird alignments. I want to know whether you also observed similar problems and whether there is a solution for it. Thanks! Pan
• 544 views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 1 hour ago
Seattle, WA, United States
Hi Pan, Sorry for the delay in answering this. On 06/05/2013 11:46 PM, Pan Du wrote: > Hi Herve > > I am using readGappedAlignmentPairs to retrieve the paired read information from BAM files. I found the performance of the function changed a lot from sample to sample, and it seems not related with the size of BAM files. > Also the increase of running time is not linear with the increase of transcript number, from which we want to retrieve information. Michael suggests it may be related with how readGappedAlignmentPairs function responds to weird alignments. I want to know whether you also observed similar problems and whether there is a solution for it. The time of the pairing algorithm (implemented in findMateAlignment()) depends of course on the number of records in the BAM file, but also on the "average nb of records per unique QNAME". You can quickly get this value with the quickCountBam() utility: > library(RNAseqData.HNRNPC.bam.chr14) > bamfile <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1L] > quickCountBam(bamfile, mainGroupsOnly=TRUE) group | nb of | nb of | mean / max of | records | unique | records per records | in group | QNAMEs | unique QNAME All records........................ A | 800484 | 393300 | 2.04 / 10 o template has single segment.... S | 0 | 0 | NA / NA o template has multiple segments. M | 800484 | 393300 | 2.04 / 10 - first segment.............. F | 400242 | 393300 | 1.02 / 5 - last segment............... L | 400242 | 393300 | 1.02 / 5 - other segment.............. O | 0 | 0 | NA / NA The value I'm talking about is the first value in the last column on the 3rd line (M group): 2.04. With paired-end reads, this value is always >= 2. If the aligner only reported primary alignments, it should be exactly 2. It's > 2 only if secondary alignments were also reported (up to 5 alignments per read for this BAM file). A value of 2 (i.e. only primary reads) is optimal for the pairing algorithm. A greater value, say > 3 will significantly degrade its performance. An easy way to avoid this degradation is to load only primary alignments by setting the appropriate flag in ScanBamParam(). Anyway, I just spent some time optimizing findMateAlignment() a little bit and the timings are slightly better now (2x to 5x faster or more, depending on the size of the file): nb of alignments | time | required memory -----------------+--------------+---------------- 8 millions | 32 sec | 1.7 GB 16 millions | 65 sec | 3.3 GB 32 millions | 2 min 25 sec | 6.5 GB 64 millions | 5 min 40 sec | 13.0 GB This is with Rsamtools 1.13.19 (BioC devel) which should become available via biocLite() in the next 36 hours or so. I should also mention that Martin and Val have plans to make readGappedAlignmentPairs() work on a BamFile object with a user-specified yieldSize. This will allow processing the file by chunk (like it's already possible with readGappedAlignments()). Cheers, H. > Thanks! > > Pan > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT
0
Entering edit mode
The main problem with the performance of this function is that it appears to have quadratic complexity. The yield-size approach would mitigate that, but I'm left wondering why it has those characteristics. Just R memory management? Michael On Fri, Jun 21, 2013 at 2:09 AM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi Pan, > > Sorry for the delay in answering this. > > > On 06/05/2013 11:46 PM, Pan Du wrote: > >> Hi Herve >> >> I am using readGappedAlignmentPairs to retrieve the paired read >> information from BAM files. I found the performance of the function changed >> a lot from sample to sample, and it seems not related with the size of BAM >> files. >> Also the increase of running time is not linear with the increase of >> transcript number, from which we want to retrieve information. Michael >> suggests it may be related with how readGappedAlignmentPairs function >> responds to weird alignments. I want to know whether you also observed >> similar problems and whether there is a solution for it. >> > > The time of the pairing algorithm (implemented in findMateAlignment()) > depends of course on the number of records in the BAM file, but also > on the "average nb of records per unique QNAME". You can quickly get > this value with the quickCountBam() utility: > > > library(RNAseqData.HNRNPC.bam.**chr14) > > bamfile <- RNAseqData.HNRNPC.bam.chr14_**BAMFILES[1L] > > quickCountBam(bamfile, mainGroupsOnly=TRUE) > group | nb of | nb of | mean / max > of | records | unique | records per > records | in group | QNAMEs | unique QNAME > All records.......................**. A | 800484 | 393300 | 2.04 / 10 > o template has single segment.... S | 0 | 0 | NA / NA > o template has multiple segments. M | 800484 | 393300 | 2.04 / 10 > - first segment.............. F | 400242 | 393300 | 1.02 / 5 > - last segment............... L | 400242 | 393300 | 1.02 / 5 > - other segment.............. O | 0 | 0 | NA / NA > > The value I'm talking about is the first value in the last column on > the 3rd line (M group): 2.04. With paired-end reads, this value is > always >= 2. If the aligner only reported primary alignments, it should > be exactly 2. It's > 2 only if secondary alignments were also reported > (up to 5 alignments per read for this BAM file). > > A value of 2 (i.e. only primary reads) is optimal for the pairing > algorithm. A greater value, say > 3 will significantly degrade its > performance. An easy way to avoid this degradation is to load only > primary alignments by setting the appropriate flag in ScanBamParam(). > > Anyway, I just spent some time optimizing findMateAlignment() a little > bit and the timings are slightly better now (2x to 5x faster or more, > depending on the size of the file): > > nb of alignments | time | required memory > -----------------+------------**--+---------------- > 8 millions | 32 sec | 1.7 GB > 16 millions | 65 sec | 3.3 GB > 32 millions | 2 min 25 sec | 6.5 GB > 64 millions | 5 min 40 sec | 13.0 GB > > This is with Rsamtools 1.13.19 (BioC devel) which should become > available via biocLite() in the next 36 hours or so. > > I should also mention that Martin and Val have plans to make > readGappedAlignmentPairs() work on a BamFile object with a > user-specified yieldSize. This will allow processing the file > by chunk (like it's already possible with readGappedAlignments()). > > Cheers, > H. > > > Thanks! >> >> Pan >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Hi Michael, Pan, On 06/21/2013 05:13 AM, Michael Lawrence wrote: > The main problem with the performance of this function is that it > appears to have quadratic complexity. The yield-size approach would > mitigate that, but I'm left wondering why it has those characteristics. > Just R memory management? > Anyway, I just spent some time optimizing findMateAlignment() a little > bit and the timings are slightly better now (2x to 5x faster or more, > depending on the size of the file): > > nb of alignments | time | required memory > -----------------+------------__--+---------------- > 8 millions | 32 sec | 1.7 GB > 16 millions | 65 sec | 3.3 GB > 32 millions | 2 min 25 sec | 6.5 GB > 64 millions | 5 min 40 sec | 13.0 GB > Doesn't look quadratic with respect to the size of the file. But it is quadratic with respect to the "average nb of records per unique QNAME". The above timings are for an GAlignments object coming from a file where this value is 2.04 (close to the optimal value of 2). If this value is 3, expect those timings to be multiplied by 3. If it's 4, by 6x. If it's 5, by 10. If it's 6, by 15. I don't think it has much to do with R memory management. It's more the way the algo works: alignments are first grouped by QNAME, then within each group, all possible pairs are examined. So if there are 6 alignments in a group, there are 15 pairs to examine. It's a very bold algo and it could certainly be improved. But as I said earlier, all this can be avoided by setting the isNotPrimaryRead flag (flag 0x100) to FALSE in ScanBamParam(). That should *in theory* bring the "average nb of records per unique QNAME" down to 2. Then the complexity is almost linear with respect to the size of the BAM file (actually it's more something like n x log(n) because of the grouping by QNAME that is based on qsort()). However I've seen some aligners not setting properly the 0x100 flag for secondary alignments so using the isNotPrimaryRead=FALSE does not always bring the "average nb of records per unique QNAME" down to 2. If you guys have BAM files where this value is 2 (or is > 2 but you use isNotPrimaryRead=FALSE) and you still observe abnormal timings (after you've upgraded to Rsamtools 1.13.19), please send them to me (or make them available somewhere) and I'll have a look at it. Thanks! H. > This is with Rsamtools 1.13.19 (BioC devel) which should become > available via biocLite() in the next 36 hours or so. > > I should also mention that Martin and Val have plans to make > readGappedAlignmentPairs() work on a BamFile object with a > user-specified yieldSize. This will allow processing the file > by chunk (like it's already possible with readGappedAlignments()). > > Cheers, > H. > > > Thanks! > > Pan > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLY

Login before adding your answer.

Traffic: 679 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6