Hi Herve
I am using readGappedAlignmentPairs to retrieve the paired read
information from BAM files. I found the performance of the function
changed a lot from sample to sample, and it seems not related with the
size of BAM files.
Also the increase of running time is not linear with the increase of
transcript number, from which we want to retrieve information. Michael
suggests it may be related with how readGappedAlignmentPairs function
responds to weird alignments. I want to know whether you also observed
similar problems and whether there is a solution for it.
Thanks!
Pan
Hi Pan,
Sorry for the delay in answering this.
On 06/05/2013 11:46 PM, Pan Du wrote:
> Hi Herve
>
> I am using readGappedAlignmentPairs to retrieve the paired read
information from BAM files. I found the performance of the function
changed a lot from sample to sample, and it seems not related with the
size of BAM files.
> Also the increase of running time is not linear with the increase
of transcript number, from which we want to retrieve information.
Michael suggests it may be related with how readGappedAlignmentPairs
function responds to weird alignments. I want to know whether you also
observed similar problems and whether there is a solution for it.
The time of the pairing algorithm (implemented in findMateAlignment())
depends of course on the number of records in the BAM file, but also
on the "average nb of records per unique QNAME". You can quickly get
this value with the quickCountBam() utility:
> library(RNAseqData.HNRNPC.bam.chr14)
> bamfile <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1L]
> quickCountBam(bamfile, mainGroupsOnly=TRUE)
group | nb of | nb of | mean /
max
of | records | unique | records
per
records | in group | QNAMEs | unique
QNAME
All records........................ A | 800484 | 393300 | 2.04 /
10
o template has single segment.... S | 0 | 0 | NA /
NA
o template has multiple segments. M | 800484 | 393300 | 2.04 /
10
- first segment.............. F | 400242 | 393300 | 1.02 /
5
- last segment............... L | 400242 | 393300 | 1.02 /
5
- other segment.............. O | 0 | 0 | NA /
NA
The value I'm talking about is the first value in the last column on
the 3rd line (M group): 2.04. With paired-end reads, this value is
always >= 2. If the aligner only reported primary alignments, it
should
be exactly 2. It's > 2 only if secondary alignments were also reported
(up to 5 alignments per read for this BAM file).
A value of 2 (i.e. only primary reads) is optimal for the pairing
algorithm. A greater value, say > 3 will significantly degrade its
performance. An easy way to avoid this degradation is to load only
primary alignments by setting the appropriate flag in ScanBamParam().
Anyway, I just spent some time optimizing findMateAlignment() a little
bit and the timings are slightly better now (2x to 5x faster or more,
depending on the size of the file):
nb of alignments | time | required memory
-----------------+--------------+----------------
8 millions | 32 sec | 1.7 GB
16 millions | 65 sec | 3.3 GB
32 millions | 2 min 25 sec | 6.5 GB
64 millions | 5 min 40 sec | 13.0 GB
This is with Rsamtools 1.13.19 (BioC devel) which should become
available via biocLite() in the next 36 hours or so.
I should also mention that Martin and Val have plans to make
readGappedAlignmentPairs() work on a BamFile object with a
user-specified yieldSize. This will allow processing the file
by chunk (like it's already possible with readGappedAlignments()).
Cheers,
H.
> Thanks!
>
> Pan
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
The main problem with the performance of this function is that it
appears
to have quadratic complexity. The yield-size approach would mitigate
that,
but I'm left wondering why it has those characteristics. Just R memory
management?
Michael
On Fri, Jun 21, 2013 at 2:09 AM, Hervé Pagès <hpages@fhcrc.org> wrote:
> Hi Pan,
>
> Sorry for the delay in answering this.
>
>
> On 06/05/2013 11:46 PM, Pan Du wrote:
>
>> Hi Herve
>>
>> I am using readGappedAlignmentPairs to retrieve the paired read
>> information from BAM files. I found the performance of the function
changed
>> a lot from sample to sample, and it seems not related with the size
of BAM
>> files.
>> Also the increase of running time is not linear with the increase
of
>> transcript number, from which we want to retrieve information.
Michael
>> suggests it may be related with how readGappedAlignmentPairs
function
>> responds to weird alignments. I want to know whether you also
observed
>> similar problems and whether there is a solution for it.
>>
>
> The time of the pairing algorithm (implemented in
findMateAlignment())
> depends of course on the number of records in the BAM file, but also
> on the "average nb of records per unique QNAME". You can quickly get
> this value with the quickCountBam() utility:
>
> > library(RNAseqData.HNRNPC.bam.**chr14)
> > bamfile <- RNAseqData.HNRNPC.bam.chr14_**BAMFILES[1L]
> > quickCountBam(bamfile, mainGroupsOnly=TRUE)
> group | nb of | nb of | mean /
max
> of | records | unique |
records per
> records | in group | QNAMEs | unique
QNAME
> All records.......................**. A | 800484 | 393300 | 2.04
/ 10
> o template has single segment.... S | 0 | 0 | NA /
NA
> o template has multiple segments. M | 800484 | 393300 | 2.04 /
10
> - first segment.............. F | 400242 | 393300 | 1.02 /
5
> - last segment............... L | 400242 | 393300 | 1.02 /
5
> - other segment.............. O | 0 | 0 | NA /
NA
>
> The value I'm talking about is the first value in the last column on
> the 3rd line (M group): 2.04. With paired-end reads, this value is
> always >= 2. If the aligner only reported primary alignments, it
should
> be exactly 2. It's > 2 only if secondary alignments were also
reported
> (up to 5 alignments per read for this BAM file).
>
> A value of 2 (i.e. only primary reads) is optimal for the pairing
> algorithm. A greater value, say > 3 will significantly degrade its
> performance. An easy way to avoid this degradation is to load only
> primary alignments by setting the appropriate flag in
ScanBamParam().
>
> Anyway, I just spent some time optimizing findMateAlignment() a
little
> bit and the timings are slightly better now (2x to 5x faster or
more,
> depending on the size of the file):
>
> nb of alignments | time | required memory
> -----------------+------------**--+----------------
> 8 millions | 32 sec | 1.7 GB
> 16 millions | 65 sec | 3.3 GB
> 32 millions | 2 min 25 sec | 6.5 GB
> 64 millions | 5 min 40 sec | 13.0 GB
>
> This is with Rsamtools 1.13.19 (BioC devel) which should become
> available via biocLite() in the next 36 hours or so.
>
> I should also mention that Martin and Val have plans to make
> readGappedAlignmentPairs() work on a BamFile object with a
> user-specified yieldSize. This will allow processing the file
> by chunk (like it's already possible with readGappedAlignments()).
>
> Cheers,
> H.
>
>
> Thanks!
>>
>> Pan
>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages@fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
> ______________________________**_________________
> Bioconductor mailing list
> Bioconductor@r-project.org
> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor="">
> Search the archives: http://news.gmane.org/gmane.**
> science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor="">
>
[[alternative HTML version deleted]]
Hi Michael, Pan,
On 06/21/2013 05:13 AM, Michael Lawrence wrote:
> The main problem with the performance of this function is that it
> appears to have quadratic complexity. The yield-size approach would
> mitigate that, but I'm left wondering why it has those
characteristics.
> Just R memory management?
> Anyway, I just spent some time optimizing findMateAlignment() a
little
> bit and the timings are slightly better now (2x to 5x faster or
more,
> depending on the size of the file):
>
> nb of alignments | time | required memory
> -----------------+------------__--+----------------
> 8 millions | 32 sec | 1.7 GB
> 16 millions | 65 sec | 3.3 GB
> 32 millions | 2 min 25 sec | 6.5 GB
> 64 millions | 5 min 40 sec | 13.0 GB
>
Doesn't look quadratic with respect to the size of the file.
But it is quadratic with respect to the "average nb of records per
unique QNAME". The above timings are for an GAlignments object coming
from a file where this value is 2.04 (close to the optimal value of
2).
If this value is 3, expect those timings to be multiplied by 3. If
it's
4, by 6x. If it's 5, by 10. If it's 6, by 15.
I don't think it has much to do with R memory management. It's more
the way the algo works: alignments are first grouped by QNAME, then
within each group, all possible pairs are examined. So if there are
6 alignments in a group, there are 15 pairs to examine. It's a very
bold algo and it could certainly be improved.
But as I said earlier, all this can be avoided by setting the
isNotPrimaryRead flag (flag 0x100) to FALSE in ScanBamParam().
That should *in theory* bring the "average nb of records per unique
QNAME" down to 2. Then the complexity is almost linear with respect
to the size of the BAM file (actually it's more something like
n x log(n) because of the grouping by QNAME that is based on qsort()).
However I've seen some aligners not setting properly the 0x100 flag
for secondary alignments so using the isNotPrimaryRead=FALSE does not
always bring the "average nb of records per unique QNAME" down to 2.
If you guys have BAM files where this value is 2 (or is > 2 but you
use isNotPrimaryRead=FALSE) and you still observe abnormal timings
(after you've upgraded to Rsamtools 1.13.19), please send them to
me (or make them available somewhere) and I'll have a look at it.
Thanks!
H.
> This is with Rsamtools 1.13.19 (BioC devel) which should become
> available via biocLite() in the next 36 hours or so.
>
> I should also mention that Martin and Val have plans to make
> readGappedAlignmentPairs() work on a BamFile object with a
> user-specified yieldSize. This will allow processing the file
> by chunk (like it's already possible with
readGappedAlignments()).
>
> Cheers,
> H.
>
>
> Thanks!
>
> Pan
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org="">
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org="">
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">
> Search the archives:
>
http://news.gmane.org/gmane.__science.biology.informatics.__conductor
<http: news.gmane.org="" gmane.science.biology.informatics.conductor="">
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319