Hello,
I'm trying to extract the position and fragment length of paired end reads from bam files using readGAlignmentPairs and granges. There was already a posting (GenomicAlignments - Speeding up readGAlignmentPairs()) about the performance of readGAlignmentPairs, but another issue is that the granges method for GAlignmentPairs is much slower than the one from GAlignment.
I tried to find an alternative for the GAlignmentPairs granges method and found the code below to work between 4 and 7 times faster (the performance ratio depends on the size of the bam file):
ga is a larger GAlignmentPairs object.
mygranges = function(ga) {
l = granges(last(ga))
f = granges(first(ga))
pos = which(strand(f) == '+')
neg = which(strand(f) == '-')
gr.pos = GRanges(seqnames=seqnames(f[pos]),
ranges=IRanges(start=start(f[pos]), end=end(l[pos])), strand='+')
gr.neg = GRanges(seqnames=seqnames(f[neg]),
ranges=IRanges(start=start(l[neg]), end=end(f[neg])), strand='-')
i = order(c(pos, neg))
return(c(gr.pos, gr.neg)[i])
}
> system.time(gr <- mygranges(ga[1:1000000]))
user system elapsed
2.635 0.027 2.664
> system.time(gr.orig <- granges(ga[1:1000000]))
user system elapsed
11.593 0.117 11.721
> table(gr.orig == gr)
TRUE
1000000
<font face="monospace">For my full data set this is:</font>
> system.time(gr <- mygranges(ga))
user system elapsed
112.800 57.311 174.031
> system.time(gr.orig <- granges(ga))
user system elapsed
1033.727 263.714 1320.971
> table(gr.orig == gr) TRUE 47483073
It seems that the results are identical between granges and "mygranges" but the latter is much faster. However, I'm not sure that "mygranges" covers all of the original functionality (also, I'm still on 3.2, so I can't tell whether this was fixed in 3.3 or GenomicAlignments 1.8). The alternative implementation is still rather slow, but I can't think of anything simple that'd speed it up dramatically (at least not without digging into GAlignmentPairs).
If you think this granges implementation is correct (and covers all cases the original granges does) it might be worth considering it for GAlignmentPairs.
Arne
> sessionInfo()
R version 3.2.2 Patched (2015-09-03 r69280)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] GenomicAlignments_1.6.0 Rsamtools_1.22.0
[3] Biostrings_2.38.0 XVector_0.10.0
[5] SummarizedExperiment_1.0.0 Biobase_2.30.0
[7] GenomicRanges_1.22.0 GenomeInfoDb_1.6.0
[9] IRanges_2.4.0 S4Vectors_0.8.0
[11] BiocGenerics_0.16.0 BatchJobs_1.6
[13] BiocParallel_1.4.0 BiocInstaller_1.20.1
loaded via a namespace (and not attached):
[1] magrittr_1.5 zlibbioc_1.16.0 brew_1.0-6
[4] sendmailR_1.2-1 stringr_1.0.0 tools_3.2.2
[7] fail_1.2 checkmate_1.6.2 DBI_0.3.1
[10] lambda.r_1.1.7 futile.logger_1.4.1 digest_0.6.8
[13] bitops_1.0-6 base64enc_0.1-3 futile.options_1.0.0
[16] RSQLite_1.0.0 stringi_0.5-5 BBmisc_1.9
This is implemented in GenomicAlignments 1.8.3 (release) and 1.9.4 (devel). Both versions will become available via
biocLite()
in the next 48 hours or so.Thanks again for your feedback,
H.