I noticed that readGAlignmentPairs() is slow to read in some alignments for a gene when I provide the range of the gene as 'which' in the ScanBamParam. The gene is covered by some (likely) misaligned reads, which contain a gap that spans the entire range of the gene of interest. I've pre-filtered out an additional set of alignments which had quality score < 4, and tried various flags, but it's still slow to process the alignments as pairs. I have a subset BAM file (5Mb) which reproduces the problem which I will email to maintainer [at] bioc. The genome is hg19.
> system.time({ ga <- readGAlignments("subset.bam") }) user system elapsed 0.493 0.012 0.510 > length(ga) [1] 95502 > system.time({ ga <- readGAlignmentPairs("subset.bam") }) user system elapsed 1.870 0.014 1.887 Warning message: In .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode, : 6 pairs (0 proper, 6 not proper) were dropped because the seqname or strand of the alignments in the pair were not concordant. Note that a GAlignmentPairs object can only hold concordant pairs at the moment, that is, pairs where the 2 alignments are on the opposite strands of the same reference sequence. Calls: system.time ... readGAlignmentPairs -> .make_GAlignmentPairs_from_GAlignments > length(ga) [1] 38579 > system.time({ ga <- readGAlignmentPairs("subset.bam", param=ScanBamParam(which=GRanges("chr9",IRanges(19228763,19376137)))) }) user system elapsed 177.091 2.380 179.525 Warning message: In .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode, : 1 pairs (0 proper, 1 not proper) were dropped because the seqname or strand of the alignments in the pair were not concordant. Note that a GAlignmentPairs object can only hold concordant pairs at the moment, that is, pairs where the 2 alignments are on the opposite strands of the same reference sequence. Calls: system.time ... readGAlignmentPairs -> .make_GAlignmentPairs_from_GAlignments > system.time({ ga <- readGAlignmentPairs("subset.bam", param=ScanBamParam(which=GRanges("chr9",IRanges(19228763,19376137)), flag=scanBamFlag(isProperPair=TRUE, isSecondaryAlignment=FALSE))) }) user system elapsed 148.353 2.260 150.659 > length(ga) [1] 3190 > sessionInfo() R Under development (unstable) (2015-06-25 r68588) Platform: x86_64-unknown-linux-gnu (64-bit) Running under: CentOS release 6.5 (Final) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices datasets utils [8] methods base other attached packages: [1] GenomicAlignments_1.5.12 Rsamtools_1.21.14 [3] Biostrings_2.37.2 XVector_0.9.1 [5] SummarizedExperiment_0.3.2 Biobase_2.29.1 [7] GenomicRanges_1.21.17 GenomeInfoDb_1.5.9 [9] IRanges_2.3.17 S4Vectors_0.7.12 [11] BiocGenerics_0.15.5 BiocInstaller_1.19.9 loaded via a namespace (and not attached): [1] Rcpp_0.11.6 bitops_1.0-6 futile.options_1.0.0 [4] git2r_0.10.1 zlibbioc_1.15.0 futile.logger_1.4.1 [7] lambda.r_1.1.7 BiocParallel_1.3.47 tools_3.3.0 >
Hi Michael,
Yes, making the file available somewhere or sending it to maintainer AT bioconductor.org would be helpful. Thanks!
H.