Entering edit mode
Hi,
You don't need to call findMateAlignment() when using
readGAlignmentPairs(). In a previous iteration, readGAlignmentPairs()
called findMateAlignment() but this is no longer the case.
readGAlignmentPairs() does have more overhead than readGAlignments()
because of the mate paring. These numbers are for a BAM file with
800484
records (400054 final pairs) on my not-so-speedy laptop.
library(microbenchmark)
library(RNAseqData.HNRNPC.bam.chr14)
fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES
>> microbenchmark(readGAlignments(fls[1]), times=5)
> Unit: seconds
> expr min lq median uq max
neval
> readGAlignments(fls[1]) 1.753653 1.788292 1.85527 2.019071 2.121211
5
>>
## mate pairing
>> microbenchmark(readGAlignmentPairs(fls[1]), times=5)
> Unit: seconds
> expr min lq median uq
max neval
> readGAlignmentPairs(fls[1]) 9.283966 9.300328 9.535194 9.843282
11.37441 5
## mate pairing
>> microbenchmark(readGAlignmentsList(fls[1]), times=5)
> Unit: seconds
> expr min lq median uq
max neval
> readGAlignmentsList(fls[1]) 8.409743 8.457232 8.803696 9.692845
9.867628 5
You can set a 'yieldSize' when creating the BamFile. This will chunk
through the file 'yieldSize' records at a time and is useful when
processing a complete file and when there are memory limitations.
bf <- BamFile(myfile, yieldSize = 1000000)
Herve reported some additional timings for readGAlignmentPairs() on
this
post:
https://stat.ethz.ch/pipermail/bioconductor/2014-May/059695.html
Can you provide some numbers for how long your run is taking and the
number of records you're trying to read in?
I assume 'allExons' is used for counting against the BAM files in
later
code? Maybe this is why your were looking into summarizeOverlaps(). I
should point out that if you're only interested in the regions
overlapping 'allExons' you could call range(allExons) and use that as
the 'which' in the param. This would focus the area of interest and
speed up the reading.
Valerie
On 06/30/14 12:22, Fong Chun Chan wrote:
> Hi,
>
> I've been using the GenomicFeatures R package to read in some RNA-
seq
> paired-end read data. While the readGAlignments() function reads in
the bam
> file within a minute, I've noticed that the readGAlignmentPairs()
function
> is extremely slow.
>
> This is even after restricting the space to just a single chromosome
along
> with setting the flags isDuplicate = FALSE and isNotPrimaryRead =
FALSE
> (the latter being suggested in the reference):
>
> "An easy way to avoid this degradation is to load only primary
alignments
> by setting the isNotPrimaryRead ???ag to FALSE in ScanBamParam()"
>
> Based on my understanding, it seems that the ???ndMateAlignment()
function
> has to be run first. Is there any other way to speed it up? Perhaps
I am
> doing something wrong here?
>
> Code and sessionInfo() below.
>
> Thanks!
>
> ----
> library("GenomicFeatures")
> library('GenomicAlignments')
> library('Rsamtools')
> si <- seqinfo(BamFile(arguments$bamFile))
> gr <- GRanges(arguments$chr, IRanges(100,
> seqlengths(si)[[arguments$chr]]-100))
> allExons <- exons(txdb, columns = c('gene_id', 'exon_id',
'exon_name'))
> scf <- scanBamFlag( isDuplicate = FALSE, isNotPrimaryRead = FALSE )
#
> remove duplicate reads, remove non-primary reads (i.e. multi-aligned
reads)
> reads <- readGAlignmentPairs( arguments$bamFile, param =
ScanBamParam(
> which = gr, flag = scf ) ); # grab reads in specific region
>
>> sessionInfo()
> R version 3.1.0 (2014-04-10)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
LC_PAPER=en_US.UTF-8
> LC_NAME=C LC_ADDRESS=C
LC_TELEPHONE=C
> LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets
methods
> base
>
> other attached packages:
> [1] argparse_1.0.1 proto_0.3-10
> GenomicAlignments_1.0.1 BSgenome_1.32.0 Rsamtools_1.16.1
> Biostrings_2.32.0 XVector_0.4.0
GenomicFeatures_1.16.2
> AnnotationDbi_1.26.0 Biobase_2.24.0
GenomicRanges_1.16.3
> GenomeInfoDb_1.0.2 IRanges_1.22.9
BiocGenerics_0.10.0
> vimcom.plus_0.9-93
> [16] setwidth_1.0-3 colorout_1.0-2
>
> loaded via a namespace (and not attached):
> [1] BatchJobs_1.2 BBmisc_1.7 BiocParallel_0.6.1
> biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6
checkmate_1.0
> codetools_0.2-8 DBI_0.2-7 digest_0.6.4 fail_1.2
> findpython_1.0.1 foreach_1.4.2 getopt_1.20.0
iterators_1.0.7
> plyr_1.8.1 Rcpp_0.11.2 RCurl_1.95-4.1
> [19] rjson_0.2.14 RSQLite_0.11.4 rtracklayer_1.24.2
> sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 tools_3.1.0
> XML_3.98-1.1 zlibbioc_1.10.0
>>
>
> [[alternative HTML version deleted]]
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109
Email: vobencha at fhcrc.org
Phone: (206) 667-3158