Entering edit mode
Fong Chun Chan
▴
320
@fong-chun-chan-5706
Last seen 10.6 years ago
Hi,
I've been using the GenomicFeatures R package to read in some RNA-seq
paired-end read data. While the readGAlignments() function reads in
the bam
file within a minute, I've noticed that the readGAlignmentPairs()
function
is extremely slow.
This is even after restricting the space to just a single chromosome
along
with setting the flags isDuplicate = FALSE and isNotPrimaryRead =
FALSE
(the latter being suggested in the reference):
"An easy way to avoid this degradation is to load only primary
alignments
by setting the isNotPrimaryRead ï¬ag to FALSE in ScanBamParam()"
Based on my understanding, it seems that the ï¬ndMateAlignment()
function
has to be run first. Is there any other way to speed it up? Perhaps I
am
doing something wrong here?
Code and sessionInfo() below.
Thanks!
----
library("GenomicFeatures")
library('GenomicAlignments')
library('Rsamtools')
si <- seqinfo(BamFile(arguments$bamFile))
gr <- GRanges(arguments$chr, IRanges(100,
seqlengths(si)[[arguments$chr]]-100))
allExons <- exons(txdb, columns = c('gene_id', 'exon_id',
'exon_name'))
scf <- scanBamFlag( isDuplicate = FALSE, isNotPrimaryRead = FALSE ) #
remove duplicate reads, remove non-primary reads (i.e. multi-aligned
reads)
reads <- readGAlignmentPairs( arguments$bamFile, param = ScanBamParam(
which = gr, flag = scf ) ); # grab reads in specific region
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
LC_PAPER=en_US.UTF-8
LC_NAME=C LC_ADDRESS=C
LC_TELEPHONE=C
LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets
methods
base
other attached packages:
[1] argparse_1.0.1 proto_0.3-10
GenomicAlignments_1.0.1 BSgenome_1.32.0 Rsamtools_1.16.1
Biostrings_2.32.0 XVector_0.4.0
GenomicFeatures_1.16.2
AnnotationDbi_1.26.0 Biobase_2.24.0 GenomicRanges_1.16.3
GenomeInfoDb_1.0.2 IRanges_1.22.9 BiocGenerics_0.10.0
vimcom.plus_0.9-93
[16] setwidth_1.0-3 colorout_1.0-2
loaded via a namespace (and not attached):
[1] BatchJobs_1.2 BBmisc_1.7 BiocParallel_0.6.1
biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6 checkmate_1.0
codetools_0.2-8 DBI_0.2-7 digest_0.6.4 fail_1.2
findpython_1.0.1 foreach_1.4.2 getopt_1.20.0
iterators_1.0.7
plyr_1.8.1 Rcpp_0.11.2 RCurl_1.95-4.1
[19] rjson_0.2.14 RSQLite_0.11.4 rtracklayer_1.24.2
sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 tools_3.1.0
XML_3.98-1.1 zlibbioc_1.10.0
>
[[alternative HTML version deleted]]