Dear BioC team,
I am trying to leverage the power of Rsamtools
to summarize reads in a number of ways. For instance, read length (seq) of reads mapping to certain genes. Here is simplified version of one my functions:
library("Rsamtools")
get_seq_lengths <- function(bamPath, coordinates = NULL){
if (!is.null(coordinates)){
param1 <- ScanBamParam(
which = coordinates,
what = c("seq")
)
} else {
param1 <- ScanBamParam(
what=c("seq")
)
}
seq_width <- scanBam(bamPath, param = param1)
seq_width <- width(seq_width[[1]]$seq)
tmp <- as.data.frame(table(seq_width))
tmp$Bam <- bamPath
return(tmp)
}
system.time(res <- get_seq_lengths(bam1))
user system elapsed
2.572 0.084 9.688
sum(Rsamtools::idxstatsBam(file = bam1)$mapped)
2592550
This process is actually quite fast (and memory efficient) if applied to the entire bam file. However, when filtering using genomic coordinates it is (1) very slow (see bellow for a test with 100.000 coordinates, Granges
object) or (2) eat up all the memory (8GB) and eventually crash if all the genomic locations are used (~3M).
system.time(
get_seq_lengths(
bam1,
coordinates = repeats_loc[1:10^5,]
)
)
user system elapsed
80.488 0.348 80.913
repeats_loc
GRanges object with 2442918 ranges and 3 metadata columns:
seqnames ranges strand | type gene_id
<Rle> <IRanges> <Rle> | <factor> <character>
[1] chr1 1-397 - | exon DNA-3-1_DR
[2] chr1 403-462 + | exon L2-5_DRe
[3] chr1 460-532 + | exon hAT-N25_DR
[4] chr1 469-614 + | exon DNA25TWA1_DR
[5] chr1 625-1027 + | exon L2-5_DRe
... ... ... ... . ... ...
[2442914] chrUn_KN150713v1 12946-13006 - | exon Gypsy13-I_DR
[2442915] chrUn_KN150713v1 13009-13341 - | exon ENSPM-2N_DR
[2442916] chrUn_KN150713v1 13211-13370 + | exon hAT-N79_DR
[2442917] chrUn_KN150713v1 13606-13668 - | exon Gypsy13-LTR_DR
[2442918] chrUn_KN150713v1 13927-14167 + | exon HE1_DR1
gene_biotype
<character>
[1] <NA>
[2] <NA>
[3] <NA>
[4] <NA>
[5] <NA>
... ...
[2442914] <NA>
[2442915] <NA>
[2442916] <NA>
[2442917] <NA>
[2442918] <NA>
-------
seqinfo: 1061 sequences from an unspecified genome; no seqlengths
My question is if there is a way to optimize this process?
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=de_DE.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] rtracklayer_1.40.4 scales_1.0.0 ggplot2_3.0.0
[4] data.table_1.11.4 Rsamtools_1.32.2 Biostrings_2.48.0
[7] XVector_0.20.0 GenomicRanges_1.32.6 GenomeInfoDb_1.16.0
[10] IRanges_2.14.10 S4Vectors_0.18.3 BiocGenerics_0.26.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.18 pryr_0.1.4
[3] compiler_3.5.1 pillar_1.2.3
[5] plyr_1.8.4 bindr_0.1.1
[7] bitops_1.0-6 tools_3.5.1
[9] zlibbioc_1.26.0 lattice_0.20-35
[11] tibble_1.4.2 gtable_0.2.0
[13] pkgconfig_2.0.1 rlang_0.2.2
[15] Matrix_1.2-14 DBI_1.0.0
[17] DelayedArray_0.6.1 bindrcpp_0.2.2
[19] GenomeInfoDbData_1.1.0 stringr_1.3.1
[21] withr_2.1.2 dplyr_0.7.6
[23] grid_3.5.1 tidyselect_0.2.4
[25] Biobase_2.40.0 glue_1.3.0
[27] R6_2.2.2 RMySQL_0.10.15
[29] XML_3.98-1.15 BiocParallel_1.14.2
[31] purrr_0.2.5 magrittr_1.5
[33] codetools_0.2-15 matrixStats_0.53.1
[35] GenomicAlignments_1.16.0 SummarizedExperiment_1.10.1
[37] assertthat_0.2.0 colorspace_1.3-2
[39] stringi_1.2.4 RCurl_1.95-4.11
[41] lazyeval_0.2.1 munsell_0.5.0
Sorry for the belated reply, but working on other things got in the way. Thank your for the clarification regarding the separate call for each range - I had no idea it worked that way. For now I am sticking with GenomicAlignments followed by GR. Memory might be an issue down the line, but having reads being report multiple times is also less than ideal.