Hello,
I am trying to complete a sequence search using RNA-seq reads with the Biostrings function vcountPDict. However, I am running into out-of-memory errors when running vcountPDict() and I'm looking for some guidance with 1) my implementation may not be optimal or 2) what are typically the largest number of sequences in PDict object one can search for?
Background:
I have 4,722 50bp sequences and thier reverse complement that are saved into a PDict Object (length=9444 sequences). I have on average 5-6 million 75bp RNAseq reads from human Chr 16 that are saved as DNAstringSet object. I am using vcountPDict() function to search the DNAstringSet for the sequences saved in the PDict. But when I run the search with vcountPDict, I get an error "(Error: cannot allocate vector of size 154.7 Gb)" or whatever Gb it was looking for.
I've tried just making the PDict object smaller, but I'm down to 3,000 sequences and I still am getting more than half my searches failing with out-of-memory. So I'd like to hopefully figure out if I'm asking for too many searches and maybe only do 100-200 at a time? Or other suggestions?
CODE:
#create the sequence pattern dictionary
pdict_obj <- sequences %>%
pull(SeqExon, name=Name) %>%
DNAStringSet()
#Add reverse compliment to search
pdict_obj <- c(pdict_obj, magrittr::set_names(reverseComplement(pdict_obj),
paste0("revComp",names(pdict_obj))))
pdict_obj <- PDict(pdict_obj)
#Pull out the RNAseq reads for chr16
gr <- GenomicRanges::GRanges(seqnames = Rle("16"), ranges = IRanges(start=0, end=90354753))
params <- Rsamtools::ScanBamParam(which = gr, what = Rsamtools::scanBamWhat())
RNAseqReads <- Rsamtools::scanBam(bam_file, param = params)[[1]]$seq
#match the sequences
read.matches <- vcountPDict(pdict = pdict_obj,
subject = RNAseqReads,
max.mismatch = 0, min.mismatch = 0,
with.indels = FALSE)
rownames(read.matches) <- names(pdict_obj)
OS Info: I have 80Gb of memory and 4 cores running on Ubuntu version "18.04.4 LTS (Bionic Beaver)".
SessionInfo:
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS
Matrix products: default
BLAS/LAPACK: /app/software/OpenBLAS/0.3.7-GCC-8.3.0/lib/libopenblas_haswellp-r0.3.7.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8
[9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8 LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] GGally_2.0.0 gtools_3.8.2 survival_3.2-3 edgeR_3.30.3 limma_3.44.3 RColorBrewer_1.1-2 pheatmap_1.0.12
[8] purrr_0.3.4 reshape2_1.4.4 Biostrings_2.56.0 XVector_0.28.0 IRanges_2.22.2 S4Vectors_0.26.1 BiocGenerics_0.34.0
[15] DeGSEA_0.1.0 xlsx_0.6.3 readr_1.3.1 tibble_3.0.1 tidyr_1.1.0 dplyr_1.0.0 gridExtra_2.3
[22] ggplot2_3.3.2 magrittr_1.5 stringr_1.4.0 knitr_1.29
loaded via a namespace (and not attached):
[1] splines_4.0.2 RcppParallel_5.0.2 StanHeaders_2.21.0-5 assertthat_0.2.1 highr_0.8 xlsxjars_0.6.1
[7] GenomeInfoDbData_1.2.3 Rsamtools_2.4.0 pillar_1.4.4 lattice_0.20-41 glue_1.4.1 digest_0.6.25
[13] GenomicRanges_1.40.0 colorspace_1.4-1 Matrix_1.2-18 plyr_1.8.6 pkgconfig_2.0.3 rstan_2.19.3
[19] zlibbioc_1.34.0 scales_1.1.1 processx_3.4.2 BiocParallel_1.22.0 generics_0.0.2 farver_2.0.3
[25] ellipsis_0.3.1 withr_2.2.0 cli_2.0.2 crayon_1.3.4 evaluate_0.14 ps_1.3.3
[31] fansi_0.4.1 pkgbuild_1.0.8 data.table_1.12.8 tools_4.0.2 loo_2.2.0 prettyunits_1.1.1
[37] hms_0.5.3 lifecycle_0.2.0 matrixStats_0.56.0 munsell_0.5.0 locfit_1.5-9.4 callr_3.4.3
[43] GenomeInfoDb_1.24.2 compiler_4.0.2 rlang_0.4.6 RCurl_1.98-1.2 grid_4.0.2 rstudioapi_0.11
[49] bitops_1.0-6 labeling_0.3 gtable_0.3.0 inline_0.3.15 reshape_0.8.8 R6_2.4.1
[55] rJava_0.9-12 stringi_1.4.6 Rcpp_1.0.5 vctrs_0.3.1 tidyselect_1.1.0 xfun_0.15