Question

vcountPDict Out-of-Memory Errors

0

Entering edit mode

JennyS • 0

@jennylsmith12-16139

Last seen 4 months ago

United States

Hello,

I am trying to complete a sequence search using RNA-seq reads with the Biostrings function vcountPDict. However, I am running into out-of-memory errors when running vcountPDict() and I'm looking for some guidance with 1) my implementation may not be optimal or 2) what are typically the largest number of sequences in PDict object one can search for?

Background:

I have 4,722 50bp sequences and thier reverse complement that are saved into a PDict Object (length=9444 sequences). I have on average 5-6 million 75bp RNAseq reads from human Chr 16 that are saved as DNAstringSet object. I am using vcountPDict() function to search the DNAstringSet for the sequences saved in the PDict. But when I run the search with vcountPDict, I get an error "(Error: cannot allocate vector of size 154.7 Gb)" or whatever Gb it was looking for.

I've tried just making the PDict object smaller, but I'm down to 3,000 sequences and I still am getting more than half my searches failing with out-of-memory. So I'd like to hopefully figure out if I'm asking for too many searches and maybe only do 100-200 at a time? Or other suggestions?

CODE:

#create the sequence pattern dictionary
pdict_obj <- sequences %>% 
      pull(SeqExon, name=Name) %>%
      DNAStringSet()
 #Add reverse compliment to search 
 pdict_obj <- c(pdict_obj,  magrittr::set_names(reverseComplement(pdict_obj),
                                      paste0("revComp",names(pdict_obj))))
 pdict_obj <- PDict(pdict_obj)

#Pull out the RNAseq reads for chr16
gr <- GenomicRanges::GRanges(seqnames = Rle("16"),   ranges = IRanges(start=0, end=90354753))
params <- Rsamtools::ScanBamParam(which = gr, what = Rsamtools::scanBamWhat())
RNAseqReads <- Rsamtools::scanBam(bam_file, param = params)[[1]]$seq

#match the sequences
 read.matches <- vcountPDict(pdict = pdict_obj,
                              subject = RNAseqReads,
                              max.mismatch = 0, min.mismatch = 0, 
                              with.indels = FALSE)
  rownames(read.matches) <- names(pdict_obj)

OS Info: I have 80Gb of memory and 4 cores running on Ubuntu version "18.04.4 LTS (Bionic Beaver)".

SessionInfo:

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS/LAPACK: /app/software/OpenBLAS/0.3.7-GCC-8.3.0/lib/libopenblas_haswellp-r0.3.7.so

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8       
 [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8       LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
 [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] GGally_2.0.0        gtools_3.8.2        survival_3.2-3      edgeR_3.30.3        limma_3.44.3        RColorBrewer_1.1-2  pheatmap_1.0.12    
 [8] purrr_0.3.4         reshape2_1.4.4      Biostrings_2.56.0   XVector_0.28.0      IRanges_2.22.2      S4Vectors_0.26.1    BiocGenerics_0.34.0
[15] DeGSEA_0.1.0        xlsx_0.6.3          readr_1.3.1         tibble_3.0.1        tidyr_1.1.0         dplyr_1.0.0         gridExtra_2.3      
[22] ggplot2_3.3.2       magrittr_1.5        stringr_1.4.0       knitr_1.29         

loaded via a namespace (and not attached):
 [1] splines_4.0.2          RcppParallel_5.0.2     StanHeaders_2.21.0-5   assertthat_0.2.1       highr_0.8              xlsxjars_0.6.1        
 [7] GenomeInfoDbData_1.2.3 Rsamtools_2.4.0        pillar_1.4.4           lattice_0.20-41        glue_1.4.1             digest_0.6.25         
[13] GenomicRanges_1.40.0   colorspace_1.4-1       Matrix_1.2-18          plyr_1.8.6             pkgconfig_2.0.3        rstan_2.19.3          
[19] zlibbioc_1.34.0        scales_1.1.1           processx_3.4.2         BiocParallel_1.22.0    generics_0.0.2         farver_2.0.3          
[25] ellipsis_0.3.1         withr_2.2.0            cli_2.0.2              crayon_1.3.4           evaluate_0.14          ps_1.3.3              
[31] fansi_0.4.1            pkgbuild_1.0.8         data.table_1.12.8      tools_4.0.2            loo_2.2.0              prettyunits_1.1.1     
[37] hms_0.5.3              lifecycle_0.2.0        matrixStats_0.56.0     munsell_0.5.0          locfit_1.5-9.4         callr_3.4.3           
[43] GenomeInfoDb_1.24.2    compiler_4.0.2         rlang_0.4.6            RCurl_1.98-1.2         grid_4.0.2             rstudioapi_0.11       
[49] bitops_1.0-6           labeling_0.3           gtable_0.3.0           inline_0.3.15          reshape_0.8.8          R6_2.4.1              
[55] rJava_0.9-12           stringi_1.4.6          Rcpp_1.0.5             vctrs_0.3.1            tidyselect_1.1.0       xfun_0.15

software error Biostrings • 928 views

ADD COMMENT • link 4.5 years ago JennyS • 0