Question

addPatternDensity in 'qsea' takes too much time

0

Entering edit mode

mbk0asis • 0

@mbk0asis-7155

Last seen 2.8 years ago

Korea, Republic Of

Hello!

I'm using qsea to analyze my MBD-seq data.

I worked flawlessly until addPatternDensity was performed.

It took almost a week to process my data consist of 2 bam files for KO and control.

Running on the multiple threads using BiocParallel also took long time.

How can I speed up the process?

Thank you!!

> samples <- data.frame(sample_name = c("KO", "WT"),
                                     file_name = c("KO.bam", "WT.bam"),
                                     group = c("KO","WT"))

> path = wd

> samples$file_name=paste0(path,"/",samples$file_name )

> qseaSet = createQseaSet(sampleTable=samples,
                                             BSgenome="BSgenome.Hsapiens.UCSC.hg38",
                                             chr.select=paste0("chr", 1:22),
                                             window_size=300)
> qseaSet = addCoverage(qseaSet, uniquePos=TRUE, paired=TRUE)
> qseaSet=addCNV(qseaSet, file_name="file_name",window_size=2e6, 
                                paired=TRUE, parallel=FALSE, MeDIP=TRUE)
> qseaSet=addLibraryFactors(qseaSet)
> qseaSet=addPatternDensity(qseaSet, "CG", name="CpG")



>  sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] qlcMatrix_0.9.7             sparsesvd_0.2               slam_0.1-48                
 [4] Matrix_1.2-18               reshape2_1.4.4              forcats_0.5.0              
 [7] dplyr_1.0.2                 purrr_0.3.4                 readr_1.4.0                
[10] tidyr_1.1.2                 tibble_3.0.4                tidyverse_1.3.0            
[13] stringr_1.4.0               splitstackshape_1.4.8       patchwork_1.1.1            
[16] pcaExplorer_2.16.0          ggpubr_0.4.0                ggrepel_0.9.0              
[19] ggplot2_3.3.2               DESeq2_1.30.0               SummarizedExperiment_1.20.0
[22] Biobase_2.50.0              MatrixGenerics_1.2.0        matrixStats_0.57.0         
[25] GenomicRanges_1.42.0        GenomeInfoDb_1.26.2         IRanges_2.24.1             
[28] S4Vectors_0.28.1            BiocGenerics_0.36.0        

loaded via a namespace (and not attached):
  [1] GOstats_2.56.0         readxl_1.3.1           backports_1.2.1        BiocFileCache_1.14.0  
  [5] NMF_0.23.0             plyr_1.8.6             igraph_1.2.6           lazyeval_0.2.2        
  [9] GSEABase_1.52.1        shinydashboard_0.7.1   splines_4.0.3          BiocParallel_1.24.1   
 [13] crosstalk_1.1.0.1      gridBase_0.4-7         digest_0.6.27          foreach_1.5.1         
 [17] htmltools_0.5.0        viridis_0.5.1          GO.db_3.12.1           fansi_0.4.1           
 [21] magrittr_2.0.1         memoise_1.1.0          cluster_2.1.0          doParallel_1.0.16     
 [25] limma_3.46.0           openxlsx_4.2.3         annotate_1.68.0        modelr_0.1.8          
 [29] docopt_0.7.1           askpass_1.1            prettyunits_1.1.1      colorspace_2.0-0      
 [33] rvest_0.3.6            blob_1.2.1             rappdirs_0.3.1         haven_2.3.1           
 [37] xfun_0.19              crayon_1.3.4           RCurl_1.98-1.2         jsonlite_1.7.2        
 [41] graph_1.68.0           genefilter_1.72.0      survival_3.2-7         iterators_1.0.13      
 [45] glue_1.4.2             registry_0.5-1         gtable_0.3.0           zlibbioc_1.36.0       
 [49] XVector_0.30.0         webshot_0.5.2          DelayedArray_0.16.0    car_3.0-10            
 [53] Rgraphviz_2.34.0       abind_1.4-5            SparseM_1.78           scales_1.1.1          
 [57] pheatmap_1.0.12        DBI_1.1.0              rngtools_1.5           rstatix_0.6.0         
 [61] Rcpp_1.0.5             viridisLite_0.3.0      xtable_1.8-4           progress_1.2.2        
 [65] foreign_0.8-79         bit_4.0.4              DT_0.16                AnnotationForge_1.32.0
 [69] htmlwidgets_1.5.3      httr_1.4.2             threejs_0.3.3          shinyAce_0.4.1        
 [73] RColorBrewer_1.1-2     ellipsis_0.3.1         pkgconfig_2.0.3        XML_3.99-0.5          
 [77] dbplyr_2.0.0           locfit_1.5-9.4         tidyselect_1.1.0       rlang_0.4.9           
 [81] later_1.1.0.1          AnnotationDbi_1.52.0   munsell_0.5.0          cellranger_1.1.0      
 [85] tools_4.0.3            cli_2.2.0              generics_0.1.0         RSQLite_2.2.1         
 [89] broom_0.7.3            shinyBS_0.61           evaluate_0.14          fastmap_1.0.1         
 [93] heatmaply_1.1.1        fs_1.5.0               knitr_1.30             bit64_4.0.5           
 [97] zip_2.1.1              dendextend_1.14.0      RBGL_1.66.0            mime_0.9              
[101] xml2_1.3.2             biomaRt_2.46.0         compiler_4.0.3         rstudioapi_0.13       
[105] plotly_4.9.2.2         curl_4.3               ggsignif_0.6.0         reprex_0.3.0          
[109] geneplotter_1.68.0     stringi_1.5.3          lattice_0.20-41        vctrs_0.3.6           
[113] pillar_1.4.7           lifecycle_0.2.0        data.table_1.13.4      bitops_1.0-6          
[117] seriation_1.2-9        httpuv_1.5.4           R6_2.5.0               TSP_1.1-10            
[121] promises_1.1.1         topGO_2.42.0           gridExtra_2.3          rio_0.5.16            
[125] codetools_0.2-18       assertthat_0.2.1       Category_2.56.0        openssl_1.4.3         
[129] pkgmaker_0.32.2        withr_2.3.0            GenomeInfoDbData_1.2.4 hms_0.5.3             
[133] grid_4.0.3             rmarkdown_2.6          carData_3.0-4          lubridate_1.7.9.2     
[137] shiny_1.5.0            base64enc_0.1-3

MBD-seq qsea BiocParallel • 1.3k views

ADD COMMENT • link updated 4.3 years ago by Matthias Lienhard ▴ 280 • written 4.3 years ago by mbk0asis • 0

score 1 · Answer 1 · 2021-01-04

Hi,

Usually this function should not take more than a couple of minutes. I can think of two possible explanations why this step fails:

The first would be simple: you ran out of RAM. Can you please confirm that the machine is not swapping.

The second explanation would be unreasonable long insert size and/or high variance estimates. I explained the functionality of addPatternDensity in this answer, in response to question 2: A: MBDseq analysis using qsea package. In Brief, the function estimates for each genomic window the average number of CpGs per fragment that would be counted for this window. Thus if the fragment length is very long (say >10000) a single CpG impacts many 300 base windows. As the variance of the fragment size is also taken into account, this may get computationally expensive if the values exceed a reasonable range.

you should check the fragment length parameters:

qseaSet@libraries

You could manually provide reasonable fragment length parameters for addPatternDensity, e.g.

 qseaSet=addPatternDensity(qseaSet, "CG", name="CpG", fragment_length=200, fragment_sd=20)

This should be feasible to compute within the normal runtime, but probably its better to find the cause of the unreasonable estimated fragment length parameters and to filter the data correspondingly. E.g. if your data contains many reads with long insert size (e.g. discordant pairs) you could filter those in the bam.

Best, Matthias