addPatternDensity in 'qsea' takes too much time
1
0
Entering edit mode
mbk0asis • 0
@mbk0asis-7155
Last seen 2.3 years ago
Korea, Republic Of

Hello!

I'm using qsea to analyze my MBD-seq data.

I worked flawlessly until addPatternDensity was performed.

It took almost a week to process my data consist of 2 bam files for KO and control.

Running on the multiple threads using BiocParallel also took long time.

How can I speed up the process?

Thank you!!

> samples <- data.frame(sample_name = c("KO", "WT"),
                                     file_name = c("KO.bam", "WT.bam"),
                                     group = c("KO","WT"))

> path = wd

> samples$file_name=paste0(path,"/",samples$file_name )

> qseaSet = createQseaSet(sampleTable=samples,
                                             BSgenome="BSgenome.Hsapiens.UCSC.hg38",
                                             chr.select=paste0("chr", 1:22),
                                             window_size=300)
> qseaSet = addCoverage(qseaSet, uniquePos=TRUE, paired=TRUE)
> qseaSet=addCNV(qseaSet, file_name="file_name",window_size=2e6, 
                                paired=TRUE, parallel=FALSE, MeDIP=TRUE)
> qseaSet=addLibraryFactors(qseaSet)
> qseaSet=addPatternDensity(qseaSet, "CG", name="CpG")



>  sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] qlcMatrix_0.9.7             sparsesvd_0.2               slam_0.1-48                
 [4] Matrix_1.2-18               reshape2_1.4.4              forcats_0.5.0              
 [7] dplyr_1.0.2                 purrr_0.3.4                 readr_1.4.0                
[10] tidyr_1.1.2                 tibble_3.0.4                tidyverse_1.3.0            
[13] stringr_1.4.0               splitstackshape_1.4.8       patchwork_1.1.1            
[16] pcaExplorer_2.16.0          ggpubr_0.4.0                ggrepel_0.9.0              
[19] ggplot2_3.3.2               DESeq2_1.30.0               SummarizedExperiment_1.20.0
[22] Biobase_2.50.0              MatrixGenerics_1.2.0        matrixStats_0.57.0         
[25] GenomicRanges_1.42.0        GenomeInfoDb_1.26.2         IRanges_2.24.1             
[28] S4Vectors_0.28.1            BiocGenerics_0.36.0        

loaded via a namespace (and not attached):
  [1] GOstats_2.56.0         readxl_1.3.1           backports_1.2.1        BiocFileCache_1.14.0  
  [5] NMF_0.23.0             plyr_1.8.6             igraph_1.2.6           lazyeval_0.2.2        
  [9] GSEABase_1.52.1        shinydashboard_0.7.1   splines_4.0.3          BiocParallel_1.24.1   
 [13] crosstalk_1.1.0.1      gridBase_0.4-7         digest_0.6.27          foreach_1.5.1         
 [17] htmltools_0.5.0        viridis_0.5.1          GO.db_3.12.1           fansi_0.4.1           
 [21] magrittr_2.0.1         memoise_1.1.0          cluster_2.1.0          doParallel_1.0.16     
 [25] limma_3.46.0           openxlsx_4.2.3         annotate_1.68.0        modelr_0.1.8          
 [29] docopt_0.7.1           askpass_1.1            prettyunits_1.1.1      colorspace_2.0-0      
 [33] rvest_0.3.6            blob_1.2.1             rappdirs_0.3.1         haven_2.3.1           
 [37] xfun_0.19              crayon_1.3.4           RCurl_1.98-1.2         jsonlite_1.7.2        
 [41] graph_1.68.0           genefilter_1.72.0      survival_3.2-7         iterators_1.0.13      
 [45] glue_1.4.2             registry_0.5-1         gtable_0.3.0           zlibbioc_1.36.0       
 [49] XVector_0.30.0         webshot_0.5.2          DelayedArray_0.16.0    car_3.0-10            
 [53] Rgraphviz_2.34.0       abind_1.4-5            SparseM_1.78           scales_1.1.1          
 [57] pheatmap_1.0.12        DBI_1.1.0              rngtools_1.5           rstatix_0.6.0         
 [61] Rcpp_1.0.5             viridisLite_0.3.0      xtable_1.8-4           progress_1.2.2        
 [65] foreign_0.8-79         bit_4.0.4              DT_0.16                AnnotationForge_1.32.0
 [69] htmlwidgets_1.5.3      httr_1.4.2             threejs_0.3.3          shinyAce_0.4.1        
 [73] RColorBrewer_1.1-2     ellipsis_0.3.1         pkgconfig_2.0.3        XML_3.99-0.5          
 [77] dbplyr_2.0.0           locfit_1.5-9.4         tidyselect_1.1.0       rlang_0.4.9           
 [81] later_1.1.0.1          AnnotationDbi_1.52.0   munsell_0.5.0          cellranger_1.1.0      
 [85] tools_4.0.3            cli_2.2.0              generics_0.1.0         RSQLite_2.2.1         
 [89] broom_0.7.3            shinyBS_0.61           evaluate_0.14          fastmap_1.0.1         
 [93] heatmaply_1.1.1        fs_1.5.0               knitr_1.30             bit64_4.0.5           
 [97] zip_2.1.1              dendextend_1.14.0      RBGL_1.66.0            mime_0.9              
[101] xml2_1.3.2             biomaRt_2.46.0         compiler_4.0.3         rstudioapi_0.13       
[105] plotly_4.9.2.2         curl_4.3               ggsignif_0.6.0         reprex_0.3.0          
[109] geneplotter_1.68.0     stringi_1.5.3          lattice_0.20-41        vctrs_0.3.6           
[113] pillar_1.4.7           lifecycle_0.2.0        data.table_1.13.4      bitops_1.0-6          
[117] seriation_1.2-9        httpuv_1.5.4           R6_2.5.0               TSP_1.1-10            
[121] promises_1.1.1         topGO_2.42.0           gridExtra_2.3          rio_0.5.16            
[125] codetools_0.2-18       assertthat_0.2.1       Category_2.56.0        openssl_1.4.3         
[129] pkgmaker_0.32.2        withr_2.3.0            GenomeInfoDbData_1.2.4 hms_0.5.3             
[133] grid_4.0.3             rmarkdown_2.6          carData_3.0-4          lubridate_1.7.9.2     
[137] shiny_1.5.0            base64enc_0.1-3
MBD-seq qsea BiocParallel • 1.2k views
ADD COMMENT
1
Entering edit mode
@matthias-lienhard-6292
Last seen 10 months ago
Max Planck Institute for molecular Geneā€¦

Hi,

Usually this function should not take more than a couple of minutes. I can think of two possible explanations why this step fails:

The first would be simple: you ran out of RAM. Can you please confirm that the machine is not swapping.

The second explanation would be unreasonable long insert size and/or high variance estimates. I explained the functionality of addPatternDensity in this answer, in response to question 2: A: MBDseq analysis using qsea package. In Brief, the function estimates for each genomic window the average number of CpGs per fragment that would be counted for this window. Thus if the fragment length is very long (say >10000) a single CpG impacts many 300 base windows. As the variance of the fragment size is also taken into account, this may get computationally expensive if the values exceed a reasonable range.

you should check the fragment length parameters:

qseaSet@libraries

You could manually provide reasonable fragment length parameters for addPatternDensity, e.g.

 qseaSet=addPatternDensity(qseaSet, "CG", name="CpG", fragment_length=200, fragment_sd=20)

This should be feasible to compute within the normal runtime, but probably its better to find the cause of the unreasonable estimated fragment length parameters and to filter the data correspondingly. E.g. if your data contains many reads with long insert size (e.g. discordant pairs) you could filter those in the bam.

Best, Matthias

ADD COMMENT

Login before adding your answer.

Traffic: 705 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6