Question

Reading huge bismark coverage files using bbseq::read.bismark

0

Entering edit mode

jacques.imbert • 0

@68132695

Last seen 12 months ago

France

Hello,

I am trying to do differential methylation analysis on WGBS paired-end data from human samples on the whole genome. I ran bismark for alignement, then removed the PCR duplicates using deduplicate_bismark, and finally extracted the methylation (deduplicate_bismark):

bismark \
        $genome \
        --non_directional \
        -o $output_dir \
        -1 $fastq1 \
        -2 $fastq2

deduplicate_bismark \
        --output_dir $output_dir \
        --bam \
        --paired \
        $bam

deduplicate_bismark \
    --paired-end \
    --buffer_size 18G \
    --parallel 4 \
    --scaffolds \
    --gzip \
    --bedGraph \
    --cytosine_report \
    --genome_folder $genome \
    -o $output_dir \
    $deduplicated_bam

I now have these files for each sample "<sample>_bismark_bt2_pe.deduplicated.bismark.cov.gz".

I then try to load the files using bbseq::read.bismark, and it does not progress after constructing the loci. Here is my R code (i use a "Targets" csv-like file in which i write the path to the files, the sample name, and their group (there are replicates)):

#!/usr/bin/env Rscript

library(BiocParallel)
library(bsseq)

sessionInfo()
BiocManager::valid()

targets_path <- "/path/to/Targets.txt"

# Loading data
targets<-read.delim(
    targets_path,
    header=TRUE,
    sep="\t",
    stringsAsFactors=FALSE
    )

print(targets)
#   file    sampleID    condition
# 1 /path/to/1_bismark_bt2_pe.deduplicated.bismark.cov.gz   Sample1 Condition1
# 2 /path/to/2_bismark_bt2_pe.deduplicated.bismark.cov.gz   Sample2 Condition1
# 3 /path/to/5_bismark_bt2_pe.deduplicated.bismark.cov.gz   Sample5 Condition2
# 4 /path/to/6_bismark_bt2_pe.deduplicated.bismark.cov.gz   Sample6 Condition2

fileList<-targets$file
sample.ids <- targets$sampleID

descriptiondf = data.frame( 
    Condition = factor(targets$condition),
    row.names = sample.ids,
    stringsAsFactors = FALSE
    )

print(descriptiondf)
#         Condition
# Sample1       Condition1
# Sample2       Condition1
# Sample5       Condition2
# Sample6       Condition2


# Reading
bsseq_bismark = bsseq::read.bismark(
    files = fileList,
    colData = descriptiondf,
    rmZeroCov = FALSE,
    strandCollapse = FALSE,
    BPPARAM = MulticoreParam(workers = 9, progressbar = TRUE),
    nThread = 3,
    BACKEND = "HDF5Array",
    verbose = TRUE
)

# [read.bismark] Parsing files and constructing valid loci ...

#   |                                                                            
#   |                                                                      |   0%
#   |                                                                            
#   |=======================                                               |  33%
#   |                                                                            
#   |===============================================                       |  67%
#   |                                                                            
#   |======================================================================| 100%

# Done in 48.6 secs
# [read.bismark] Parsing files and constructing 'M' and 'Cov' matrices ...

#   |                                                                            
#   |                                                                      |   0%
# -> Progress bar doesn't move

Even after a few days, the progress bar doesn't move. I also see on the running process that the number of threads running isn't more than one, even tho there are a lot of R subprocess. It does take about 12Go RAM. I think it has to do with the size of the files (they are about 300Mo big, which is around 2Go when uncompressed). Moreover, when i run the script using only a subset of the files (targeting a specific region), it works in a few seconds.

Is there something i can do with the read.bismark options ?
Is there another way to do it, for example running the script multiple times on specific regions then merge the results ?
Should i use another tool ?

Here are the full logs:

Loading required package: BiocGenerics

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which.max, which.min

Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: S4Vectors

Attaching package: 'S4Vectors'

The following object is masked from 'package:utils':

    findMatches

The following objects are masked from 'package:base':

    expand.grid, I, unname

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats

Attaching package: 'MatrixGenerics'

The following objects are masked from 'package:matrixStats':

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds,
    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
    rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
    rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    rowWeightedSds, rowWeightedVars

Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.


Attaching package: 'Biobase'

The following object is masked from 'package:MatrixGenerics':

    rowMedians

The following objects are masked from 'package:matrixStats':

    anyMissing, rowMedians

[read.bismark] Parsing files and constructing valid loci ...

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |======================================================================| 100%

Done in 48.6 secs
[read.bismark] Parsing files and constructing 'M' and 'Cov' matrices ...

  |                                                                            
  |                                                                      |   0%

Here is my SessionInfo():

SessionInfo()

R version 4.3.3 (2024-02-29)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Rocky Linux 9.1 (Blue Onyx)

Matrix products: default
BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] bsseq_1.38.0                SummarizedExperiment_1.32.0
 [3] Biobase_2.62.0              MatrixGenerics_1.14.0      
 [5] matrixStats_1.3.0           GenomicRanges_1.54.1       
 [7] GenomeInfoDb_1.38.8         IRanges_2.36.0             
 [9] S4Vectors_0.40.2            BiocGenerics_0.48.1        
[11] BiocParallel_1.36.0        

loaded via a namespace (and not attached):
 [1] SparseArray_1.2.4         bitops_1.0-7             
 [3] gtools_3.9.5              lattice_0.22-6           
 [5] sparseMatrixStats_1.14.0  grid_4.3.3               
 [7] R.oo_1.26.0               Matrix_1.6-5             
 [9] R.utils_2.12.3            restfulr_0.0.15          
[11] limma_3.58.1              BSgenome_1.70.2          
[13] scales_1.3.0              permute_0.9-7            
[15] HDF5Array_1.30.1          XML_3.99-0.16.1          
[17] Biostrings_2.70.3         codetools_0.2-20         
[19] abind_1.4-5               cli_3.6.2                
[21] rlang_1.1.3               crayon_1.5.2             
[23] XVector_0.42.0            R.methodsS3_1.8.2        
[25] munsell_0.5.1             yaml_2.3.8               
[27] DelayedArray_0.28.0       S4Arrays_1.2.1           
[29] tools_4.3.3               parallel_4.3.3           
[31] colorspace_2.1-0          Rhdf5lib_1.24.2          
[33] Rsamtools_2.18.0          locfit_1.5-9.9           
[35] GenomeInfoDbData_1.2.11   R6_2.5.1                 
[37] BiocIO_1.12.0             rhdf5_2.46.1             
[39] lifecycle_1.0.4           rtracklayer_1.62.0       
[41] zlibbioc_1.48.2           glue_1.7.0               
[43] data.table_1.15.4         Rcpp_1.0.12              
[45] statmod_1.5.0             GenomicAlignments_1.38.2 
[47] rhdf5filters_1.14.1       rjson_0.2.21             
[49] DelayedMatrixStats_1.24.0 compiler_4.3.3           
[51] RCurl_1.98-1.14

Here is my BiocManager state:

BiocManager::valid()
[1] TRUE

Thank you for your help and time.

HDF5Array BiocGenerics BiocParallel Biobase • 1.1k views

ADD COMMENT • link 12 months ago jacques.imbert • 0

score 2 · Accepted Answer · 2024-04-16

2

Entering edit mode

Peter Hickey ▴ 750

@petehaitch

Last seen 3 months ago

WEHI, Melbourne, Australia

I'm one of the bsseq developers. Are you able to share one of your files with me (e.g., via Dropbox, Google Drive, OneDrive, etc.). That'll make debugging much easier. FYI if you'd included the bsseq tag on the post then I would've been notified of it.

ADD COMMENT • link 12 months ago Peter Hickey ▴ 750

1

Entering edit mode

Superficially, your issue is similar to https://github.com/hansenlab/bsseq/issues/136. You seem to be using unnecessary parallelisation - it should only take a ~1 minute per file to load files this size. Please have a read of that thread. In particular, to debut it you might try:

Only use nThread = 1L. Please read the documentation (?read.bismark) to better understand how this parameter influences the performance.
Only reading 1 file instead of trying to load all 4.
Use BPPARAM = SerialParam(). You only have 4 files so you certainly don't need 9 workers running in parallel.
Unzipping 1 of the files manually and then reading the unzipped file in. To read a gzipped-file, the read.bismark() first unzips the file to a temporary directory. It's possible I guess that something is hanging in that step (e.g., not enough space in the temporary directory)

ADD REPLY • link 12 months ago Peter Hickey ▴ 750

0

Entering edit mode

Hello,

Using nThread = 1L fixed the issue.

Thank you for your time and help.

ADD REPLY • link 12 months ago jacques.imbert • 0