Hello,
I am trying to do differential methylation analysis on WGBS paired-end data from human samples on the whole genome. I ran bismark for alignement, then removed the PCR duplicates using deduplicate_bismark, and finally extracted the methylation (deduplicate_bismark):
bismark \
$genome \
--non_directional \
-o $output_dir \
-1 $fastq1 \
-2 $fastq2
deduplicate_bismark \
--output_dir $output_dir \
--bam \
--paired \
$bam
deduplicate_bismark \
--paired-end \
--buffer_size 18G \
--parallel 4 \
--scaffolds \
--gzip \
--bedGraph \
--cytosine_report \
--genome_folder $genome \
-o $output_dir \
$deduplicated_bam
I now have these files for each sample "<sample>_bismark_bt2_pe.deduplicated.bismark.cov.gz".
I then try to load the files using bbseq::read.bismark, and it does not progress after constructing the loci. Here is my R code (i use a "Targets" csv-like file in which i write the path to the files, the sample name, and their group (there are replicates)):
#!/usr/bin/env Rscript
library(BiocParallel)
library(bsseq)
sessionInfo()
BiocManager::valid()
targets_path <- "/path/to/Targets.txt"
# Loading data
targets<-read.delim(
targets_path,
header=TRUE,
sep="\t",
stringsAsFactors=FALSE
)
print(targets)
# file sampleID condition
# 1 /path/to/1_bismark_bt2_pe.deduplicated.bismark.cov.gz Sample1 Condition1
# 2 /path/to/2_bismark_bt2_pe.deduplicated.bismark.cov.gz Sample2 Condition1
# 3 /path/to/5_bismark_bt2_pe.deduplicated.bismark.cov.gz Sample5 Condition2
# 4 /path/to/6_bismark_bt2_pe.deduplicated.bismark.cov.gz Sample6 Condition2
fileList<-targets$file
sample.ids <- targets$sampleID
descriptiondf = data.frame(
Condition = factor(targets$condition),
row.names = sample.ids,
stringsAsFactors = FALSE
)
print(descriptiondf)
# Condition
# Sample1 Condition1
# Sample2 Condition1
# Sample5 Condition2
# Sample6 Condition2
# Reading
bsseq_bismark = bsseq::read.bismark(
files = fileList,
colData = descriptiondf,
rmZeroCov = FALSE,
strandCollapse = FALSE,
BPPARAM = MulticoreParam(workers = 9, progressbar = TRUE),
nThread = 3,
BACKEND = "HDF5Array",
verbose = TRUE
)
# [read.bismark] Parsing files and constructing valid loci ...
# |
# | | 0%
# |
# |======================= | 33%
# |
# |=============================================== | 67%
# |
# |======================================================================| 100%
# Done in 48.6 secs
# [read.bismark] Parsing files and constructing 'M' and 'Cov' matrices ...
# |
# | | 0%
# -> Progress bar doesn't move
Even after a few days, the progress bar doesn't move. I also see on the running process that the number of threads running isn't more than one, even tho there are a lot of R subprocess. It does take about 12Go RAM. I think it has to do with the size of the files (they are about 300Mo big, which is around 2Go when uncompressed). Moreover, when i run the script using only a subset of the files (targeting a specific region), it works in a few seconds.
- Is there something i can do with the read.bismark options ?
- Is there another way to do it, for example running the script multiple times on specific regions then merge the results ?
- Should i use another tool ?
Here are the full logs:
Loading required package: BiocGenerics
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which.max, which.min
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':
findMatches
The following objects are masked from 'package:base':
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats
Attaching package: 'MatrixGenerics'
The following objects are masked from 'package:matrixStats':
colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
colWeightedMeans, colWeightedMedians, colWeightedSds,
colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
rowWeightedSds, rowWeightedVars
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':
rowMedians
The following objects are masked from 'package:matrixStats':
anyMissing, rowMedians
[read.bismark] Parsing files and constructing valid loci ...
|
| | 0%
|
|======================= | 33%
|
|=============================================== | 67%
|
|======================================================================| 100%
Done in 48.6 secs
[read.bismark] Parsing files and constructing 'M' and 'Cov' matrices ...
|
| | 0%
Here is my SessionInfo():
SessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Rocky Linux 9.1 (Blue Onyx)
Matrix products: default
BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP; LAPACK version 3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Paris
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] bsseq_1.38.0 SummarizedExperiment_1.32.0
[3] Biobase_2.62.0 MatrixGenerics_1.14.0
[5] matrixStats_1.3.0 GenomicRanges_1.54.1
[7] GenomeInfoDb_1.38.8 IRanges_2.36.0
[9] S4Vectors_0.40.2 BiocGenerics_0.48.1
[11] BiocParallel_1.36.0
loaded via a namespace (and not attached):
[1] SparseArray_1.2.4 bitops_1.0-7
[3] gtools_3.9.5 lattice_0.22-6
[5] sparseMatrixStats_1.14.0 grid_4.3.3
[7] R.oo_1.26.0 Matrix_1.6-5
[9] R.utils_2.12.3 restfulr_0.0.15
[11] limma_3.58.1 BSgenome_1.70.2
[13] scales_1.3.0 permute_0.9-7
[15] HDF5Array_1.30.1 XML_3.99-0.16.1
[17] Biostrings_2.70.3 codetools_0.2-20
[19] abind_1.4-5 cli_3.6.2
[21] rlang_1.1.3 crayon_1.5.2
[23] XVector_0.42.0 R.methodsS3_1.8.2
[25] munsell_0.5.1 yaml_2.3.8
[27] DelayedArray_0.28.0 S4Arrays_1.2.1
[29] tools_4.3.3 parallel_4.3.3
[31] colorspace_2.1-0 Rhdf5lib_1.24.2
[33] Rsamtools_2.18.0 locfit_1.5-9.9
[35] GenomeInfoDbData_1.2.11 R6_2.5.1
[37] BiocIO_1.12.0 rhdf5_2.46.1
[39] lifecycle_1.0.4 rtracklayer_1.62.0
[41] zlibbioc_1.48.2 glue_1.7.0
[43] data.table_1.15.4 Rcpp_1.0.12
[45] statmod_1.5.0 GenomicAlignments_1.38.2
[47] rhdf5filters_1.14.1 rjson_0.2.21
[49] DelayedMatrixStats_1.24.0 compiler_4.3.3
[51] RCurl_1.98-1.14
Here is my BiocManager state:
BiocManager::valid()
[1] TRUE
Thank you for your help and time.
Superficially, your issue is similar to https://github.com/hansenlab/bsseq/issues/136. You seem to be using unnecessary parallelisation - it should only take a ~1 minute per file to load files this size. Please have a read of that thread. In particular, to debut it you might try:
nThread = 1L
. Please read the documentation (?read.bismark
) to better understand how this parameter influences the performance.BPPARAM = SerialParam()
. You only have 4 files so you certainly don't need 9 workers running in parallel.read.bismark()
first unzips the file to a temporary directory. It's possible I guess that something is hanging in that step (e.g., not enough space in the temporary directory)Hello,
Using nThread = 1L fixed the issue.
Thank you for your time and help.