Dear all,
I have noticed that msa::msaClustalW fails when it is run in parallel with foreach::foreach or BiocParallel::bplapply. Below is a small script that reproduces the error.
library(Biostrings)
library(msa)
library(doParallel)
library(foreach)
library(dplyr)
seqs <- DNAStringSetList(c("A", "AT", "T")) %>%
  rep(100)
registerDoParallel(cores=2)
res <- foreach(seqs_i = seqs) %dopar%
  msaClustalW(seqs_i)
# ERROR: Cannot open output file [internalRsequence.dnd]
# ERROR: Wrong format in tree file internalRsequence.dnd
# Error in msaClustalW(seqs_i) :
# task 40 failed - "There is an invalid aln file!"
Without parallelization, however, it works fine:
res <- foreach(seqs_i = seqs) %do%
  msaClustalW(seqs_i)
sessionInfo()
# R version 3.6.2 (2019-12-12)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 18.04.3 LTS
#
# Matrix products: default
# BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
# LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#
# locale:
# [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
# [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_US.UTF-8
# [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_US.UTF-8
# [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
# [9] LC_ADDRESS=C LC_TELEPHONE=C
# [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
#
# attached base packages:
# [1] stats4 parallel stats graphics grDevices utils datasets
# [8] methods base
#
# other attached packages:
# [1] dplyr_0.8.3 doParallel_1.0.15 iterators_1.0.12
# [4] foreach_1.4.7 msa_1.16.0 Biostrings_2.52.0
# [7] XVector_0.24.0 IRanges_2.18.3 S4Vectors_0.22.1
# [10] BiocGenerics_0.30.0
#
# loaded via a namespace (and not attached):
# [1] Rcpp_1.0.2 rstudioapi_0.10 magrittr_1.5
# [4] zlibbioc_1.30.0 tidyselect_0.2.5 BiocParallel_1.18.1
# [7] R6_2.4.0 rlang_0.4.0 tools_3.6.2
# [10] assertthat_0.2.1 tibble_2.1.3 crayon_1.3.4
# [13] purrr_0.3.2 codetools_0.2-16 glue_1.3.1
# [16] compiler_3.6.2 pillar_1.4.2 pkgconfig_2.0.3
I should also mention that the parallel version sometimes works. The error seems to become more likely as the length of the object seqs increases.
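The intermittent failures, together with the fixed file name internalRsequence.dnd in the error message, make me suspect that the parallel workers collide on a file written to the shared working directory. This is only my assumption and I have not verified it against the msa source. If it holds, one thing to try might be to run each task in its own temporary directory. Here is a minimal sketch; with_private_dir is a hypothetical helper of mine, not part of msa:

```r
# Hypothetical helper (not part of msa): evaluate f(...) inside a fresh
# temporary directory, then restore the previous working directory.
# Assumption: the failing tasks collide on a fixed-name file in getwd().
with_private_dir <- function(f, ...) {
  # Include the process id in the name so forked workers cannot clash
  tmp <- tempfile(paste0("msa_task_", Sys.getpid(), "_"))
  dir.create(tmp)
  old <- setwd(tmp)  # setwd() invisibly returns the previous directory
  on.exit({
    setwd(old)                      # go back to where we started
    unlink(tmp, recursive = TRUE)   # clean up the private directory
  }, add = TRUE)
  f(...)
}

# Intended use with the script above (untested):
# res <- foreach(seqs_i = seqs, .packages = "msa") %dopar%
#   with_private_dir(msaClustalW, seqs_i)
```

Since the helper restores the working directory on exit, it should be safe to call repeatedly within the same worker, but I do not know whether it actually avoids the error.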
I could use msa(method="Muscle") instead, which works in parallel but causes memory leaks.
Could you give me any tips on how to run msaClustalW in parallel, or tell me what I am doing wrong?
Thank you in advance. Best wishes.