Thread safety of BSgenome getSeq()
1
0
Entering edit mode
@jeff-johnston-6497
Last seen 7.0 years ago
United States
Hi, I have encountered some issues using getSeq() on a BSgenome object inside a function parallelized with mclapply(). When calling getSeq() from multiple threads simultaneously, at least one will hang indefinitely using 100% CPU: #------------------ library(GenomicRanges) library(BSgenome.Dmelanogaster.UCSC.dm3) gr <- GRanges(ranges=IRanges(start=sample(seqlengths(Dmelanogaster)["chr2L"] - 20, 10000), width=20), seqnames="chr2L", strand="+") gr.list <- lapply(1:6, function(i) gr ) seqs.list <- mclapply(gr.list, function(gr) { message("getSeq() started") s <- getSeq(Dmelanogaster, gr) # does not reliably return if mc.cores > 1 message("getSeq() returned") s }, mc.cores=2) #------------------ If I instead load the BSgenome package inside the parallelized function everything is fine: #------------------ library(GenomicRanges) library(BSgenome.Dmelanogaster.UCSC.dm3) gr <- GRanges(ranges=IRanges(start=sample(seqlengths(Dmelanogaster)["chr2L"] - 20, 10000), width=20), seqnames="chr2L", strand="+") detach(name="package:BSgenome.Dmelanogaster.UCSC.dm3", unload=TRUE) gr.list <- lapply(1:6, function(i) gr ) seqs.list <- mclapply(gr.list, function(gr) { library(BSgenome.Dmelanogaster.UCSC.dm3) message("getSeq() started") s <- getSeq(Dmelanogaster, gr) # always works message("getSeq() returned") s }, mc.cores=2) #------------------ I can reproduce this issue on both Mac and Linux (both 64-bit). Is this just a limitation of BSgenome? Is there a better workaround than making sure the package is not loaded before the call to mclapply()? Thanks, Jeff Johnston Zeitlinger Lab Stowers Institute for Medical Research #------------------ > sessionInfo() R version 3.1.0 (2014-04-10) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] BSgenome.Dmelanogaster.UCSC.dm3_1.3.99 BSgenome_1.32.0 Biostrings_2.32.0 XVector_0.4.0 [5] GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 IRanges_1.22.8 BiocGenerics_0.10.0 [9] setwidth_1.0-3 loaded via a namespace (and not attached): [1] bitops_1.0-6 Rsamtools_1.16.0 stats4_3.1.0 zlibbioc_1.10.0 #------------------
BSgenome BSgenome BSgenome BSgenome • 1.0k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 6 days ago
Seattle, WA, United States
Hi Jeff, Sorry for the delay in answering. Thanks for reporting this. Martin Morgan was able to reproduce and found out that the problem is originating in the razf C code from Samtools (which is included in the Rsamtools package). The razf code provides the low level functionality for writing/reading RAZipp'ed compressed FASTA files, which is the format used in BioC 2.14 for storing the genomic sequences contained in the BSgenome data packages. Other issues have been reported with these RAZip based BSgenome packages, and, more importantly, it seems that the original Samtools developers have moved to other projects and are not going to maintain the razf code anymore. So this new issue you found convinced me that we should stop using the RAZip format in the BSgenome data packages. I'm currently in the process of rolling them back to the old on-disk format (i.e. 1 serialized DNAString object per chromosome). Expect them to show up in the BioC package repos in the next couple of days. They'll be at version 1.3.1000 (source packages will show up today, binary packages will follow). Also note that in devel (BioC 3.0), the BSgenome data packages use the 2bit format (from UCSC), which is superior in any aspect to the previous on-disk formats (i.e. in terms of speed, memory usage, and size on disk). Unfortunately I cannot use this format for the BSgenome data packages in release because that relies on functionalities in the BSgenome *software* package that are only available in BioC devel. Let me know if you have any questions or concerns about this. Cheers, H. On 06/06/2014 01:15 PM, Johnston, Jeffrey wrote: > Hi, > > I have encountered some issues using getSeq() on a BSgenome object inside a function parallelized with mclapply(). When calling getSeq() from multiple threads simultaneously, at least one will hang indefinitely using 100% CPU: > > #------------------ > library(GenomicRanges) > library(BSgenome.Dmelanogaster.UCSC.dm3) > gr <- GRanges(ranges=IRanges(start=sample(seqlengths(Dmelanogaster)["chr2L"] - 20, 10000), width=20), seqnames="chr2L", strand="+") > gr.list <- lapply(1:6, function(i) gr ) > > seqs.list <- mclapply(gr.list, function(gr) { > message("getSeq() started") > s <- getSeq(Dmelanogaster, gr) # does not reliably return if mc.cores > 1 > message("getSeq() returned") > s > }, mc.cores=2) > #------------------ > > If I instead load the BSgenome package inside the parallelized function everything is fine: > > #------------------ > library(GenomicRanges) > library(BSgenome.Dmelanogaster.UCSC.dm3) > gr <- GRanges(ranges=IRanges(start=sample(seqlengths(Dmelanogaster)["chr2L"] - 20, 10000), width=20), seqnames="chr2L", strand="+") > detach(name="package:BSgenome.Dmelanogaster.UCSC.dm3", unload=TRUE) > gr.list <- lapply(1:6, function(i) gr ) > > seqs.list <- mclapply(gr.list, function(gr) { > library(BSgenome.Dmelanogaster.UCSC.dm3) > message("getSeq() started") > s <- getSeq(Dmelanogaster, gr) # always works > message("getSeq() returned") > s > }, mc.cores=2) > #------------------ > > I can reproduce this issue on both Mac and Linux (both 64-bit). > > Is this just a limitation of BSgenome? Is there a better workaround than making sure the package is not loaded before the call to mclapply()? > > Thanks, > Jeff Johnston > Zeitlinger Lab > Stowers Institute for Medical Research > > #------------------ >> sessionInfo() > R version 3.1.0 (2014-04-10) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] BSgenome.Dmelanogaster.UCSC.dm3_1.3.99 BSgenome_1.32.0 Biostrings_2.32.0 XVector_0.4.0 > [5] GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 IRanges_1.22.8 BiocGenerics_0.10.0 > [9] setwidth_1.0-3 > > loaded via a namespace (and not attached): > [1] bitops_1.0-6 Rsamtools_1.16.0 stats4_3.1.0 zlibbioc_1.10.0 > #------------------ > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT

Login before adding your answer.

Traffic: 525 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6