Entering edit mode
Thanks for looking into this Herv? and Martin, and for the quick fix.
It is working as expected now. I did have to install the updated
packages from source as the binaries do not seem to be updating for
Mac or Linux.
Thanks,
Jeff
On Jun 18, 2014, at 1:14 PM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote:
> Hi Jeff,
>
> Sorry for the delay in answering.
>
> Thanks for reporting this. Martin Morgan was able to reproduce and
> found out that the problem is originating in the razf C code from
> Samtools (which is included in the Rsamtools package). The razf code
> provides the low level functionality for writing/reading RAZipp'ed
> compressed FASTA files, which is the format used in BioC 2.14 for
> storing the genomic sequences contained in the BSgenome data
packages.
>
> Other issues have been reported with these RAZip based BSgenome
> packages, and, more importantly, it seems that the original Samtools
> developers have moved to other projects and are not going to
maintain
> the razf code anymore.
>
> So this new issue you found convinced me that we should stop using
> the RAZip format in the BSgenome data packages. I'm currently in the
> process of rolling them back to the old on-disk format (i.e. 1
> serialized DNAString object per chromosome). Expect them to show up
in
> the BioC package repos in the next couple of days. They'll be at
> version 1.3.1000 (source packages will show up today, binary
packages
> will follow).
>
> Also note that in devel (BioC 3.0), the BSgenome data packages use
the
> 2bit format (from UCSC), which is superior in any aspect to the
> previous on-disk formats (i.e. in terms of speed, memory usage, and
> size on disk). Unfortunately I cannot use this format for the
BSgenome
> data packages in release because that relies on functionalities in
> the BSgenome *software* package that are only available in BioC
devel.
>
> Let me know if you have any questions or concerns about this.
>
> Cheers,
> H.
>
> On 06/06/2014 01:15 PM, Johnston, Jeffrey wrote:
>> Hi,
>>
>> I have encountered some issues using getSeq() on a BSgenome object
inside a function parallelized with mclapply(). When calling getSeq()
from multiple threads simultaneously, at least one will hang
indefinitely using 100% CPU:
>>
>> #------------------
>> library(GenomicRanges)
>> library(BSgenome.Dmelanogaster.UCSC.dm3)
>> gr <-
GRanges(ranges=IRanges(start=sample(seqlengths(Dmelanogaster)["chr2L"]
- 20, 10000), width=20), seqnames="chr2L", strand="+")
>> gr.list <- lapply(1:6, function(i) gr )
>>
>> seqs.list <- mclapply(gr.list, function(gr) {
>> message("getSeq() started")
>> s <- getSeq(Dmelanogaster, gr) # does not reliably return if
mc.cores > 1
>> message("getSeq() returned")
>> s
>> }, mc.cores=2)
>> #------------------
>>
>> If I instead load the BSgenome package inside the parallelized
function everything is fine:
>>
>> #------------------
>> library(GenomicRanges)
>> library(BSgenome.Dmelanogaster.UCSC.dm3)
>> gr <-
GRanges(ranges=IRanges(start=sample(seqlengths(Dmelanogaster)["chr2L"]
- 20, 10000), width=20), seqnames="chr2L", strand="+")
>> detach(name="package:BSgenome.Dmelanogaster.UCSC.dm3", unload=TRUE)
>> gr.list <- lapply(1:6, function(i) gr )
>>
>> seqs.list <- mclapply(gr.list, function(gr) {
>> library(BSgenome.Dmelanogaster.UCSC.dm3)
>> message("getSeq() started")
>> s <- getSeq(Dmelanogaster, gr) # always works
>> message("getSeq() returned")
>> s
>> }, mc.cores=2)
>> #------------------
>>
>> I can reproduce this issue on both Mac and Linux (both 64-bit).
>>
>> Is this just a limitation of BSgenome? Is there a better workaround
than making sure the package is not loaded before the call to
mclapply()?
>>
>> Thanks,
>> Jeff Johnston
>> Zeitlinger Lab
>> Stowers Institute for Medical Research
>>
>> #------------------
>>> sessionInfo()
>> R version 3.1.0 (2014-04-10)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
LC_ADDRESS=C LC_TELEPHONE=C
LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel stats graphics grDevices utils datasets
methods base
>>
>> other attached packages:
>> [1] BSgenome.Dmelanogaster.UCSC.dm3_1.3.99 BSgenome_1.32.0
Biostrings_2.32.0 XVector_0.4.0
>> [5] GenomicRanges_1.16.3 GenomeInfoDb_1.0.2
IRanges_1.22.8 BiocGenerics_0.10.0
>> [9] setwidth_1.0-3
>>
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-6 Rsamtools_1.16.0 stats4_3.1.0
zlibbioc_1.10.0
>> #------------------
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319