write.XStringViews/write.XStringSet highly inefficient
@michael-dondrup-3849
Last seen 10.2 years ago
Hi,

I was trying to use write.XStringViews on a larger dataset, but to no avail. It seems that it is not implemented efficiently. What I am trying is:

I downloaded http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr1.fa.gz

> library(Biostrings)
> dnasts <- read.DNAStringSet(file="chr1.fa")
# break up the fasta file into segments of size 60
> dnaviews <- Views(dnasts[[1]], start = seq(1, length(dnasts[[1]]), 60), width=60)
> write.XStringViews(dnaviews, file="out.fa")

... I interrupted the process after 1h, by which point it had reached a memory peak of over 3GB. In principle, the whole task should not take longer than a few seconds. I found this report:
https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2010-April/001160.html
I guess that is the same problem? Has there been any progress?

Is there perhaps a more efficient way of implementing this, e.g. using cat()?

Thanks a lot
Michael

> sessionInfo()
R version 2.11.1 (2010-05-31)
x86_64-unknown-linux-gnu

locale:
[1] C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] Biostrings_2.16.9 IRanges_1.6.8

loaded via a namespace (and not attached):
[1] Biobase_2.8.0
@martin-morgan-1513
Last seen 4 months ago
United States
On 07/27/2010 04:56 AM, Michael Dondrup wrote:
> Hi,
>
> I was trying to use write.XStringView on a larger dataset but to no avail. It seems like it is not implemented
> efficiently. What I am trying is:
>
> I downloaded http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr1.fa.gz
>
>> library(Biostrings)
>> dnasts <- read.DNAStringSet(file="chr1.fa")

Hi Michael -- This is also

    library(BSgenome.Hsapiens.UCSC.hg18)
    Hsapiens
    chr1 = unmasked(Hsapiens[["chr1"]])

> # break up the fasta file into segments of size 60
>> dnaviews <- Views(dnasts[[1]], start = seq(1, length(dnasts[[1]]), 60), width=60)

... and

    dnaviews <- Views(chr1, successiveIRanges(rep(60, ceiling(length(chr1) / 60))))

>> write.XStringViews(dnaviews, file="out.fa")

    > system.time(write.XStringSet(as(dnaviews, "DNAStringSet"), file=tempfile()))
       user  system elapsed
      7.024   0.756   8.030

but this is with

    > sessionInfo()
    R version 2.12.0 Under development (unstable) (2010-07-20 r52579)
    Platform: x86_64-unknown-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
     [9] LC_ADDRESS=C LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] BSgenome.Hsapiens.UCSC.hg18_1.3.16 BSgenome_1.17.6
    [3] Biostrings_2.17.24 GenomicRanges_1.1.17
    [5] IRanges_1.7.12

    loaded via a namespace (and not attached):
    [1] Biobase_2.9.0 tools_2.12.0

> ... I interrupted the process after 1h reaching a memory peak of over 3GB.
> In principle doing the whole task should not take longer than a few seconds. I found this report:
> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2010-April/001160.html
> I guess that is the same problem? Has there been any progress?

so yes, there is progress but it requires use of the 'devel' version of R and Bioconductor.
There were a couple of other posts in that thread

    fasta = character(2 * length(dna))
    fasta[c(TRUE, FALSE)] = paste(">", names(dna), sep="")
    fasta[c(FALSE, TRUE)] = as.character(dna)
    writeLines(fasta, fl)

and the more complete patch that seemed not to make it to the mailing list directly but that is in

http://www.mail-archive.com/bioc-sig-sequencing@r-project.org/msg01135.html

I wonder what you're going to do with your fasta file now?

Hope that helps,

Martin

> Is there probably a more efficient way of implementing this, e.g. using cat()?
>
> Thanks a lot
> Michael

--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
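
The interleaving idea quoted in the reply (headers in the odd slots of a character vector, sequences in the even slots, then a single writeLines() call) can be tried without Biostrings. The following is only a base-R sketch on a toy string, not the code from the patch mentioned above. The function name write_fasta_chunks, the "seg" prefix, and the start-end record names are made up for illustration, and the input is assumed to be a single non-empty character string. It also clips the last window to the end of the sequence, which is an addition not present in the code in this thread.

```r
# Base-R sketch; no Biostrings required. All names here are hypothetical.
write_fasta_chunks <- function(x, width = 60L, file, prefix = "seg") {
  n <- nchar(x)                            # assumes x is one non-empty string
  starts <- seq.int(1L, n, by = width)
  ends <- pmin(starts + width - 1L, n)     # clip the last window to the sequence end
  chunks <- substring(x, starts, ends)     # vectorized extraction of all windows

  # Interleave: odd slots hold ">name" headers, even slots hold the chunks.
  fasta <- character(2L * length(chunks))
  fasta[c(TRUE, FALSE)] <- paste0(">", prefix, "_", starts, "-", ends)
  fasta[c(FALSE, TRUE)] <- chunks

  writeLines(fasta, file)                  # one vectorized write for all records
  invisible(length(chunks))
}

# Toy input standing in for a chromosome.
set.seed(1)
toy <- paste(sample(c("A", "C", "G", "T"), 150, replace = TRUE), collapse = "")
out <- tempfile(fileext = ".fa")
n_records <- write_fasta_chunks(toy, width = 60L, file = out)
```

The point of the design is that the file is written in one call on a pre-built vector rather than record by record. Whether this scales acceptably to a whole chromosome held as a plain R string has not been measured here.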