How to deal with a 30G FASTQ file
wang peter
It is too slow to read them into memory. Can anyone tell me whether I need to split them with another program, or whether there is an R function I can call to split them? Thanks.
Martin Morgan
On 10/05/2011 08:04 PM, wang peter wrote:
> It is too slow to read them into memory. Can anyone tell me whether I
> need to split them with another program, or whether there is an R
> function I can call to split them?

ShortRead::FastqSampler streams the entire file but returns a subset (often faster than reading in the whole data). ShortRead::FastqStreamer (in development) iterates over the file:

    fq = FastqStreamer(<...>)
    while (length(res <- yield(fq)))
        # work, e.g., filter

A cheap hack is to force R to allocate a large amount of memory up front and then run the operation:

    replicate(10, raw(1e9))  ## that's a lot
    dna = readFastq(...)

The 'withIds=FALSE' argument to readFastq can save a lot of time if ids are not necessary.

If the records are all 4 lines long it is very easy to split a file (untested code; the Linux pros would use awk for efficient processing; check out StackOverflow / Biostar):

    fl = file("foo.fastq", "r")
    idx = 0
    while (length(recs <- readLines(fl, n = 1000000))) {
        idx <- idx + 1
        writeLines(recs, sprintf("fout-%d.fastq", idx))
    }
    close(fl)

Once split, on Linux / Mac use library(multicore) or library(parallel) (R-2.14 or later) and

    mclapply(seq_len(idx), function(i) {
        fq = readFastq(sprintf("fout-%d.fastq", i))
        ## work, then...
        TRUE
    })

to process in parallel (it doesn't make sense to try to read the chunks in parallel and return them back to a 'master').

Martin
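To make the "# work, e.g., filter" placeholder concrete, here is a minimal sketch of both approaches, assuming ShortRead is loaded, a hypothetical input file named big.fastq, and a simple minimum-length filter standing in for real per-chunk work:

    library(ShortRead)

    ## FastqSampler: streams the whole file once, returns a random subset
    sampler <- FastqSampler("big.fastq", n = 1e6)   ## keep ~1e6 records
    sub <- yield(sampler)
    close(sampler)

    ## FastqStreamer: iterate in fixed-size chunks; filter each chunk
    ## and append the surviving records to an output file
    fq <- FastqStreamer("big.fastq", n = 1e6)       ## 1e6 records per chunk
    while (length(res <- yield(fq))) {
        keep <- res[width(res) >= 50]               ## e.g., drop short reads
        writeFastq(keep, "big-filtered.fastq.gz", mode = "a")
    }
    close(fq)

In both cases the n= argument bounds peak memory by the chunk (or sample) size rather than by the 30G file.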
Hi Martin,

Just wanted to say:

On Wed, Oct 5, 2011 at 11:39 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> fq = FastqStreamer(<...>)
> while (length(res <- yield(fq)))
>     # work, e.g., filter

That's really cool!

Then some navel gazing: have you thought about "inverting" this flow? Like, run the while loop in "C-land" but pass an R expression/block/something in and have it be evaluated within each iteration of the C while loop?

I'm guessing calling an R function from within C code is costly, but "while" loops in R are also slow (compared to while loops in C), so I wonder which would win in the long run.

Just curious -- sorry if I missed some previous discussion on this topic.

Anyway, like I said -- this is really cool already.

Thanks,
-steve
Hi Steve --

On 10/06/2011 06:48 AM, Steve Lianoglou wrote:
> That's really cool!

Anita Lerch suggested and helped to implement this.

> Have you thought about "inverting" this flow? Like, run the while loop
> in "C-land" but pass an R expression/block/something in and have it be
> evaluated within each iteration of the C while loop?
>
> I'm guessing calling an R function from within C code is costly, but
> "while" loops in R are also slow (compared to while loops in C), so I
> wonder which would win in the long run.

Rsamtools::applyPileups does this. In some ways it's like lapply(<obj>, FUN), where the user provides FUN and applyPileups does work at the C level to prepare data for FUN. FUN is like the '# work' step above -- both expect to do stuff on R objects using R code. For this reason both are going to be efficient only if they operate on vectors, hence on chunks (e.g., millions of records) of the fastq or bam file.

So yield() and applyPileups() have a similar task -- efficiently create a chunk of data to be processed, then pass it to the user. Since they're both function calls, they are both free to create those objects in R or C as appropriate. The big difference is really in how the results of the iteration or the apply are aggregated: yield() relies on the user to do something ('aggregate by writing to a file', or 'pre-allocate a result vector and fill it in with each iteration') whereas applyPileups returns a list, with each element the result of FUN. If there were clear aggregation strategies then the apply-style approach might have additional advantages.

This is still a bit of a work in progress, so ideas welcome; one might easily imagine that lapply(FastqStreamer(<...>), FUN, ...) could be implemented in a straightforward way, for instance.

Martin
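As a rough illustration of that lapply-over-a-streamer idea (streamApply is a hypothetical name, not an actual ShortRead function), the list-style aggregation Martin describes could be sketched like this:

    ## Hypothetical sketch: wrap the yield() loop so per-chunk results
    ## are collected into a list, applyPileups-style
    streamApply <- function(fq, FUN, ...) {
        result <- list()
        i <- 0L
        while (length(chunk <- yield(fq))) {
            i <- i + 1L
            result[[i]] <- FUN(chunk, ...)   ## user code sees one chunk
        }
        result
    }

    ## e.g., count records per chunk of a (hypothetical) big.fastq:
    ## fq <- FastqStreamer("big.fastq", n = 1e6)
    ## nrec <- streamApply(fq, length)
    ## close(fq)

The R-level while loop adds negligible overhead here because each iteration processes a chunk of a million records, which is the point both posts make about vectorized work.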
