I answered Timothée via email, and am including the answer here for reference.
On 09/12/2014 05:40 AM, Timothée Flutre wrote:
> Thanks, I removed stringsAsFactors=TRUE from my ~/.Rprofile!
>
> When I have several files, I encountered the following error:
>> files <- dir("~/test", "*.fastq.gz$", full=TRUE) qas <- qaSummary(files,
>> type="fastq.gz")
> Error: could not find function "qaSummary"
>
> Even though it is present in the Overview vignette
> (http://www.bioconductor.org/packages/release/bioc/vignettes/ShortRead/inst/doc/Overview.pdf).
>
> I guess that qaSummary() is in fact deprecated in favor of qa(), right?
I think in that document it's used as a variable name, not a function? I don't think there's ever been a qaSummary function.
>
> Moreover, when fed with several large fastq files, qa() seems much slower than
> FastQC. Would it be possible to add a progress bar to qa()? For instance, via
> this function
> (http://stat.ethz.ch/R-manual/R-patched/library/utils/html/txtProgressBar.html)
> or this package (http://cran.r-project.org/web/packages/pbapply/)? I had a quick
> look at the ShortRead pkg source code, but couldn't find easily where to add this.
I'm not sure a progress bar would make it go faster but yes, that's a good idea; it should be added to BiocParallel. Adding verbose=TRUE will report as each file is being processed.
> I also tried to get a sense of the time it takes to run a single file, but
> encountered the following error:
>> system.time(qa <- qa(dirPath="~/test", pattern="RPI2_S1_L001_R1_001.fastq.gz",
>> type="fastq", sample=TRUE))
> user system elapsed
> 26.719 0.490 22.565
>> system.time(qa <- qa(dirPath="~/test", pattern="RPI2_S1_L001_R1_001.fastq.gz",
>> type="fastq", sample=FALSE))
> Error: 1 errors; first error:
> Error: UserArgumentMismatch
> 'pattern' must be 'character(0) or character(1)'
> For more information, use bplasterror(). To resume calculation, re-call
> the function and set the argument 'BPRESUME' to TRUE or wrap the
> previous call in bpresume().
> First traceback:
> 28: system.time(qa <- qa(dirPath = "~/test",
> pattern = "RPI2_S1_L001_R1_001.fastq.gz", type = "fastq",
> sample = FALSE))
> 27: qa(dirPath = "~/test",
> pattern = "RPI2_S1_L001_R1_001.fastq.gz", type = "fastq",
> sample = FALSE)
> 26: qa(dirPath = "~/test",
> pattern = "RPI2_S1_L001_R1_001.fastq.gz", type = "fastq",
> sample = FALSE)
> 25: .local(dirPath, ...)
> 24: .qa_fastq(dirPath, pattern, type = type, ...)
> 23: bplapply(fls, .qa_fastq_lane, type = type, ..., verbose = verbose)
> 22: bplapply(fls
> Timing stopped at: 0.013 0 0.013
>> bplasterror()
> 0 / 1 partial results stored. First 1 error messages:
> [1]: Error: UserArgumentMismatch
> 'pattern' must be 'character(0) or character(1)'
>
> I don't understand why the same command works with sample=TRUE, but doesn't with
> sample=FALSE.
That was a bug introduced fairly recently; it's been corrected in the 'devel' version 1.23.17 available all being well about this time tomorrow. Instructions for using the devel version are at
http://bioconductor.org/developers/how-to/useDevel/
I don't expect reading the full file to be anywhere near competitive with fastqc, but since we're interested in summary statistics it doesn't have to be!
For timing I see
$ time ./fastqc ~/benchmark/E-MTAB-1147/fastq/ERR127302_1.fastq.gz
[...]
real 2m37.193s
user 2m36.463s
sys 0m2.186s
versus
$ cat qa-test.R
suppressPackageStartupMessages(library(ShortRead))
fl <- "~/benchmark/E-MTAB-1147/fastq/ERR127302_1.fastq.gz"
rpt <- report(qa(fl))
~/benchmark/ShortRead-qa$ time R --silent --vanilla -f qa-test.R
> suppressPackageStartupMessages(library(ShortRead))
> fl <- "~/benchmark/E-MTAB-1147/fastq/ERR127302_1.fastq.gz"
> rpt <- report(qa(fl))
>
real 1m47.893s
user 1m44.692s
sys 0m3.068s
This uses the default sampling of 1M reads using about 4G of RAM (which seems a little excessive...); ShortRead will run in parallel if fed several files; see ?qa and the BPPARAM argument to control how parallel evaluation works, in particular you might want to make sure you don't consume all memory (e.g., using options(mc.cores=2), as this will cause swapping and serious performance degradation.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.