Hi all,
There has been an attempt to benchmark programming languages (mostly high level) for common tasks in Bioinformatics. One of the task includes simple parsing fastq file to count the number of reads, total sequence and quality lengths. I attempted this in pure base R and with Biostrings
package.
Relevant code is here
Below are the results in processing a sample fastq of 250,000 reads (first 100K lines of provided test dataset). Benchmarking is done with Hyperfine.
Asgard:biofast anand$ hyperfine --min-runs 3 './fqcnt/fqcnt.r test.fq'
Benchmark #1: ./fqcnt/fqcnt2.r test.fq
Time (mean ± σ): 6.078 s ± 0.291 s [User: 5.734 s, System: 0.170 s]
Range (min … max): 5.742 s … 6.248 s 3 runs
Asgard:biofast anand$ hyperfine --min-runs 3 './fqcnt/fqcnt.r test.fq.gz'
Benchmark #1: ./fqcnt/fqcnt2.r test.fq.gz
Time (mean ± σ): 6.031 s ± 0.373 s [User: 5.779 s, System: 0.119 s]
Range (min … max): 5.645 s … 6.389 s 3 runs
Asgard:biofast anand$ hyperfine --min-runs 3 './fqcnt/fqcnt_R_BioStrings.r test.fq'
Benchmark #1: ./fqcnt/fqcnt_R_BioStrings.r test.fq
Time (mean ± σ): 3.588 s ± 0.591 s [User: 3.190 s, System: 0.252 s]
Range (min … max): 3.144 s … 4.259 s 3 runs
Asgard:biofast anand$ hyperfine --min-runs 3 './fqcnt/fqcnt_R_BioStrings.r test.fq.gz'
Benchmark #1: ./fqcnt/fqcnt_R_BioStrings.r test.fq.gz
Time (mean ± σ): 10.481 s ± 0.386 s [User: 9.785 s, System: 0.416 s]
Range (min … max): 10.056 s … 10.810 s 3 runs
As expected base-R code is the slowest. However I am curious why Biostrings
struggles with gzip'd file? It is almost 3X slower compared to plain text fastq (and scales up with file size). This difference is quite small with base-R.
I suspect it has something to do with the file being opened with a connection and processed in chunks?
I have tested this on a MaxOS Catalina (8GB RAM, 2.7 GHz Dual-Core Intel Core i5)
> sessionInfo(package="Biostrings")
R version 4.0.0 (2020-04-24)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.2
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
character(0)
other attached packages:
[1] Biostrings_2.56.0
loaded via a namespace (and not attached):
[1] compiler_4.0.0 graphics_4.0.0 tools_4.0.0 utils_4.0.0
[5] grDevices_4.0.0 stats_4.0.0 datasets_4.0.0 methods_4.0.0
[9] base_4.0.0
Any pointers on how to improve the code?