Question

Heng Li's fastq parsing benchmark - attempt in R

0

Entering edit mode

anand_mt ▴ 90

@anand_mt-10016

Last seen 4.0 years ago

Hi all,

There has been an attempt to benchmark programming languages (mostly high level) for common tasks in Bioinformatics. One of the task includes simple parsing fastq file to count the number of reads, total sequence and quality lengths. I attempted this in pure base R and with Biostrings package. Relevant code is here

Below are the results in processing a sample fastq of 250,000 reads (first 100K lines of provided test dataset). Benchmarking is done with Hyperfine.

Asgard:biofast anand$ hyperfine --min-runs 3 './fqcnt/fqcnt.r test.fq'
Benchmark #1: ./fqcnt/fqcnt2.r test.fq
  Time (mean ± σ):      6.078 s ±  0.291 s    [User: 5.734 s, System: 0.170 s]
  Range (min … max):    5.742 s …  6.248 s    3 runs

Asgard:biofast anand$ hyperfine --min-runs 3 './fqcnt/fqcnt.r test.fq.gz'
Benchmark #1: ./fqcnt/fqcnt2.r test.fq.gz
  Time (mean ± σ):      6.031 s ±  0.373 s    [User: 5.779 s, System: 0.119 s]
  Range (min … max):    5.645 s …  6.389 s    3 runs

Asgard:biofast anand$ hyperfine --min-runs 3 './fqcnt/fqcnt_R_BioStrings.r test.fq'
Benchmark #1: ./fqcnt/fqcnt_R_BioStrings.r test.fq
  Time (mean ± σ):      3.588 s ±  0.591 s    [User: 3.190 s, System: 0.252 s]
  Range (min … max):    3.144 s …  4.259 s    3 runs

Asgard:biofast anand$ hyperfine --min-runs 3 './fqcnt/fqcnt_R_BioStrings.r test.fq.gz'
Benchmark #1: ./fqcnt/fqcnt_R_BioStrings.r test.fq.gz
  Time (mean ± σ):     10.481 s ±  0.386 s    [User: 9.785 s, System: 0.416 s]
  Range (min … max):   10.056 s … 10.810 s    3 runs

As expected base-R code is the slowest. However I am curious why Biostrings struggles with gzip'd file? It is almost 3X slower compared to plain text fastq (and scales up with file size). This difference is quite small with base-R. I suspect it has something to do with the file being opened with a connection and processed in chunks?

I have tested this on a MaxOS Catalina (8GB RAM, 2.7 GHz Dual-Core Intel Core i5)

> sessionInfo(package="Biostrings")
R version 4.0.0 (2020-04-24)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
character(0)

other attached packages:
[1] Biostrings_2.56.0

loaded via a namespace (and not attached):
[1] compiler_4.0.0  graphics_4.0.0  tools_4.0.0     utils_4.0.0    
[5] grDevices_4.0.0 stats_4.0.0     datasets_4.0.0  methods_4.0.0  
[9] base_4.0.0

Any pointers on how to improve the code?

biostrings gz • 1.1k views

ADD COMMENT • link updated 4.9 years ago by Martin Morgan 25k • written 4.9 years ago by anand_mt ▴ 90

score 1 · Answer 1 · 2020-05-23

Not answering your question, but...

Since there's a fast implementation in portable C, and since there is nothing useful being done with the computation (when integrating with the Bioconductor package ecosystem would pay off in a big way) I'd just wrap that in fqc_R.c (less than 1/2 dozen lines of new code)

#include <zlib.h>
#include <stdio.h>
#include "kseq.h"
KSEQ_INIT(gzFile, gzread)

#include <Rinternals.h>

int do_fqc(const char *fname, int *result)
{
    gzFile fp;
    kseq_t *seq;
    int r, n = 0, slen = 0, qlen = 0;
    fp = gzopen(fname, "r");
    seq = kseq_init(fp);
    while ((r = kseq_read(seq)) >= 0)
            ++n, slen += seq->seq.l, qlen += seq->qual.l;
    if (r != -1) Rf_error("malformated FASTX");
    kseq_destroy(seq);
    gzclose(fp);

    result[0] = n, result[1] = slen, result[2] = qlen;
    return 0;
}

SEXP fqc(SEXP filename)
{
    SEXP result = PROTECT(Rf_allocVector(INTSXP, 3));

    do_fqc(Rf_translateChar(Rf_asChar(filename), INTEGER(result));

    UNPROTECT(1);
    return result;
}

Compile it

R CMD SHLIB fqc_R.c

and use it

> dyn.load("fqc_R")
 > system.time(result <- .Call("fqc", "../biofast-data-v1/M_abscessus_HiSeq.fq"))
   user  system elapsed
  3.009   0.196   3.206
> result
[1]   5682010 568201000 568201000

Probably I'd write some R code to make sure the file exists, etc., and to shield the user / myself from messing up the .Call().