Question

Biostrings: Error writing long reads (>200 kbps) with writeQualityScaledXStringSet()

0

Entering edit mode

Gerhard Thallinger ▴ 180

@gerhard-thallinger-1552

Last seen 6 months ago

Austria

I am processing ONT sequence files with very long reads (in the order of several 100 kbps) and I encounter an error during writeQualityScaledXStringSet():

library(Biostrings)

# Manually create a simple FASTQ file
nrep <- 20001
seqs <- paste0(rep("AGCTGATCGG", nrep), collapse="")
quals <- paste0(rep("ABCDEFGHIJ", nrep), collapse="")
descs <- sprintf("@S%01d_%07d", seq_along(seqs), nchar(seqs))
fastq.txt <- c(descs, seqs, "+", quals)
fname <- "example.fastq"
writeLines(fastq.txt, fname)

# Read and write the simple FASTQ file
qs.seqs <- readQualityScaledDNAStringSet(fname)
writeQualityScaledXStringSet(qs.seqs, fname)

# Error in .Call2("write_XStringSet_to_fastq", x, filexp_list, qualities,  : 
#   XStringSet object (or derivative) to write 'x' cannot contain strings
#   longer than 200000 ('x[[1]]' has 200010 characters)

This seems inconsistent as qs.seqs contains both sequence and quality information in full length.
Is there a reason for this restriction? It would be helpful if this limitation were lifted.

Furthermore a warning is issued, which results in an incomplete FASTQ file left on disk:

# Warning messages:
# 1: In file.remove(expath) :
#   cannot remove file 'example.fastq', reason 'Permission denied'
# 2: In writeXStringSet(x, filepath, append, compress, compression_level,  :
#   cannot remove file 'example.fastq'

Environent:

R version 4.4.1 Patched (2024-09-30 r87215 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

locale:
[1] LC_COLLATE=English_Austria.utf8  LC_CTYPE=English_Austria.utf8    LC_MONETARY=English_Austria.utf8
[4] LC_NUMERIC=C                     LC_TIME=English_Austria.utf8    

time zone: Europe/Vienna
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biostrings_2.72.1   GenomeInfoDb_1.40.1 XVector_0.44.0      IRanges_2.38.1      S4Vectors_0.42.1   
[6] BiocGenerics_0.50.0

loaded via a namespace (and not attached):
[1] httr_1.4.7              zlibbioc_1.50.0         compiler_4.4.1          R6_2.5.1               
[5] tools_4.4.1             GenomeInfoDbData_1.2.12 crayon_1.5.3            UCSC.utils_1.0.0       
[9] jsonlite_1.8.9

ONTdata Biostrings • 581 views

ADD COMMENT • link updated 4 months ago by Aidan ▴ 60 • written 6 months ago by Gerhard Thallinger ▴ 180

score 1 · Answer 1 · 2024-10-13

As far as I am aware, the restriction is historical..having a fixed buffer size in the internal C methods makes the algorithms a little simpler, and we were not expecting people to be writing sequences this big when writeXStringSet was first written like 15 years ago.

I am aware of this issue; that is one of the topics scheduled to be worked on in our ongoing Biostrings grant.

I do definitely agree that this restriction should be lifted, but it is a relatively major change to a core function in the codebase, so it might take some time.

Thank you for the reproducible example! I will open an issue on GitHub later with these details, and I am expecting to start working on this around November (after the upcoming release).