Biostrings: Error writing long reads (>200 kbps) with writeQualityScaledXStringSet()
1
0
Entering edit mode
@gerhard-thallinger-1552
Last seen 11 weeks ago
Austria

I am processing ONT sequence files with very long reads (in the order of several 100 kbps) and I encounter an error during writeQualityScaledXStringSet():

library(Biostrings)

# Manually create a simple FASTQ file
nrep <- 20001
seqs <- paste0(rep("AGCTGATCGG", nrep), collapse="")
quals <- paste0(rep("ABCDEFGHIJ", nrep), collapse="")
descs <- sprintf("@S%01d_%07d", seq_along(seqs), nchar(seqs))
fastq.txt <- c(descs, seqs, "+", quals)
fname <- "example.fastq"
writeLines(fastq.txt, fname)

# Read and write the simple FASTQ file
qs.seqs <- readQualityScaledDNAStringSet(fname)
writeQualityScaledXStringSet(qs.seqs, fname)

# Error in .Call2("write_XStringSet_to_fastq", x, filexp_list, qualities,  : 
#   XStringSet object (or derivative) to write 'x' cannot contain strings
#   longer than 200000 ('x[[1]]' has 200010 characters)

This seems inconsistent as qs.seqs contains both sequence and quality information in full length.
Is there a reason for this restriction? It would be helpful if this limitation were lifted.

Furthermore a warning is issued, which results in an incomplete FASTQ file left on disk:

# Warning messages:
# 1: In file.remove(expath) :
#   cannot remove file 'example.fastq', reason 'Permission denied'
# 2: In writeXStringSet(x, filepath, append, compress, compression_level,  :
#   cannot remove file 'example.fastq'

Environent:

R version 4.4.1 Patched (2024-09-30 r87215 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

locale:
[1] LC_COLLATE=English_Austria.utf8  LC_CTYPE=English_Austria.utf8    LC_MONETARY=English_Austria.utf8
[4] LC_NUMERIC=C                     LC_TIME=English_Austria.utf8    

time zone: Europe/Vienna
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biostrings_2.72.1   GenomeInfoDb_1.40.1 XVector_0.44.0      IRanges_2.38.1      S4Vectors_0.42.1   
[6] BiocGenerics_0.50.0

loaded via a namespace (and not attached):
[1] httr_1.4.7              zlibbioc_1.50.0         compiler_4.4.1          R6_2.5.1               
[5] tools_4.4.1             GenomeInfoDbData_1.2.12 crayon_1.5.3            UCSC.utils_1.0.0       
[9] jsonlite_1.8.9
ONTdata Biostrings • 448 views
ADD COMMENT
1
Entering edit mode
Aidan ▴ 60
@3efa9cc7
Last seen 14 days ago
United States

As far as I am aware, the restriction is historical..having a fixed buffer size in the internal C methods makes the algorithms a little simpler, and we were not expecting people to be writing sequences this big when writeXStringSet was first written like 15 years ago.

I am aware of this issue; that is one of the topics scheduled to be worked on in our ongoing Biostrings grant.

I do definitely agree that this restriction should be lifted, but it is a relatively major change to a core function in the codebase, so it might take some time.

Thank you for the reproducible example! I will open an issue on GitHub later with these details, and I am expecting to start working on this around November (after the upcoming release).

ADD COMMENT
0
Entering edit mode

Quick update: this is fixed in a branch awaiting PR to Biostrings (https://github.com/Bioconductor/Biostrings/pull/122). Should be in devel soonish, and will definitely be in the next release. Sorry for the slow turnaround, and thanks again for bringing this to our attention!

ADD REPLY

Login before adding your answer.

Traffic: 454 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6