I am processing ONT sequence files with very long reads (in the order of several 100 kbps) and I encounter an error during writeQualityScaledXStringSet()
:
library(Biostrings)
# Manually create a simple FASTQ file
nrep <- 20001
seqs <- paste0(rep("AGCTGATCGG", nrep), collapse="")
quals <- paste0(rep("ABCDEFGHIJ", nrep), collapse="")
descs <- sprintf("@S%01d_%07d", seq_along(seqs), nchar(seqs))
fastq.txt <- c(descs, seqs, "+", quals)
fname <- "example.fastq"
writeLines(fastq.txt, fname)
# Read and write the simple FASTQ file
qs.seqs <- readQualityScaledDNAStringSet(fname)
writeQualityScaledXStringSet(qs.seqs, fname)
# Error in .Call2("write_XStringSet_to_fastq", x, filexp_list, qualities, :
# XStringSet object (or derivative) to write 'x' cannot contain strings
# longer than 200000 ('x[[1]]' has 200010 characters)
This seems inconsistent as qs.seqs
contains both sequence and quality information in full length.
Is there a reason for this restriction? It would be helpful if this limitation were lifted.
Furthermore a warning is issued, which results in an incomplete FASTQ file left on disk:
# Warning messages:
# 1: In file.remove(expath) :
# cannot remove file 'example.fastq', reason 'Permission denied'
# 2: In writeXStringSet(x, filepath, append, compress, compression_level, :
# cannot remove file 'example.fastq'
Environent:
R version 4.4.1 Patched (2024-09-30 r87215 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)
locale:
[1] LC_COLLATE=English_Austria.utf8 LC_CTYPE=English_Austria.utf8 LC_MONETARY=English_Austria.utf8
[4] LC_NUMERIC=C LC_TIME=English_Austria.utf8
time zone: Europe/Vienna
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.72.1 GenomeInfoDb_1.40.1 XVector_0.44.0 IRanges_2.38.1 S4Vectors_0.42.1
[6] BiocGenerics_0.50.0
loaded via a namespace (and not attached):
[1] httr_1.4.7 zlibbioc_1.50.0 compiler_4.4.1 R6_2.5.1
[5] tools_4.4.1 GenomeInfoDbData_1.2.12 crayon_1.5.3 UCSC.utils_1.0.0
[9] jsonlite_1.8.9
Quick update: this is fixed in a branch awaiting PR to Biostrings (https://github.com/Bioconductor/Biostrings/pull/122). Should be in
devel
soonish, and will definitely be in the next release. Sorry for the slow turnaround, and thanks again for bringing this to our attention!