I am observing some odd behavior when trying to read paired FASTQ files in parallel. First, the expected behavior without parallel execution (file1 and file2 each contain exactly 1000 reads):
> fqStream <- FastqStreamerList(c(Fw=file1,Rv=file2),n=1000)
> fqChunk1 <- lapply(fqStream,yield)
> fqChunk1
$Fw
class: ShortReadQ
length: 1000 reads; width: 150 cycles
$Rv
class: ShortReadQ
length: 1000 reads; width: 150 cycles
> fqChunk2 <- lapply(fqStream,yield)
> fqChunk2
$Fw
class: ShortReadQ
length: 0 reads; width: cycles
$Rv
class: ShortReadQ
length: 0 reads; width: cycles
Now with bplapply instead of lapply:
> fqStream <- FastqStreamerList(c(Fw=file1,Rv=file2),n=1000)
> fqChunk3 <- bplapply(fqStream,yield)
> fqChunk3
$Fw
class: ShortReadQ
length: 1000 reads; width: 150 cycles
$Rv
class: ShortReadQ
length: 1000 reads; width: 150 cycles
> fqChunk4 <- bplapply(fqStream,yield)
> fqChunk4
$Fw
class: ShortReadQ
length: 146 reads; width: 150 cycles
$Rv
class: ShortReadQ
length: 137 reads; width: 150 cycles
As it turns out, those 146 and 137 reads are simply the first reads of each file again, and this can apparently be repeated indefinitely without ever yielding 0 reads.
> identical(list(Fw=compact(fqChunk1$Fw[1:146]),Rv=compact(fqChunk1$Rv[1:137])),compact(fqChunk4))
[1] TRUE
> fqChunk5 <- bplapply(fqStream,yield)
> fqChunk5
$Fw
class: ShortReadQ
length: 146 reads; width: 150 cycles
$Rv
class: ShortReadQ
length: 137 reads; width: 150 cycles
> identical(compact(fqChunk4),compact(fqChunk5))
[1] TRUE
Reading from the same stream with lapply now also yields the same chunk over and over again, but with a warning:
> fqChunk6 <- lapply(fqStream,yield)
Warning messages:
1: IncompleteFinalRecord
FastqStreamer yield() incomplete final record:
test/Pool_517_1000_1.fq.gz
2: IncompleteFinalRecord
FastqStreamer yield() incomplete final record:
test/Pool_517_1000_2.fq.gz
What is going on here?
> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.10.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_AT.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=de_AT.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=de_AT.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] yaml_2.2.1 txtplot_1.0-4 crayon_1.4.0
[4] data.table_1.13.6 ShortRead_1.48.0 GenomicAlignments_1.26.0
[7] SummarizedExperiment_1.20.0 Biobase_2.50.0 MatrixGenerics_1.2.1
[10] matrixStats_0.58.0 Rsamtools_2.6.0 GenomicRanges_1.42.0
[13] GenomeInfoDb_1.26.2 Biostrings_2.58.0 XVector_0.30.0
[16] IRanges_2.24.1 S4Vectors_0.28.1 BiocParallel_1.24.1
[19] BiocGenerics_0.36.0 docopt_0.7.1
loaded via a namespace (and not attached):
[1] rstudioapi_0.13 zlibbioc_1.36.0 lattice_0.20-41 jpeg_0.1-8.1
[5] hwriter_1.3.2 tools_4.0.4 grid_4.0.4 png_0.1-7
[9] latticeExtra_0.6-29 Matrix_1.3-2 GenomeInfoDbData_1.2.4 RColorBrewer_1.1-2
[13] codetools_0.2-17 bitops_1.0-6 RCurl_1.98-1.2 DelayedArray_0.16.1
[17] compiler_4.0.4
Hm. But how could this be implemented so that corresponding chunks are read from the two files in parallel and then processed together? Obviously the workers have to return for the processing step, at which point the streamer state is lost before the next chunk can be read. Is there no hidden way to let yield() know the offset at which it should resume, or something like that?
Also, some information does seem to come back from the workers, since subsequent calls do not return the same result as the first one. Does that mean there would be a way to update the object in the manager correctly, or should this be considered a bug, with the actually expected behavior under these conditions being to always return just the first chunk?
I guess you are processing two FASTQ files from each of several samples, so the strategy would be to process each sample in a separate worker. The partial update of the object in the manager process is really an artifact; there is no way to correctly retrieve the results of the partial update from the workers.
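In case it helps, a minimal sketch of that per-sample strategy (the file names and processChunk() are hypothetical placeholders for the real data and work; each worker creates and drains its own streamers, so no streamer state ever has to travel back to the manager):

library(ShortRead)
library(BiocParallel)

## Hypothetical per-sample pairs of FASTQ files
samples <- list(
    s1 = c(Fw = "sample1_1.fq.gz", Rv = "sample1_2.fq.gz"),
    s2 = c(Fw = "sample2_1.fq.gz", Rv = "sample2_2.fq.gz")
)

## Placeholder for whatever is actually done with a pair of chunks
processChunk <- function(fw, rv) c(Fw = length(fw), Rv = length(rv))

perSample <- function(files) {
    ## each worker owns its own streamers for the lifetime of the sample
    fw <- FastqStreamer(files[["Fw"]], n = 1e6)
    rv <- FastqStreamer(files[["Rv"]], n = 1e6)
    on.exit({ close(fw); close(rv) })
    res <- list()
    repeat {
        fwChunk <- yield(fw)
        rvChunk <- yield(rv)
        if (length(fwChunk) == 0L)
            break
        res[[length(res) + 1L]] <- processChunk(fwChunk, rvChunk)
    }
    res
}

results <- bplapply(samples, perSample, BPPARAM = MulticoreParam(workers = 2))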
I was thinking specifically of the forward and reverse reads of read pairs, which need to be processed together. Since they happen to arrive in two corresponding FASTQ files, I was wondering whether reading from the two files could be sped up by doing it in parallel; I think it would be worth it because of the implied decompression step. But alas, it looks like that is a no-go. Many thanks anyway for the explanation.
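For the record, a minimal sketch of the serial lockstep alternative (using the files from above; processPairs() is a hypothetical placeholder for the actual per-chunk work). Both mates are read in the manager, which is the only place where the streamer state is updated correctly:

library(ShortRead)

file1 <- "test/Pool_517_1000_1.fq.gz"
file2 <- "test/Pool_517_1000_2.fq.gz"

## Placeholder for the real per-chunk processing
processPairs <- function(fw, rv) c(Fw = length(fw), Rv = length(rv))

fqStream <- FastqStreamerList(c(Fw = file1, Rv = file2), n = 1000)
results <- list()
repeat {
    ## serial yield() keeps the forward and reverse chunks in sync
    chunk <- lapply(fqStream, yield)
    if (length(chunk$Fw) == 0L)
        break
    results[[length(results) + 1L]] <- processPairs(chunk$Fw, chunk$Rv)
}
invisible(lapply(fqStream, close))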