I am observing some odd behavior when trying to read paired FASTQ files in parallel. First, the expected behavior without parallel execution (file1 and file2 each contain exactly 1000 reads):
> fqStream <- FastqStreamerList(c(Fw=file1,Rv=file2),n=1000)
> fqChunk1 <- lapply(fqStream,yield)
> fqChunk1
$Fw
class: ShortReadQ
length: 1000 reads; width: 150 cycles
$Rv
class: ShortReadQ
length: 1000 reads; width: 150 cycles
> fqChunk2 <- lapply(fqStream,yield)
> fqChunk2
$Fw
class: ShortReadQ
length: 0 reads; width: cycles
$Rv
class: ShortReadQ
length: 0 reads; width: cycles
Now with bplapply instead of lapply:
> fqStream <- FastqStreamerList(c(Fw=file1,Rv=file2),n=1000)
> fqChunk3 <- bplapply(fqStream,yield)
> fqChunk3
$Fw
class: ShortReadQ
length: 1000 reads; width: 150 cycles
$Rv
class: ShortReadQ
length: 1000 reads; width: 150 cycles
> fqChunk4 <- bplapply(fqStream,yield)
> fqChunk4
$Fw
class: ShortReadQ
length: 146 reads; width: 150 cycles
$Rv
class: ShortReadQ
length: 137 reads; width: 150 cycles
As it turns out, those 146 and 137 reads are simply the first reads of each file again, and this can apparently be repeated indefinitely without ever yielding 0 reads.
> identical(list(Fw=compact(fqChunk1$Fw[1:146]),Rv=compact(fqChunk1$Rv[1:137])),compact(fqChunk4))
[1] TRUE
> fqChunk5 <- bplapply(fqStream,yield)
> fqChunk5
$Fw
class: ShortReadQ
length: 146 reads; width: 150 cycles
$Rv
class: ShortReadQ
length: 137 reads; width: 150 cycles
> identical(compact(fqChunk4),compact(fqChunk5))
[1] TRUE
Reading from the same stream with lapply now also yields the same chunk over and over again, but with a warning:
> fqChunk6 <- lapply(fqStream,yield)
Warning messages:
1: IncompleteFinalRecord
FastqStreamer yield() incomplete final record:
test/Pool_517_1000_1.fq.gz
2: IncompleteFinalRecord
FastqStreamer yield() incomplete final record:
test/Pool_517_1000_2.fq.gz
What is going on here?
> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.10.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_AT.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=de_AT.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=de_AT.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] yaml_2.2.1 txtplot_1.0-4 crayon_1.4.0
[4] data.table_1.13.6 ShortRead_1.48.0 GenomicAlignments_1.26.0
[7] SummarizedExperiment_1.20.0 Biobase_2.50.0 MatrixGenerics_1.2.1
[10] matrixStats_0.58.0 Rsamtools_2.6.0 GenomicRanges_1.42.0
[13] GenomeInfoDb_1.26.2 Biostrings_2.58.0 XVector_0.30.0
[16] IRanges_2.24.1 S4Vectors_0.28.1 BiocParallel_1.24.1
[19] BiocGenerics_0.36.0 docopt_0.7.1
loaded via a namespace (and not attached):
[1] rstudioapi_0.13 zlibbioc_1.36.0 lattice_0.20-41 jpeg_0.1-8.1
[5] hwriter_1.3.2 tools_4.0.4 grid_4.0.4 png_0.1-7
[9] latticeExtra_0.6-29 Matrix_1.3-2 GenomeInfoDbData_1.2.4 RColorBrewer_1.1-2
[13] codetools_0.2-17 bitops_1.0-6 RCurl_1.98-1.2 DelayedArray_0.16.1
[17] compiler_4.0.4
Hm. But how could this be implemented so that corresponding chunks are read from the two files in parallel and then processed together? Obviously the workers have to return for the processing step, at which point the streamer state is lost before the next chunk can be read. Is there no hidden way to let yield() know the offset at which it should resume, or something like that?
Also, some information does seem to come back from the workers, since subsequent calls do not return the same result as the first one. Does that mean there would be a way to update the object in the manager correctly, or should this be considered a bug, with the actually expected behavior under these conditions being to always return just the first chunk?
I guess you are processing two FASTQ files from each of several samples, so the strategy would be to process each sample in a separate worker. The partial update of the object in the manager process is really an artifact; there is no way to correctly retrieve the results of the partial update from the workers.
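In case it helps, a minimal sketch of that per-sample strategy (the file names and processChunk() are hypothetical placeholders for the real data and work; each worker creates and drains its own streamers, so no streamer state ever has to travel back to the manager):

library(ShortRead)
library(BiocParallel)

## Hypothetical per-sample pairs of FASTQ files
samples <- list(
    s1 = c(Fw = "sample1_1.fq.gz", Rv = "sample1_2.fq.gz"),
    s2 = c(Fw = "sample2_1.fq.gz", Rv = "sample2_2.fq.gz")
)

## Placeholder for whatever is actually done with a pair of chunks
processChunk <- function(fw, rv) c(Fw = length(fw), Rv = length(rv))

perSample <- function(files) {
    ## each worker owns its own streamers for the lifetime of the sample
    fw <- FastqStreamer(files[["Fw"]], n = 1e6)
    rv <- FastqStreamer(files[["Rv"]], n = 1e6)
    on.exit({ close(fw); close(rv) })
    res <- list()
    repeat {
        fwChunk <- yield(fw)
        rvChunk <- yield(rv)
        if (length(fwChunk) == 0L)
            break
        res[[length(res) + 1L]] <- processChunk(fwChunk, rvChunk)
    }
    res
}

results <- bplapply(samples, perSample, BPPARAM = MulticoreParam(workers = 2))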
I was thinking specifically of the forward and reverse reads of read pairs, which need to be processed together. Since they happen to arrive in two corresponding FASTQ files, I was wondering whether reading from the two files could be sped up by doing it in parallel; I think it would be worth it because of the implied decompression step. But alas, it looks like that is a no-go. Many thanks anyway for the explanation.
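For the record, a minimal sketch of the serial lockstep alternative (using the files from above; processPairs() is a hypothetical placeholder for the actual per-chunk work). Both mates are read in the manager, which is the only place where the streamer state is updated correctly:

library(ShortRead)

file1 <- "test/Pool_517_1000_1.fq.gz"
file2 <- "test/Pool_517_1000_2.fq.gz"

## Placeholder for the real per-chunk processing
processPairs <- function(fw, rv) c(Fw = length(fw), Rv = length(rv))

fqStream <- FastqStreamerList(c(Fw = file1, Rv = file2), n = 1000)
results <- list()
repeat {
    ## serial yield() keeps the forward and reverse chunks in sync
    chunk <- lapply(fqStream, yield)
    if (length(chunk$Fw) == 0L)
        break
    results[[length(results) + 1L]] <- processPairs(chunk$Fw, chunk$Rv)
}
invisible(lapply(fqStream, close))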