Hi all,
I have been trying to use the framework from DelayedArray to compute some row-wise statistics from a relatively large HDF5Matrix. However, I am running into some efficiency issues in terms of computing time.
For example, below I am trying to apply a custom row-wise function to an HDF5Matrix mat of 100,000 rows and 10,000 columns. To illustrate performance, I am computing it on a subset of the array (the first 10,000 rows). In summary, it takes about 13 sec to compute the necessary statistics on a subset of 10% of the data (the ~26 sec reported below is because the example loops over the array twice, once per value of mu).
I have tried different configurations of blocking and chunking strategies, but the computing times end up in the same ballpark of 10-20s.
Are these figures somewhat expected from a computing-time perspective? Besides parallel computing, does anyone have suggestions for strategies that would help speed up the computations in the example below?
Thanks!
> mat
<100000 x 10000> matrix of class HDF5Matrix and type "integer":
          [,1] [,2] [,3] [,4]  ... [,9997] [,9998] [,9999] [,10000]
     [1,]    0    0    0    0    .       0       0       0        0
     [2,]    0    0    1    0    .       0       0       0        0
     [3,]    0    0    0    1    .       1       0       0        1
     [4,]    0    0    0    0    .       0       0       0        1
     [5,]    0    0    0    0    .       1       0       0        0
      ...    .    .    .    .    .       .       .       .        .
 [99996,]    0    0    0    0    .       1       0       1        0
 [99997,]    0    0    0    0    .       0       0       1        0
 [99998,]    0    0    0    0    .       1       0       0        0
 [99999,]    1    0    1    0    .       0       0       0        0
[100000,]    0    0    0    0    .       0       0       0        0
> seed(mat)
An object of class "HDF5ArraySeed"
Slot "filepath":
[1] "/home/simulation/mockdata/mockdata_consensus_counts.h5"
Slot "name":
[1] "/counts"
Slot "as_sparse":
[1] FALSE
Slot "type":
[1] NA
Slot "dim":
[1] 100000 10000
Slot "chunkdim":
[1] 3162 316
Slot "first_val":
[1] 0
>
> foo <- function(x,mu){
+ rowSums(dpois(x,lambda = mu, log = TRUE))
+ }
>
> blockFoo <- function(mu,x,grid,FUN = foo) {
+ unlist(blockApply(x,FUN,grid = grid,mu = mu,verbose = TRUE))
+ }
>
> blockCombine <- function(x,mu,...) {
+ grid <- rowAutoGrid(x,...)
+ res <- lapply(X = mu,FUN = blockFoo,x = x,grid = grid)
+ do.call(cbind,res)
+ }
>
> microbenchmark(output = blockCombine(x = mat[1:10000,],mu = c(1,3)),times = 1)
/ Reading and realizing block 1/4 ... OK
\ Processing it ... OK
/ Reading and realizing block 2/4 ... OK
\ Processing it ... OK
/ Reading and realizing block 3/4 ... OK
\ Processing it ... OK
/ Reading and realizing block 4/4 ... OK
\ Processing it ... OK
/ Reading and realizing block 1/4 ... OK
\ Processing it ... OK
/ Reading and realizing block 2/4 ... OK
\ Processing it ... OK
/ Reading and realizing block 3/4 ... OK
\ Processing it ... OK
/ Reading and realizing block 4/4 ... OK
\ Processing it ... OK
Unit: seconds
expr min lq mean median uq max neval
output 26.61289 26.61289 26.61289 26.61289 26.61289 26.61289 1
> microbenchmark(output = blockCombine(x = mat[1:10000,],mu = c(1,3),nrow = 5000),times = 1)
/ Reading and realizing block 1/2 ... OK
\ Processing it ... OK
/ Reading and realizing block 2/2 ... OK
\ Processing it ... OK
/ Reading and realizing block 1/2 ... OK
\ Processing it ... OK
/ Reading and realizing block 2/2 ... OK
\ Processing it ... OK
Unit: seconds
expr min lq mean median uq max neval
output 21.82795 21.82795 21.82795 21.82795 21.82795 21.82795 1
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /home/System/data/apps/R/R-4.1.0/lib64/R/lib/libRblas.so
LAPACK: /home/System/data/apps/R/R-4.1.0/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] microbenchmark_1.4-7 HDF5Array_1.20.0 rhdf5_2.34.0 DelayedArray_0.18.0 IRanges_2.26.0 S4Vectors_0.30.0
[7] MatrixGenerics_1.4.0 matrixStats_0.58.0 BiocGenerics_0.38.0 Matrix_1.3-3 devtools_2.4.1 usethis_2.0.1
loaded via a namespace (and not attached):
[1] compiler_4.1.0 prettyunits_1.1.1 rhdf5filters_1.4.0 remotes_2.3.0 tools_4.1.0 testthat_3.0.2 pkgbuild_1.2.0
[8] pkgload_1.2.1 memoise_2.0.0 lifecycle_1.0.0 lattice_0.20-44 rlang_0.4.11 cli_2.5.0 fastmap_1.1.0
[15] withr_2.4.2 desc_1.3.0 fs_1.5.0 rprojroot_2.0.2 grid_4.1.0 glue_1.4.2 R6_2.5.0
[22] processx_3.5.2 sessioninfo_1.1.1 callr_3.7.0 purrr_0.3.4 Rhdf5lib_1.12.1 magrittr_2.0.1 ps_1.6.0
[29] ellipsis_0.3.2 cachem_1.0.5 crayon_1.4.1
I had more of a think about this, and realised that the duplication of the "Reading and realizing block 1/4" messages implied that the blocks were being processed once for each value of mu. Reading from disk is the slow part here, so we want to minimise the number of times the blocks are "processed" and perform the calculation for all mu at the same time. One way to do this is to move the lapply() step inside the block processing function. Here's some code that does something like that.
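Something along these lines (the helper names blockFoo2() and blockCombine2() are just illustrative):

## Read each block from disk once, and loop over 'mu' in memory
## inside the block function instead of re-walking the grid per 'mu'.
blockFoo2 <- function(block, mu) {
    sapply(mu, function(m) rowSums(dpois(block, lambda = m, log = TRUE)))
}

blockCombine2 <- function(x, mu, ...) {
    grid <- rowAutoGrid(x, ...)
    res <- blockApply(x, blockFoo2, mu = mu, grid = grid, verbose = TRUE)
    do.call(rbind, res)  # one row per matrix row, one column per value of 'mu'
}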
It actually turns out to still be slower than my suggestion above, but if we align the block size with a multiple of the HDF5 chunk size we get slightly better performance.
Given the timings above, it looks like dpois() is a bottleneck. If the data are very sparse, perhaps there's a route to speeding it up, since I suspect it's spending a long time calculating dpois(0) many times.

Here's one possible strategy where we calculate the dpois() output for the range of values in the input matrix and then build a matrix from those (x2 in faster_dpois(); see the sketch below). This assumes x is +ve integers, but I think that has to be the case for dpois() anyway.

This gets the run time down to a third of the original approach when combined with the other strategies. However, I also think this may end up using more memory, so it may be self-defeating if you're trying to use on-disk arrays in a memory-limited environment.
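Something like the following, assuming x is a dense matrix of non-negative integer counts:

faster_dpois <- function(x, lambda, log = FALSE) {
    ## Compute dpois() once for every possible count value in 'x' ...
    d <- dpois(0:max(x), lambda = lambda, log = log)
    ## ... then rebuild a matrix ('x2') by using the counts as
    ## 1-based indices into the precomputed values.
    x2 <- matrix(d[x + 1L], nrow = nrow(x))
    x2
}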
Thanks Mike, great tips there! I liked the idea you used in faster_dpois().

Hi Mike, Pedro,
I don't have much to add to Mike's great answer. Thanks Mike!
Maybe 3 things worth mentioning:
1. dpois() and rowSums() can be used directly on a DelayedArray object:
For the particular situation where you just want to compute rowSums(dpois(mat_disk, lambda=3, log=TRUE)), you don't really need to implement your own block processing solution. You can do rowSums(dpois(mat_disk, lambda=3, log=TRUE)) directly on HDF5Array object mat_disk, and this will use block processing internally to compute the result:
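A minimal sketch, assuming mat_disk is an HDF5Matrix pointing at the dataset from the original post:

library(HDF5Array)
mat_disk <- HDF5Array("/home/simulation/mockdata/mockdata_consensus_counts.h5",
                      "/counts")
## dpois() is delayed and rowSums() triggers block processing:
res <- rowSums(dpois(mat_disk, lambda = 3, log = TRUE))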
Note that dpois() is implemented as a delayed operation. This means that when you call dpois() on a DelayedArray object, the operation is not performed right away. Instead, a new DelayedArray object is returned that carries the delayed dpois() operation.

2. Estimating the cost of IO:
It can be useful to get an idea of the cost of IO. This cost can vary a lot from one machine to another depending on the performance of the disk. An easy way to measure it is to call blockApply() on the HDF5Array object with a function that does nothing, as in the sketch below. This will walk the grid and load all the blocks sequentially but won't do anything with them.

So on my machine the cost of IO is 11.133 sec. Think of it as the minimal cost that will have to be paid no matter what we do on each block.
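For example (rowAutoGrid() is just one possible grid choice here):

## Walk the grid and load every block, but do nothing with them:
system.time(
    blockApply(mat_disk, function(block) NULL, grid = rowAutoGrid(mat_disk))
)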
Now if we process the blocks by applying a function that actually performs some operations on them, then we pay the additional cost of these operations:
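For example:

## Same walk, but this time compute the statistic on each block:
system.time(
    res1 <- blockApply(mat_disk,
                       function(block, mu)
                           rowSums(dpois(block, lambda = mu, log = TRUE)),
                       mu = 3, grid = rowAutoGrid(mat_disk))
)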
This confirms Mike's observation that much more time is spent on computing rowSums(dpois()) on each block than on loading the blocks from disk.

3. Using precomputed dpois() values:

Replacing dpois() with Mike's faster_dpois():
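A sketch of that swap, reusing the faster_dpois() sketched earlier:

system.time(
    res2 <- blockApply(mat_disk,
                       function(block, mu)
                           rowSums(faster_dpois(block, lambda = mu, log = TRUE)),
                       mu = 3, grid = rowAutoGrid(mat_disk))
)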
Note that this can be made even more efficient by computing in advance all the possible dpois() values for a given lambda. This requires that we know an upper limit for the count values in mat_disk:
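For example, assuming the H5 datatype caps the counts at 255 (i.e. 8-bit unsigned integers; the actual upper limit depends on the dataset):

upper <- 255L                                 # assumed upper limit
pd <- dpois(0:upper, lambda = 3, log = TRUE)  # all possible dpois() values
system.time(
    res3 <- blockApply(mat_disk,
                       function(block, pd)
                           rowSums(matrix(pd[block + 1L], nrow = nrow(block))),
                       pd = pd, grid = rowAutoGrid(mat_disk))
)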
Cheers,
H.
FWIW here is the code I used to generate the mock data:
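Something along these lines (the dimensions, chunk geometry, and dataset name match the seed() output above; the lambda and seed are assumptions):

library(HDF5Array)

set.seed(42)
## Write the counts block by block so the full matrix never has to
## sit in memory at once.
sink <- HDF5RealizationSink(dim = c(100000L, 10000L), type = "integer",
                            filepath = "mockdata_consensus_counts.h5",
                            name = "/counts", chunkdim = c(3162L, 316L))
grid <- RegularArrayGrid(c(100000L, 10000L), spacings = c(10000L, 10000L))
for (bid in seq_along(grid)) {
    vp <- grid[[bid]]
    block <- array(rpois(prod(dim(vp)), lambda = 0.1), dim = dim(vp))
    sink <- write_block(sink, vp, block)
}
close(sink)
mat <- HDF5Array("mockdata_consensus_counts.h5", "/counts")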
This is neat. I hadn't thought of using the datatype of the HDF5 dataset to set an upper limit. I actually wrote a further version that precomputed dpois() based on the max value of mat, but in practice it was slower to iterate once through the whole matrix to find the maximum and then again to compute rowSums() than to do everything on a per-block basis. I suppose your approach might be considerably different if the datatype were 32-bit integers.

If the H5 datatype of the count values in mat_disk were 32-bit integers and we didn't have an upper limit for these values, then I guess I would use your faster_dpois().

FWIW, a few months ago I started to work on fast implementations of basic summarization functions like
max(), min(), sum(), etc. for H5 datasets. This is still work in progress and the code is not fully optimized yet. When it's ready, it will be used behind the scenes when the user calls things like max() on an HDF5Array object or a subset of it.

Being able to quickly compute the max() will make this kind of 2-pass approach (1st pass to get the max count, 2nd pass to compute rowSums(dpois())) more appealing.