Block processing matrices of different dimensions with DelayedArray
1
0
Entering edit mode
@plbaldoni
Last seen 4 days ago
Pittsburgh, United States

Hi all,

I am looking for a way to use HDF5Array and DelayedArray to operate together two large matrices of different dimensions (A and B below) that share equal number of rows and row-chunk geometry.

My immediate solution to this problem was to add zeros to the smaller matrix B (see B0 below), and then operate them together as a 3d array.

Is there a more efficient solution to this problem? I was thinking about something along the lines of a row-chunk iterator that would allow me to operate the two matrices without having to create the array x below.

Thanks!

> library(HDF5Array)
> 
> col_N <- 1e3
> col_n <- 3
> row_N <- 5e4
> 
> A <- matrix(runif(col_N*row_N), ncol = col_N, nrow = row_N)
> B <- matrix(runif(3*row_N,-1,1), ncol = col_n, nrow = row_N)
> 
> B0 <- cbind(B,matrix(0, ncol = col_N-col_n, nrow = row_N))
> 
> x <- writeHDF5Array(x = array(c(A,B0),dim = c(row_N,col_N,2)),
+                     chunkdim = c(5,col_N,2))
> 
> x
<50000 x 1000 x 2> array of class HDF5Array and type "double":
,,1
              [,1]      [,2]      [,3] ...    [,999]   [,1000]
    [1,] 0.9366992 0.3885370 0.9432474   . 0.8581199 0.2343847
    [2,] 0.9540822 0.5233401 0.9275795   . 0.8206745 0.1228269
     ...         .         .         .   .         .         .
[49999,] 0.2729992 0.2575988 0.2134806   . 0.1227398 0.9900415
[50000,] 0.9385657 0.9544607 0.6547487   . 0.5587343 0.4466597

,,2
                [,1]        [,2]        [,3] ...  [,999] [,1000]
    [1,]  -0.6489840   0.2446995   0.9654620   .       0       0
    [2,]  -0.2742762  -0.6363328   0.4642650   .       0       0
     ...           .           .           .   .       .       .
[49999,]  0.00637822  0.06429913 -0.90916511   .       0       0
[50000,] -0.45653045  0.70055948 -0.71097379   .       0       0

> 
> foo <- function(x) {
+   y <- (abs(x[,,1])^0.25 - 1)/0.25
+   
+   out <- lapply(seq_len(col_n),function(z){
+     sum(t(y) %*% x[,z,2])
+   })
+   
+   unlist(out)
+ }
> 
> blockFoo <- function(x) {
+   grid <- defaultAutoGrid(x)
+   block_foo <- blockApply(x, foo, grid = grid)
+   do.call(rbind,block_foo)
+ }
> 
> output <- blockFoo(x)
> 
> output
           [,1]       [,2]       [,3]
[1,]  42437.943 -21488.400  15951.636
[2,]   4590.788  58686.650 -52564.467
[3,]  35634.653 -25044.865  33538.275
[4,] -14556.621  -7956.171  12180.909
[5,] -12395.838 -16234.261  -3642.189
[6,]  49027.775  -3704.667  31522.995
[7,]  51266.787   3508.757 -25403.900
[8,]  24198.451  50484.589 -14383.886
> 
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /home/data/apps/R/R-4.1.0/lib64/R/lib/libRblas.so
LAPACK: /home/data/apps/R/R-4.1.0/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] HDF5Array_1.20.0     rhdf5_2.36.0         DelayedArray_0.18.0  IRanges_2.26.0       S4Vectors_0.30.0     MatrixGenerics_1.4.0
 [7] matrixStats_0.59.0   BiocGenerics_0.38.0  Matrix_1.3-4         devtools_2.4.1       usethis_2.0.1       

loaded via a namespace (and not attached):
 [1] compiler_4.1.0     rhdf5filters_1.4.0 prettyunits_1.1.1  remotes_2.4.0      tools_4.1.0        testthat_3.0.2     pkgbuild_1.2.0    
 [8] pkgload_1.2.1      memoise_2.0.0      lifecycle_1.0.0    lattice_0.20-44    rlang_0.4.11       cli_2.5.0          fastmap_1.1.0     
[15] withr_2.4.2        desc_1.3.0         fs_1.5.0           rprojroot_2.0.2    grid_4.1.0         glue_1.4.2         R6_2.5.0          
[22] processx_3.5.2     sessioninfo_1.1.1  callr_3.7.0        purrr_0.3.4        Rhdf5lib_1.14.0    magrittr_2.0.1     ps_1.6.0          
[29] ellipsis_0.3.2     cachem_1.0.5       crayon_1.4.1
HDF5Array DelayedArray • 1.6k views
ADD COMMENT
0
Entering edit mode

I don't know the answer to the question, but I'm wondering if this example code is doing the correct thing. My main concern is that your output has 8 rows. However, this is only due to the fact that defaultAutoGrid(x) returns a grid with 8 blocks. If you defined a different grid, or the dimensions of x were different, you'd get a different number of rows in the output. Is that really what you want?

ADD REPLY
1
Entering edit mode
@herve-pages-1542
Last seen 3 days ago
Seattle, WA, United States

I agree with Mike that what you are trying to do with your 2 matrices is somewhat obscure. It also seems unnecessarily complicated. For example I don't understand why you need to store the 2 matrices in a single H5 dataset. Why not keep them separated? Furthermore, it seems to me that you don't even need to walk on the blocks of the big matrix at all. You only seem to be walking on the columns of the small matrix.

But the most intriguing part is:

out <- lapply(seq_len(col_n),function(z){
    sum(t(y) %*% x[,z,2])
})

which means that you're going to repeat the costly matrix multiplication t(y) %*% x[,z,2] for each column of the small matrix.

I wonder if you could not just multiply t(y) with the full small matrix (i.e. t(y) %*% x[,,2]) and call colSums() on the result. Let's say your big and small matrices are x1 and x2, respectively, this would look something like:

y <- (abs(x1)^0.25 - 1) / 0.25  # everything should be delayed
colSums(t(y) %*% x2)  # will use block processing twice, 1st time in %*%, 2nd time in colSums()

I didn't try this so maybe I'm missing something. Is this something that you've tried? Did you run into problems?

H.

ADD COMMENT
0
Entering edit mode

Thanks Mike and Hervé. I should have been clearer in what I was trying to achieve and chosen a better exemple.

As Mike pointed out, the output of my blockFoo is missing a colSums() call to summarize the block processing. It does not matter how the chunks are defined as long as the same chunk of rows are selected from both matrices A and B.

I also realized that I was overcomplicating things, as you noted in your response, Hervé. In the dummy example foo that I created on the spot for the post, everything is delayed and your suggestion does work (thanks for that).

So, that is to say that my original intention was just to know if there was a way to simultaneously bring into memory row-chunks from two separate matrices that have the same number of rows, but different number of columns, within blockApply.

ADD REPLY
2
Entering edit mode

Yes, generally speaking, it should be possible to walk simultaneously on 2 or more DelayedArray objects with different dimensions, but not with blockApply().

Two important things about this:

1. You need to be able to define 2 grids, one for each object, and the 2 grids must be compatible:

You're going to walk on the 2 grids simultaneously, and at each step, you're going to load the block data from each object. For example, with your original example where you have 2 matrices, a big one A and a small one B, that share equal number of rows and row-chunk geometry, one way to define the 2 grids is:

A <- matrix(runif(5e7), nrow=5e4, ncol=1e3)
B <- matrix(runif(15e4), nrow=5e4, ncol=3)

A_grid <- rowAutoGrid(A, nrow=2000)
B_grid <- rowAutoGrid(B, nrow=2000)

The 2 grids are compatible i.e. they have the same nb of blocks:

length(A_grid)
# [1] 25
length(B_grid)
# [1] 25

and the i-th block in the 1st grid is aligned with the i-th block in the 2nd grid:

A_grid[[4L]]
# 2000 x 1000 ArrayViewport object on a 50000 x 1000 array: [6001-8000, ]
B_grid[[4L]]
# 2000 x 3 ArrayViewport object on a 50000 x 3 array: [6001-8000, ]

2. Use a for loop to simultaneously walk on the 2 matrices, instead of blockApply():

Pseudo-code:

nblock <- length(A_grid)  # same as length(B_grid)

for (bid in seq_len(nblock)) {
    ## Read block from 'A'.
    A_viewport <- A_grid[[bid]]
    A_block <- read_block(A, A_viewport)

    ## Read block from 'B'.
    B_viewport <- B_grid[[bid]]
    B_block <- read_block(B, B_viewport)

    ## Do something with A_block and B_block.
    ...
}

This only gives you an idea of the general pattern. Some important details are missing e.g. what to do with the results produced by the "Do something with A_block and B_block" step.

If these results are small enough so that you can keep them all in memory, then you could just store them in a pre-allocated list (note that this is basically what things like lapply(), mapply(), and blockApply() do):

nblock <- length(A_grid)  # same as length(B_grid)

results <- vector("list", nblock)
for (bid in seq_len(nblock)) {
    ...

    ## Do something with A_block and B_block.
    result <- ...

    ## Store the result in `results`.
    results[[bid]] <- result
}

With the above, you will end up with a list that has one list element per block in the grids, like with blockApply(). The final step will be to combine them and how to do this exactly depends on each situation.

But if the results are too big to be kept in memory, you should probably write them to disk as you go. You can use a RealizationSink object for this. See EXAMPLE 2 in ?write_block for an example of how to do this.

Hope this helps,

H.

ADD REPLY
0
Entering edit mode

Great, thanks a lot Hervé.

ADD REPLY

Login before adding your answer.

Traffic: 612 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6