I have a 50K feature x 2M cell SingleCellExperiment object backed by HDF5 (via HDF5Array::saveHDF5SummarizedExperiment). I am trying to summarize the counts to pseudobulk, but am having difficulties, and would like some guidance/tips for improving the efficiency of this operation.
Previous implementation (in-memory)
With sparseMatrix-based counts (i.e., the Matrix package), I usually do something like the following:
## counts from SingleCellExperiment
cts_feature_cell <- counts(sce)
## define 'cell' to 'group' mapping
M_cell_group <- t(Matrix::fac2sparse(sce$group))
## summarize to 'group'
cts_feature_group <- cts_feature_cell %*% M_cell_group
That is fast and memory efficient.
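To make the trick concrete: fac2sparse() builds a levels x cells indicator matrix, so its transpose maps cells to groups and the product sums counts within each group. A toy illustration (made-up factor, not my real data):

library(Matrix)
grp <- factor(c("A", "B", "A", "B"))
fac2sparse(grp)
## 2 x 4 sparse Matrix of class "dgCMatrix"
## A 1 . 1 .
## B . 1 . 1
## so t(fac2sparse(grp)) is cells x groups, and
## counts %*% t(fac2sparse(grp)) sums counts per group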
HDF5Array Counts Summarization
Translating the above to HDF5-backed counts seems straightforward, but I feel some of the details needed to optimize it are escaping me (e.g., the interaction of sparsity, chunk dimensions, and block size).
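(As a quick way to pin down two of those details once the object is loaded, the backend can be asked directly whether it hands sparse blocks to the block-processing machinery, and what the on-disk chunk geometry is; a minimal check, assuming a single HDF5 seed behind counts(sce):)

library(DelayedArray)
## does the seed advertise sparsity? (FALSE if saved without as.sparse=TRUE)
is_sparse(counts(sce))
## on-disk chunk geometry of the HDF5 dataset
chunkdim(seed(counts(sce)))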
For now, I have been working with code like:
library(Matrix)
library(DelayedArray)
library(HDF5Array)
library(SingleCellExperiment)
library(BiocParallel)
##############
## SETTINGS ##
##############
## work with 1GB blocks
setAutoBlockSize(1e9)
## use local SSD
setHDF5DumpDir("/scratch")
## not too worried for space
setHDF5DumpCompressionLevel(1L)
## enable parallel ops
setAutoBPPARAM(MulticoreParam(24L))
##################
## DATA PROCESS ##
##################
sce <- loadHDF5SummarizedExperiment(dir, prefix)
cts_feature_cell <- counts(sce)
M_cell_group <- DelayedArray(t(fac2sparse(sce$group)))
## this does not finish within 48 hours
cts_feature_group <- cts_feature_cell %*% M_cell_group
After some simple tests on subsets, this seemed about as good as I could get it. I set it off to run and verified that all 24 cores were running at 100%. Unfortunately, the job was killed after hitting a 48-hour runtime limit. I am now reassessing and wondering what else could be improved.
Should I lay out the HDF5 file differently? Currently, it uses chunkdim=c(N_FEATURE, 1e3).
Is there something inefficient about the matrix operation?
At this point, I feel like casting the counts back to sparseMatrix in chunks, then running the sparse matrix multiplication and summing the per-chunk results, will be faster. But isn't that what DelayedArray is already doing in some sense? A rough sketch of what I mean is below.
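Here is a minimal, untested-at-scale sketch of that manual approach, assuming the same sce and grouping as above: blockApply() walks column-wise blocks of the HDF5-backed counts, each block is coerced to a sparse matrix and multiplied against the matching rows of the mapping matrix, and the per-block partial products are summed.

library(DelayedArray)
library(Matrix)
M_cell_group <- t(fac2sparse(sce$group))   # cells x groups, sparse
grid <- colAutoGrid(counts(sce))           # full-height, column-wise blocks
block_sums <- blockApply(counts(sce), function(block, M) {
    vp <- currentViewport()                # columns covered by this block
    j <- seq(start(vp)[2L], end(vp)[2L])
    as(block, "CsparseMatrix") %*% M[j, , drop = FALSE]
}, M = M_cell_group, grid = grid)
## sum the per-block partial products into the final feature x group matrix
cts_feature_group <- Reduce(`+`, block_sums)

(Though, as noted, the DelayedArray %*% method already does block processing internally; this just makes the sparse coercion explicit.)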
Oh, that makes total sense! It might be several days before I get back to this, but I’ll eventually update with how things pan out.
Also, the references to existing implementations are much appreciated. Thanks for the insight!
Following back up. I reformatted to chunkdim=c(1e3L, 1e3L), and the summarization job with all else the same (24 cores, 1e9-byte blocks) took about 80 minutes of walltime with a max RAM usage of 52GB. I'm sure there's more room for tuning (e.g., cores had ~60% efficiency), but this is sufficient for now. Thanks again for the help!
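For anyone repeating this: the rewrite can be done with saveHDF5SummarizedExperiment(), which takes a chunkdim argument; something along these lines (the output directory is a placeholder, not my actual path):

library(HDF5Array)
sce <- loadHDF5SummarizedExperiment(dir, prefix)
## re-save the object, rewriting the on-disk counts with the new chunk geometry
saveHDF5SummarizedExperiment(sce, dir = "/scratch/sce_rechunked",
                             chunkdim = c(1e3L, 1e3L), replace = TRUE)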