I have multiple MTX files output from a single-cell kallisto | bustools pipeline that I would like to filter and combine into a single SingleCellExperiment. Previously, I have done this by reading each file with Matrix::readMM and then converting it to a sparse matrix format (CsparseMatrix). I've started hitting scalability issues with this approach and would like to switch to a DelayedArray implementation (like HDF5Array) for the counts assay.
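For reference, the previous in-memory approach looked roughly like the sketch below. The temporary MTX file is a stand-in I generate here so the snippet runs on its own; `mtx_file` is a hypothetical path, not one of my real pipeline outputs.

```r
library(Matrix)

## write a small random sparse matrix to a temporary MTX file
## (stand-in for a real kallisto | bustools output; path is hypothetical)
mtx_file <- tempfile(fileext = ".mtx")
writeMM(rsparsematrix(10, 10, 0.1), mtx_file)

## readMM() loads the whole file into memory as a triplet-format sparse matrix
cts <- readMM(mtx_file)

## convert to column-compressed format before filtering and combining
cts <- as(cts, "CsparseMatrix")
```

Everything here is held in memory at once, which is where the scalability problems come from as the number of samples grows.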
Is there a function to load such files directly as DelayedArray objects?
Is there a recommended way to serially build up an HDF5 array from multiple samples?
Current Strategy
I have not been able to find such a method, and I am struggling to find documentation on creating HDF5 arrays from scratch. So far, I'm still loading each file as a CsparseMatrix, filtering, and then saving each one out using writeTENxMatrix. Roughly, something like:
library(Matrix)
library(HDF5Array)
library(SingleCellExperiment)
library(magrittr)
## example of input files once loaded as `CsparseMatrix` objects and filtered
cts1 <- rsparsematrix(10, 10, 0.1)
cts2 <- rsparsematrix(10, 10, 0.1)
## dump each out to HDF5 (temp)
cts1 <- writeTENxMatrix(cts1)
cts2 <- writeTENxMatrix(cts2)
## create SCEs and combine
sce <- list(cts1, cts2) %>%
    lapply(function (x) { SingleCellExperiment(assays=list(counts=x)) }) %>%
    do.call(what=cbind)
## dump out combined HDF5
cts_all <- writeHDF5Array(counts(sce), "test.h5",
                          as.sparse=TRUE, verbose=TRUE, with.dimnames=TRUE)
## recreate SCE with unified counts
sce <- SingleCellExperiment(assays=list(counts=cts_all), colData=colData(sce))
## save final SCE
saveRDS(sce, "test.Rds")
Any pointers on working better with this format would be appreciated. The above seems convoluted. Ultimately, I want one HDF5 file and one RDS file that are portable and contain all the data.