The issue here is at the time of this writing rhdf5
treats data.frame
s specially (writing as COMPOUND
type by default) and COMPOUND
data types are opaque to rhdf5 (again, at time of this writing). See compoundAsDataFrame
and DataFrameAsCompound
in ?h5write
. Compound types are all-or-nothing; subsetting not supported. See post by Bernd, maintainer of rhdf5: C: Reading by column . To illustrate, try this in an R session:
library(rhdf5)
(h5fl <- tempfile(fileext=".h5"))
h5createFile(file=h5fl)
matr <- matrix(1:12, nrow=4) ## control: 2D obj
h5write(matr, h5fl, "matr")
df <- data.frame(a=1:4, b=c(1.1, 2.1, 3.1, 4.1), d=42:45)
h5write(df, h5fl, "dfcompound")
h5write(df, h5fl, "dfsep", DataFrameAsCompound=FALSE)
H5close()
h5ls(h5fl) ## matr dim: 4 x 3 (known, transparent); dfcompound dim: 3 (opaque)
## issues warning (but not error), but guesses it's doing the right
## thing because length(index) matches dimensional extent of
## dfcompound reported by h5ls
h5read(h5fl, "dfcompound", index=list(2))
h5read(h5fl, "dfsep", index=list(2:3)) ## also wrong
The solution is straightforward; just create a function that manages the fact the columns are written to separate datasets in vector-like fashion:
## allColNames: character(), names of *all* the data.frame's columns
## (in original order), not just those selected
readCompound <- function(file, name, allColNames, index=NULL, ...) {
rowSubset <- NULL
colnms <- allColNames
if( !is.null(index) ) {
if( !is.null(index[[1]]) )
rowSubset <- index[[1]]
if( !is.null(index[[2]]) )
colnms <- allColNames[index[[2]]]
}
coldatasets <- paste(name, colnms, sep="/")
df <- lapply(coldatasets, function(x) {
h5read(file, x, index=list(rowSubset), ...)
})
names(df) <- colnms
as.data.frame(df)
}
(res <- readCompound(h5fl, "dfsep", names(df)))
(res <- readCompound(h5fl, "dfsep", names(df), index=list(NULL, 1)))
(res <- readCompound(h5fl, "dfsep", names(df), index=list(2:4, NULL)))
(res <- readCompound(h5fl, "dfsep", names(df), index=list(2:4, 2:3)))
Should be easy to extrapolate writing to such a dataset.
As for adding rows to a data.frame, I suggest creating the datasets independently, but in a way that mirrors the effect of DataFrameAsCompound=FALSE
. And perhaps take advantage of creating a dataset with room to grow (via maxdims). But hdf5 represents repeated values pretty efficiently, so growing might not be necessary for your use case.
createDFDataset <- function(file, datasetnm, df, dims=dim(df), maxdims=dim(df)) {
coldatasetnms <- paste(datasetnm, names(df), sep="/")
stormodes <- vapply(df, storage.mode, character(1))
h5createGroup(file, datasetnm)
success <- vapply(seq_along(df), function(x) {
h5createDataset(file, coldatasetnms[x], dims, maxdims,
storage.mode=stormodes[x])
}, logical(1))
if(all(success)) TRUE else names(df)[which(!success)]
}
## maxdims with room to grow
(res <- createDFDataset(h5fl, "thedf", df, c(2, 3), c(8, 3)))
H5close()
h5ls(h5fl, all=TRUE)
Here's a quick illustration of growing a dataset and writing to it:
library(rhdf5)
vec <- sample(1:99, 10000, replace=TRUE)
(h5grow <- tempfile(fileext=".h5"))
h5createFile(h5grow)
h5createDataset(h5grow, "vec", 5000, 10000, storage.mode="integer")
h5write(vec[1:5000], h5grow, "vec")
h5set_extent(h5grow, "vec", 10000)
h5write(vec[5001:length(vec)], h5grow, "vec", index=list(5001:length(vec)))
Let me know if you have any questions.
-Nate