Question

How can I add rows to a dataframe in an HDF5 file using rhdf5?

serefarikan • 0

@serefarikan-8314

Last seen 9.6 years ago

United Kingdom

Greetings,

I have a piece of R code that generates values that I'd like to add to a dataframe. Even if I pre-allocate the dataframe rows and assign new rows to avoid performance issues, performance is still not good enough. In some cases, I can't pre-allocate the dataframe because I don't know the final size (the loop checks some conditions at runtime).

I thought of appending rows to an HDF5 file, which is convenient for me in various ways. My dataframe has both numeric and string values. Based on the rhdf5 documentation, I can see that I can persist a dataframe and read it back. However, both operations work on the whole data set, which does not help me because having to build the dataframe in memory takes me back to my original problem. Can I create an empty dataframe with column definitions in R, save it to a file using rhdf5, and then insert rows without having to read the whole dataframe? Ideally, I'd like to hold some reference to the dataframe in the HDF5 file and add rows to it. I've done this using the rpg package and PostgreSQL, but I suspect HDF5 would be much faster, that is, if rhdf5 supports what I'm trying to do.

Cheers

Seref

rhdf5 hdf5 • 3.8k views

updated 9.6 years ago by Nathaniel Hayden ▴ 180 • written 9.6 years ago by serefarikan • 0

score 1 · Answer 1 · 2015-07-01

The issue is that, at the time of this writing, rhdf5 treats data.frames specially: by default it writes them as a COMPOUND type, and COMPOUND data types are opaque to rhdf5 (again, at the time of this writing). See compoundAsDataFrame and DataFrameAsCompound in ?h5write. Compound types are all-or-nothing; subsetting is not supported. See the post by Bernd, maintainer of rhdf5: C: Reading by column. To illustrate, try this in an R session:

library(rhdf5)

(h5fl <- tempfile(fileext=".h5"))
h5createFile(file=h5fl)
matr <- matrix(1:12, nrow=4) ## control: 2D obj
h5write(matr, h5fl, "matr")
df <- data.frame(a=1:4, b=c(1.1, 2.1, 3.1, 4.1), d=42:45)
h5write(df, h5fl, "dfcompound")
h5write(df, h5fl, "dfsep", DataFrameAsCompound=FALSE)
H5close()

h5ls(h5fl) ## matr dim: 4 x 3 (known, transparent); dfcompound dim: 3 (opaque)

## issues warning (but not error), but guesses it's doing the right
## thing because length(index) matches dimensional extent of
## dfcompound reported by h5ls
h5read(h5fl, "dfcompound", index=list(2))
h5read(h5fl, "dfsep", index=list(2:3)) ## also wrong

The solution is straightforward: write the data.frame with DataFrameAsCompound=FALSE, so that each column is stored as its own vector-like dataset, and create a function that reads the columns back from those separate datasets:

## allColNames: character(), names of *all* the data.frame's columns
## (in original order), not just those selected
readCompound <- function(file, name, allColNames, index=NULL, ...) {
    rowSubset <- NULL
    colnms <- allColNames
    ## index is list(rows, cols); either element may be NULL or omitted
    if( length(index) >= 1 && !is.null(index[[1]]) )
        rowSubset <- index[[1]]
    if( length(index) >= 2 && !is.null(index[[2]]) )
        colnms <- allColNames[index[[2]]]
    coldatasets <- paste(name, colnms, sep="/")

    ## read each column's dataset, applying the same row subset to all
    df <- lapply(coldatasets, function(x) {
        h5read(file, x, index=list(rowSubset), ...)
    })
    names(df) <- colnms
    as.data.frame(df)
}

(res <- readCompound(h5fl, "dfsep", names(df)))
(res <- readCompound(h5fl, "dfsep", names(df), index=list(NULL, 1)))
(res <- readCompound(h5fl, "dfsep", names(df), index=list(2:4, NULL)))
(res <- readCompound(h5fl, "dfsep", names(df), index=list(2:4, 2:3)))

It should be easy to extrapolate from this to writing to such a dataset.
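As a rough sketch of what the writing side might look like: writeDFRows below is a hypothetical helper (not part of rhdf5) that overwrites a set of rows by calling h5write with an index on each column's dataset. It assumes the one-dataset-per-column layout described above, which is built by hand here so the example is self-contained.

```r
library(rhdf5)

## writeDFRows: hypothetical helper, not an rhdf5 function.
## Writes the rows of `df` into positions `rows` of the per-column
## datasets stored under the group `name`.
writeDFRows <- function(file, name, df, rows) {
    stopifnot(length(rows) == nrow(df))
    for (colnm in names(df)) {
        h5write(df[[colnm]], file, paste(name, colnm, sep="/"),
                index=list(rows))
    }
    invisible(TRUE)
}

fl <- tempfile(fileext=".h5")
h5createFile(fl)
## one group with one 1-D dataset per column
h5createGroup(fl, "dfsep")
h5write(1:4, fl, "dfsep/a")
h5write(c(1.1, 2.1, 3.1, 4.1), fl, "dfsep/b")

## overwrite rows 2 and 3 only; rows 1 and 4 are left untouched
writeDFRows(fl, "dfsep", data.frame(a=c(20L, 30L), b=c(20.5, 30.5)),
            rows=2:3)
H5close()

h5read(fl, "dfsep/a")
h5read(fl, "dfsep/b")
```

The column types of the new rows should match the storage mode of the existing datasets (integer into integer, double into double).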

As for adding rows to a data.frame, I suggest creating the datasets independently, but in a way that mirrors the effect of DataFrameAsCompound=FALSE: one group containing one dataset per column. You can also create each dataset with room to grow (via maxdims). That said, HDF5 represents repeated values pretty efficiently, so over-allocating up front may be cheap enough that growing isn't necessary for your use case.

## Each column becomes its own 1-D dataset, so dims/maxdims are row counts
createDFDataset <- function(file, datasetnm, df, dims=nrow(df), maxdims=nrow(df)) {
    coldatasetnms <- paste(datasetnm, names(df), sep="/")
    stormodes <- vapply(df, storage.mode, character(1))
    h5createGroup(file, datasetnm)
    success <- vapply(seq_along(df), function(x) {
        h5createDataset(file, coldatasetnms[x], dims, maxdims,
                        storage.mode=stormodes[[x]])
    }, logical(1))
    if(all(success)) TRUE else names(df)[which(!success)]
}
## maxdims with room to grow: 2 rows now, up to 8 later
(res <- createDFDataset(h5fl, "thedf", df, 2, 8))
H5close()
h5ls(h5fl, all=TRUE)
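The data.frame above has only numeric columns, but the question mentions string values too. Here is a minimal sketch of a growable string column, assuming a fixed maximum string length is acceptable: h5createDataset takes a size argument for storage.mode="character". The dataset name "labels" and the length of 20 are made up for illustration.

```r
library(rhdf5)

fl <- tempfile(fileext=".h5")
h5createFile(fl)

## fixed-length strings of up to 20 characters (assumed limit);
## 2 rows now, room to grow to 8
h5createDataset(fl, "labels", dims=2, maxdims=8,
                storage.mode="character", size=20, chunk=2)
h5write(c("alpha", "beta"), fl, "labels")

## grow to 4 rows, then fill in only the new ones
h5set_extent(fl, "labels", 4)
h5write(c("gamma", "delta"), fl, "labels", index=list(3:4))
H5close()

h5read(fl, "labels")
```

Strings longer than size would not fit, so pick the limit with your data in mind.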

Here's a quick illustration of growing a dataset and writing to it:

library(rhdf5)
vec <- sample(1:99, 10000, replace=TRUE)
(h5grow <- tempfile(fileext=".h5"))
h5createFile(h5grow)
h5createDataset(h5grow, "vec", 5000, 10000, storage.mode="integer")
h5write(vec[1:5000], h5grow, "vec")
h5set_extent(h5grow, "vec", 10000)
h5write(vec[5001:length(vec)], h5grow, "vec", index=list(5001:length(vec)))
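For the case where the final size isn't known in advance, the same two calls (h5set_extent and h5write with an index) can be wrapped in an append loop. This is only a sketch: makeAppender is a hypothetical helper, and chunkRows and maxRows are assumed values you would tune. It grows the dataset one chunk at a time and trims the unused tail at the end.

```r
library(rhdf5)

## makeAppender: hypothetical helper, not an rhdf5 function.
## Creates a growable 1-D double dataset and returns functions to
## append values and to trim the dataset to what was written.
makeAppender <- function(file, dataset, chunkRows=1000L, maxRows=1e6) {
    h5createDataset(file, dataset, dims=chunkRows, maxdims=maxRows,
                    storage.mode="double", chunk=chunkRows)
    capacity <- chunkRows   # current extent on disk
    nWritten <- 0L          # values written so far
    list(
        append=function(x) {
            needed <- nWritten + length(x)
            if (needed > capacity) {
                ## grow in whole chunks so we don't resize on every call
                capacity <<- ceiling(needed / chunkRows) * chunkRows
                h5set_extent(file, dataset, capacity)
            }
            h5write(x, file, dataset, index=list((nWritten + 1L):needed))
            nWritten <<- needed
            invisible(nWritten)
        },
        finish=function() {
            ## shrink the extent to the number of values actually written
            h5set_extent(file, dataset, nWritten)
            nWritten
        }
    )
}

fl <- tempfile(fileext=".h5")
h5createFile(fl)
app <- makeAppender(fl, "vals")

set.seed(1)
allVals <- numeric(0)          # kept only to check the result
for (i in 1:25) {
    x <- runif(100)            # stand-in for values produced in the loop
    app$append(x)
    allVals <- c(allVals, x)
}
n <- app$finish()
H5close()

length(h5read(fl, "vals"))
```

For a data.frame you would keep one such dataset per column and append to each with the same row range.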

Let me know if you have any questions.

-Nate