It looks like your files are quite different in structure from my 'large matrix' example. You have hundreds of small datasets, and h5read() takes a relatively long time to recurse through the file and read them individually.
Here's a quick and dirty function that simplifies the reading for your file type and performs a fair bit quicker than h5read(), although still nowhere near as fast as hdf5load(). It assumes that you have a flat hierarchy in your h5 files and that you don't want anything extra like the attribute data.
library(rhdf5)
library(hdf5)
library(microbenchmark)

h5File <- "paracou-Q-2036-01-00-000000-g01.h5"

# f0: hdf5 package reader; f1: stock rhdf5 reader
f0 <- function() hdf5load(file = h5File, load = FALSE,
                          verbosity = 0, tidy = TRUE)
f1 <- function() h5read(h5File, "/")

h5read_optimised <- function(h5File) {
  # list the top-level names only, skipping the per-dataset info
  dset_names <- h5ls(h5File, recursive = FALSE, datasetinfo = FALSE)$name
  # open the file once and reuse the handle for every dataset
  fid <- H5Fopen(h5File)
  contents <- sapply(dset_names, FUN = function(fid, dset_name) {
    did <- H5Dopen(fid, name = dset_name)
    res <- H5Dread(did)
    H5Dclose(did)
    res
  }, fid = fid)
  H5Fclose(fid)
  return(contents)
}

f2 <- function() h5read_optimised(h5File)
> microbenchmark(f0(), f1(), f2(), times = 5)
Unit: milliseconds
 expr        min         lq      mean    median         uq        max neval
 f0()   47.07749   47.11715   48.1333   47.2934   48.24441   50.93405     5
 f1() 1576.38092 1680.43496 1680.1064 1683.5432 1694.62365 1765.54923     5
 f2()  540.33196  544.24837  562.4359  553.9433  577.82355  595.83262     5
> identical(f1(), f2())
[1] TRUE
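The flat-hierarchy assumption mentioned earlier can be checked before using a reader like h5read_optimised(). Here is a minimal sketch, assuming the listing is a data frame with the group, name and otype columns that h5ls() returns. is_flat_listing() is a hypothetical helper, not part of rhdf5, and the two mock listings stand in for real files.

```r
# Hypothetical helper (not part of rhdf5): TRUE when every entry in an
# h5ls()-style listing is a dataset sitting directly under the root group.
is_flat_listing <- function(listing) {
  all(listing$otype == "H5I_DATASET") && all(listing$group == "/")
}

# Mock listings mimicking the columns of h5ls() output (illustrative names).
flat <- data.frame(
  group = c("/", "/"),
  name  = c("A", "B"),
  otype = c("H5I_DATASET", "H5I_DATASET"),
  stringsAsFactors = FALSE
)
nested <- data.frame(
  group = c("/", "/grp"),
  name  = c("grp", "A"),
  otype = c("H5I_GROUP", "H5I_DATASET"),
  stringsAsFactors = FALSE
)

is_flat_listing(flat)    # TRUE
is_flat_listing(nested)  # FALSE
```

With a real file you would pass the full recursive listing, for example is_flat_listing(h5ls(h5File)), and fall back to h5read() when it returns FALSE.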
Most of the remaining time difference comes from rhdf5 being very careful: every time it is passed a file or dataset handle, it checks that the type is correct. For the single large matrix example it only does this once and spends the rest of the time reading, but for your files it does this check hundreds of times and spends a significant portion of the runtime there.
Here's a version where all the checking is removed and you're right down at the interface with the C functions. Almost all of this is undocumented, and it's obviously tuned to the structure of the file you shared using the default arguments, but it's pretty quick. Based on this, I am wondering whether this type checking could be made optional, so that some internal functions can skip it.
h5read_optimised_more <- function(h5File) {
  dset_names <- h5ls(h5File, recursive = FALSE, datasetinfo = FALSE)$name
  fid <- H5Fopen(h5File)
  # one dataset access property list, shared by every _H5Dopen call
  dapl <- H5Pcreate("H5P_DATASET_ACCESS")
  contents <- sapply(dset_names, FUN = function(fid, dset_name) {
    # call the C entry points directly, bypassing rhdf5's handle checks
    did <- .Call("_H5Dopen", fid@ID, dset_name, dapl@ID, PACKAGE = 'rhdf5')
    res <- .Call("_H5Dread", did, NULL, NULL, NULL, TRUE, 0L, FALSE, PACKAGE = 'rhdf5')
    invisible(.Call("_H5Dclose", did, PACKAGE = 'rhdf5'))
    return(res)
  }, fid = fid)
  H5Pclose(dapl)
  H5Fclose(fid)
  return(contents)
}

f3 <- function() h5read_optimised_more(h5File)
> microbenchmark(f0(), f1(), f2(), f3(), times = 5)
Unit: milliseconds
 expr        min         lq       mean     median         uq        max neval
 f0()   48.79590   49.02661   52.35011   51.39961   52.22597   60.30245     5
 f1() 1544.00682 1550.06203 1576.59195 1573.27627 1603.67828 1611.93634     5
 f2()  539.84307  562.46172  576.96321  566.65601  598.58858  617.26666     5
 f3()   37.99232   38.52886   39.36942   39.39166   40.29512   40.63916     5
> identical(f1(), f3())
[1] TRUE
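To put the benchmark above in relative terms, here is a small sketch that divides each median by the f3() median. The numbers are copied from the output above; the variable names are just for illustration.

```r
# Median timings in milliseconds, copied from the benchmark output above.
med <- c(f0 = 51.39961, f1 = 1573.27627, f2 = 566.65601, f3 = 39.39166)

# How many times slower each reader is than f3 (the unchecked version).
slowdown_vs_f3 <- med / med[["f3"]]
print(round(slowdown_vs_f3, 1))
```

On these five runs, the unchecked version comes out roughly 40 times faster than h5read() and about 14 times faster than the first optimised function. With only five repetitions on a single file, treat the ratios as indicative rather than precise.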
This isn't a particularly good example because 'toy.h5' is small, but
Leads to
So rhdf5 is faster. More important, though, is that hdf5 reads the data literally, whereas rhdf5 transposes it (this is by design).
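The transposition is most likely the usual row-major versus column-major mismatch: HDF5 stores arrays in row-major (C) order, while R arrays are column-major. Here is a minimal base-R sketch of that effect, assuming this is the cause and using a made-up six-element buffer rather than a real file.

```r
# Made-up flat buffer, as it would sit on disk for a 2 x 3 row-major array.
buf <- 1:6

# "Literal" view: fill by row, so the result matches the on-disk 2 x 3 layout.
literal <- matrix(buf, nrow = 2, ncol = 3, byrow = TRUE)

# Column-major view of the same buffer with the dimensions reversed:
# no data is moved, but the result is the transpose of the literal view.
reversed <- matrix(buf, nrow = 3, ncol = 2)

identical(reversed, t(literal))  # TRUE
```

If you need the literal orientation from data read the second way, t() for matrices or aperm() for higher-dimensional arrays restores it.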
So I wonder whether you have a benchmark file that you'd be willing to share?
Yes, sure.
I tested the two functions over the full range of file size and types:
It appears that hdf5 is always faster but for the second type ('S' instead of 'Q' files), the time difference is less critical. Anyway, I must read everything ('S' and 'Q' files) and almost the whole hdf5 content.
Here is a link to download an example of model output file:
https://www.dropbox.com/s/jm1evy3nur8hio7/paracou-Q-2036-01-00-000000-g01.h5?dl=0
Thanks for following up with your data. I'll take a look at it in the next few days and try to offer some insight. I just did some testing with the code below to read a 'large' integer matrix where rhdf5 seemed to be quicker, so there are definitely a few things to look at. This is with the devel version of rhdf5 that I mentioned.
Out of interest, did you create version 1.6.10 of hdf5 yourself? The last I can find is 1.6.9, but I had to modify the configure script to get it to install.
Thanks for the reply and the time you are dedicating to my problem. Let me know if you find anything.
No, I did not create the 1.6.10 version of the package. As it was a while ago, I can't remember exactly where I downloaded it online. Maybe here:
https://packages.debian.org/fr/jessie/r-cran-hdf5