Question

HDF5 reading speed

1

Entering edit mode

felicien.meunier ▴ 40

@felicienmeunier-14964

Last seen 7.0 years ago

For research purposes, I must read a lot of large hdf5 files generated by an environmental model. Right now, I open them in rstudio using the rhdf5 package. I recently noticed that the most time-limiting step was the reading of the hdf5 files and so I wanted to speed up that part of the code. Switching from rhdf5 to the old hdf5 package, the speed was considerably reduced (it goes something like ten times faster). Is there any reasonable explanation that explains such a speed difference? I would prefer to keep rhdf5 as it is still supported, so is there a way to reach better performances? As it could be due to some bad practice, please find below the two lines of code I use to open and read the hdf5 files:

With rhdf5(slow)

mymont    = h5read(h5file,'/')

and with hdf5 (fast):

mymont    = hdf5load(file=h5file,load=FALSE,verbosity=0,tidy=TRUE)

rhdf5 hdf5 • 5.9k views

ADD COMMENT • link updated 7.2 years ago by Mike Smith ★ 6.6k • written 7.2 years ago by felicien.meunier ▴ 40

1

Entering edit mode

This isn't a particularly good example because 'toy.h5' is small, but

fl = system.file(package="HDF5Array", "extdata", "toy.h5")
f0 = function() hdf5load(fl, load=FALSE, verbosity=0, tidy=TRUE)
f1 = function() h5read(fl, "/")
microbenchmark(f0(), f1(), times=5)

Leads to

> microbenchmark(f0(), f1(), times=5)
Unit: milliseconds
 expr      min       lq      mean   median       uq       max neval cld
 f0() 93.48307 94.90350 100.62300 95.84296 95.87789 123.00760     5   b
 f1() 70.55737 70.90299  72.31741 72.23494 72.69283  75.19891     5  a

So rhdf5 is faster. More important, though, is that hdf5 reads the data literally, wheres rhdf5 transposes it (this is by design)

> str(f1())
List of 2
 $ M1: num [1:10000, 1:150] 0.44 0.189 0.992 0.526 0.624 ...
 $ M2: num [1:150, 1:200] 9.033 0.56 8.109 5.852 -0.821 ...
> str(f0())
Dotted pair list of 2
 $ M2: num [1:200, 1:150] 9.033 -0.837 -3.066 14.924 -4.724 ...
 $ M1: num [1:150, 1:10000] 0.44 0.023 0.117 0.549 0.525 ...

So I wonder whether you have a benchmark file that you'd be willing to share?

> packageVersion("rhdf5")
[1] '2.23.5'
> packageVersion("hdf5")
[1] '1.6.9'

ADD REPLY • link 7.2 years ago Martin Morgan 25k

2

Entering edit mode

Yes, sure.

I tested the two functions over the full range of file size and types:

library(rhdf5)
library(hdf5)
library(microbenchmark)

direct = '/home/femeunier/Documents/ED2/ED/run/paracou/analy/'
direct2= '/home/femeunier/Documents/ED2_archives/Marcos/ED2/ED/run/amazon/nolianas/'
  
small_fl = file.path(direct,'paracou-Q-2000-03-00-000000-g01.h5')
normal_fl = file.path(direct,'paracou-Q-2036-01-00-000000-g01.h5')
S_fl = file.path(direct2,'amazon-S-2029-07-01-000000-g01.h5')

f0 = function() hdf5load(small_fl, load=FALSE, verbosity=0, tidy=TRUE)
f1 = function() h5read(small_fl, "/")

> microbenchmark(f0(), f1(), times=5)
Unit: milliseconds
 expr        min         lq       mean     median         uq        max neval
 f0()   25.24533   26.10443   27.34176   26.78865   28.15434   30.41607     5
 f1() 1063.33951 1108.48518 1166.39708 1176.59489 1200.20679 1283.35904     5

f2 = function() hdf5load(normal_fl, load=FALSE, verbosity=0, tidy=TRUE)
f3 = function() h5read(normal_fl, "/")

> microbenchmark(f2(), f3(), times=5)
Unit: milliseconds
 expr        min         lq       mean     median         uq        max neval
 f2()   35.69882   36.25049   46.51748   36.59257   38.60065   85.44488     5
 f3() 1083.67681 1106.77546 1148.52053 1116.24633 1164.70662 1271.19742     5

f4 = function() hdf5load(S_fl, load=FALSE, verbosity=0, tidy=TRUE)
f5 = function() h5read(S_fl, "/")

> microbenchmark(f4(), f5(), times=5)
Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
 f4() 269.1774 279.5624 296.0794 292.2530 316.0172 323.3869     5
 f5() 714.0194 726.4895 775.2907 742.2052 769.6752 924.0642     5

It appears that hdf5 is always faster but for the second type ('S' instead of 'Q' files), the time difference is less critical. Anyway, I must read everything ('S' and 'Q' files) and almost the whole hdf5 content.

Here is a link to download an example of model output file:

https://www.dropbox.com/s/jm1evy3nur8hio7/paracou-Q-2036-01-00-000000-g01.h5?dl=0

> packageVersion("rhdf5")
[1] ‘2.22.0’
> packageVersion("hdf5")
[1] ‘1.6.10’

ADD REPLY • link 7.2 years ago felicien.meunier ▴ 40

2

Entering edit mode

Thanks for the follow up with your data I'll take a look at it in the next few days and try to offer some insight. I just did some testing with the code below to read a 'large' integer matrix where rhdf5 seemed to be quicker, so there's definitely a few things to look at. This is with the devel version of rhdf5 that I mentioned.

Out of interest, did you create version 1.6.10 of hdf5 yourself? The last I can find is 1.6.9, but I had to modify the configure script to get it to install.

library(rhdf5)
library(hdf5)
library(microbenchmark) 

## large matrix
dat <- matrix(1L:50000000L, ncol = 500)

## specify file
h5File <- tempfile(fileext = ".h5")
if(file.exists(h5File))
    file.remove(h5File)

## open file
fid <- H5Fcreate(name = h5File)

## create dataset with GZIP compression
sid <- H5Screate_simple(dims = dim(dat), maxdims = dim(dat))
pid <- H5Pcreate(type = "H5P_DATASET_CREATE")
H5Pset_deflate( pid, level = 6 )
H5Pset_chunk(pid, c(100, 100))
did <- H5Dcreate(h5loc = fid, name = "blosc", 
                 dtype_id = "H5T_NATIVE_INT32",
                 sid, dcpl = pid)
H5Dwrite(did, buf = dat)

## tidy up
H5Dclose(did)
H5Pclose(pid)
H5Sclose(sid)
H5Fclose(fid)

## define reading functions
f0 <- function() hdf5load(file = h5File, load=FALSE, verbosity=0, tidy=TRUE)
f1 <- function() h5read(h5File, "/")

> microbenchmark(f0(), f1(), times=5)
Unit: seconds
 expr      min      lq     mean   median       uq      max neval
 f0() 1.645665 1.68502 1.707856 1.720116 1.728094 1.760384     5
 f1() 1.012661 1.03006 1.043414 1.041873 1.046944 1.085533     5

> packageVersion("rhdf5")
[1] '2.23.6'
> packageVersion("hdf5")
[1] '1.6.9'

ADD REPLY • link 7.2 years ago Mike Smith ★ 6.6k

0

Entering edit mode

Thank for the reply and the time you dedicate to my problem. Let me know if you find anything.

No, I did not create the 1.6.10 version of the package. As it was a while ago, I can remember where exactly by I downloaded it online. Maybe here:

https://packages.debian.org/fr/jessie/r-cran-hdf5

ADD REPLY • link 7.2 years ago felicien.meunier ▴ 40

2

Entering edit mode

Mike Smith ★ 6.6k

@mike-smith

Last seen 13 hours ago

EMBL Heidelberg

Interesting question, I've never really compared the various HDF5 reading packages against each other. Since they all rely on the same underlying C library, my assumption was that read speeds would be comparable between them, and factors like compression levels, chunk size, and disk I/O would determine performance - maybe this is not the case. Given that last update to the hdf5 package was in 2009 and it's no longer on CRAN it definitely wasn't on my radar.

That said, I did recently find some small performance bottlenecks in rhdf5, but they reduced read times by more like 25% rather than many times faster. You can see if they help your use case by trying the version in the devel branch of Bioconductor.

As far as your commands go, they are fine if you want to read the entire file. If you only need a subset of the data then it is almost certainly quicker to use something like h5readDataset() to extract only the parts that you need. Similarly, if you're accessing the same file multiple times, it can be quicker to use the 'low-level' functions and open it once using H5Fopen() followed by H5Dread() on the parts you need. That really depends what exactly you're doing - it won't be beneficial for a single read of the whole file.

ADD COMMENT • link 7.2 years ago Mike Smith ★ 6.6k

0

Entering edit mode

As said above, I need to read almost everything in these files. That is why I use hdf5load and h5read.

Thanks for the tip, I will try the devel branch version of the rhdf5 package and let you know the potential changes.

EDIT: somehow I could not install the develop version:

> biocLite('rhdf5_2.23.5.tar.gz',type="source")
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.6 (BiocInstaller 1.28.0), R 3.4.3 (2017-11-30).
Installing package(s) ‘rhdf5_2.23.5.tar.gz’
inferring 'repos = NULL' from 'pkgs'
ERROR: dependency ‘Rhdf5lib’ is not available for package ‘rhdf5’
* removing ‘/home/femeunier/R/x86_64-pc-linux-gnu-library/3.4/rhdf5’
Warning message:
In install.packages(pkgs = doing, lib = lib, ...) :
  installation of package ‘rhdf5_2.23.5.tar.gz’ had non-zero exit status

ADD REPLY • link 7.2 years ago felicien.meunier ▴ 40

1

Entering edit mode

The devel version has changed structure and I've moved the underlying C library into a separate package called Rhdf5lib. You should be able to install that the usual way from Bioconductor, but you can also install devel versions of both from my Github account with:

BiocInstaller::biocLite('grimbough/Rhdf5lib')
BiocInstaller::biocLite('grimbough/rhdf5')

ADD REPLY • link 7.2 years ago Mike Smith ★ 6.6k

0

Entering edit mode

Ok, I could install it but it did not change much as compared to the optimized solution you suggested above.

ADD REPLY • link 7.2 years ago felicien.meunier ▴ 40

score 4 · Accepted Answer · 2018-02-08

It looks like your files are quite different in structure from my 'large matrix' example. You have hundreds of small datasets, and h5read() takes a relatively long time to recurse the file and read them individually.

Here's a quick an dirty function that simplifies the reading for your file type, and performs a fair bit quicker than h5read(), although still nowhere near as fast as h5load() . It assumes that you have a flat heirachy in your h5 files and that you don't want anything extra like the attribute data.

library(rhdf5)
library(hdf5)
library(microbenchmark)

h5File <- "paracou-Q-2036-01-00-000000-g01.h5"

f0 <- function() hdf5load(file = h5File, load=FALSE, 
                          verbosity=0, tidy=TRUE)
f1 <- function() h5read(h5File, "/")

h5read_optimised <- function( h5File ) {

    dset_names <- h5ls(h5File, recursive = FALSE, datasetinfo = FALSE)$name

    fid <- H5Fopen(h5File)
    contents <- sapply(dset_names, FUN = function(fid, dset_name) {
        did <- H5Dopen(fid, name = dset_name)
        res <- H5Dread(did)
        H5Dclose(did)
        res
    }, fid = fid)
    H5Fclose(fid)

    return(contents)
}

f2 <- function() h5read_optimised(h5File)

> microbenchmark(f0(), f1(), f2(), times = 5)
Unit: milliseconds
expr        min         lq      mean    median         uq        max neval
f0()   47.07749   47.11715   48.1333   47.2934   48.24441   50.93405     5
f1() 1576.38092 1680.43496 1680.1064 1683.5432 1694.62365 1765.54923     5
f2()  540.33196  544.24837  562.4359  553.9433  577.82355  595.83262     5
> identical(f1(), f2())
[1] TRUE

The remaining time diffence is mostly spent by rhdf5 being very careful and checking everytime it is passed a file or dataset handle that the type is correct. For the single large matrix example it only does this once, and spends the rest of the time reading, but for your files it does this check hundreds of times and spends a significant portion of the runtime there.

Here's a version where all the checking is removed, and you're right down at the interface with the C functions. Almost all of this is undocumented, and it's obviously tuned to the structure of the file you shared using the default arguments, but it's pretty quick. Based on this I am wondering whether making this type checking could be optional, so some internal functions can skip it.

h5read_optimised_more <- function( h5file ) {

    dset_names <- h5ls(h5File, recursive = FALSE, datasetinfo = FALSE)$name

    fid <- H5Fopen(h5File)
    dapl <- H5Pcreate("H5P_DATASET_ACCESS")
    contents <- sapply(dset_names, FUN = function(fid, dset_name) {
        did <- .Call("_H5Dopen", fid@ID, dset_name, dapl@ID, PACKAGE='rhdf5')
        res <- .Call("_H5Dread", did, NULL, NULL, NULL, TRUE, 0L, FALSE, PACKAGE='rhdf5')
        invisible(.Call("_H5Dclose", did, PACKAGE='rhdf5'))
        return(res)
    }, fid = fid)

    H5Pclose(dapl)
    H5Fclose(fid)

    return(contents)
}

f3 <- function() h5read_optimised_more(h5File)

> microbenchmark(f0(), f1(), f2(), f3(), times = 5)
Unit: milliseconds
 expr        min         lq       mean     median         uq        max neval
 f0()   48.79590   49.02661   52.35011   51.39961   52.22597   60.30245     5
 f1() 1544.00682 1550.06203 1576.59195 1573.27627 1603.67828 1611.93634     5
 f2()  539.84307  562.46172  576.96321  566.65601  598.58858  617.26666     5
 f3()   37.99232   38.52886   39.36942   39.39166   40.29512   40.63916     5
 > identical(f1(), f3())
[1] TRUE