Question

Incremental Write

kpalmer ▴ 10

Hey all, this seems like a very basic question, but I'm struggling to find the answer. I'd like to create a loop that appends data to an HDF5 file.

The background is that I'm looking to store large spectrograms (up to 150k columns and roughly 183,600 rows) and then calculate summary stats over certain times and frequencies. To create these spectrograms, audio data are loaded into R, the FFT is computed, and the power spectral density is calculated over each one-minute period. I'd like to simply append each result to an existing deployment dataset using something like the following.

For example:

# Create the HDF5 file
library(rhdf5)
h5createFile("AudioRecorder.h5")
# Create a group for location 1
h5createGroup("AudioRecorder.h5", "location1")

# Iterate through the 'wav' files, here just simulated
for(ii in 1:30){

  # Time and frequency limits; time is POSIXct in practice, but that is irrelevant for the example
  timeVals = (1:10)
  freqVals =1:75000

  # Simulated spectrogram data: 10 min x 75k columns
  simAPSD<-matrix(rnorm(10*75000), ncol=75000)

  # Add or append the data
  h5write(
    simAPSD,
    file = "AudioRecorder.h5",
    paste('location1',"simAPSD",sep="/"))

  h5write(
    as.character(timeVals),
    file = "AudioRecorder.h5",
    paste('location1',"time",sep="/"))

  print(ii)

}

I was expecting something like h5write(..., append = TRUE), but this doesn't seem to exist. Thanks in advance for your help!

append rhdf5 • 1.3k views

James W. MacDonald 68k

Things may have changed since I did this, but in my experience you have to instantiate an HDF5 dataset with the expected number of rows and columns, and then you can dump data in. Here's a function I wrote back in the day to make an HDF5-backed SummarizedExperiment with data I couldn't read in all at once. It's a different use case, but it should show the broad outline.

makeSE.hdf5 <- function(kgranges, funcgranges, gtexgranges, baselinefile, dir = "my_h5_se",
                        fnames, startover = TRUE, startwith = NULL){
    require("SummarizedExperiment")
    require("HDF5Array")
    if(!file.exists(dir)) dir.create(dir)
    fn <- paste(dir, "assays.h5", sep = "/")
    rr <- unlist(as(kgranges, "GRangesList"))
    basefiles <- scan(baselinefile[[1]], "c", nlines = 1, sep = "\t")[-(1:5)]
    if(!file.exists(fn) || startover){
        if(startover) unlink(fn)
        h5createFile(fn)
        flen <- sum(sapply(kgranges, length))
        baselinewidth <- length(basefiles)
        ## have to add one because we create the union column
        fwid <- length(funcgranges) + length(gtexgranges) + baselinewidth + 1
        cat("Creating an HDF5 file, dimension", flen, "x", fwid, "\n")
        h5createDataset(fn, "assay1", c(flen, fwid), storage.mode = "logical", chunk = c(1000,fwid))
    }
    if(!startover && is.null(startwith))
        stop(paste("If not starting over, remember to supply the original kgranges object with",
                   "a startwith value that represents the kgranges list item to start with."),
             call. = FALSE)
    NAMES <- do.call(c, lapply(kgranges, names))
    if(!startover){
        dontuse <- 1:(startwith - 1)
        firstrow <- sum(sapply(kgranges[dontuse], length)) + 1
        kgranges <- kgranges[-dontuse]
    } else {
        firstrow <- 1
    }
    for(i in seq(along = kgranges)){
        mat <- populateMatrix(kgranges[[i]], olaplst)
        mat2 <- populateMatrix(kgranges[[i]], gtexgranges, TRUE)
        mat3 <- addInBaseline(kgranges[[i]], baselinefile[[i]])
        mat <- cbind(mat, mat2, mat3)
        cat(paste("Running", unique(as.character(seqnames(kgranges[[i]])))),
            paste("Starting at", firstrow),
            paste("Ending at", (firstrow+nrow(mat)-1)),sep = "\n")
        h5write(mat, fn, "assay1", FALSE, index = list(firstrow:(firstrow+nrow(mat)-1), 1:ncol(mat)))
        cat(paste("Finished writing", unique(as.character(seqnames(kgranges[[i]]))), "to disk"), "\n")
        firstrow <- firstrow + nrow(mat)
        rm(mat)
        gc()
        H5close()
    }
    coldat <- DataFrame(Path = c(dirname(fnames), "Internally_generated",
                                 rep(dirname(baselinefls)[1], length(basefiles))),
                        Source = rep(c("GTEx","LDSC"), c(length(fnames)+1, length(basefiles))),
                        Filename = c(basename(fnames), "Internally_generated", basefiles))
    rownames(coldat) <- coldat$Filename
    out <- SummarizedExperiment(assays = HDF5Array("my_h5_se/assays.h5", "assay1"),
                                colData = coldat,
                                rowRanges = rr)
    names(out) <- NAMES
    ## Save it and output
    out@assays <- SummarizedExperiment:::.shorten_h5_paths(out@assays)
    saveRDS(out, file.path(dir, "se.rds"))
    ## we don't return anything - this function is just to generate, and the
    ## file can then be opened using loadHDF5SummarizedExperiment
}

score 3 · Accepted Answer · 2023-02-12

@james-w-macdonald-5106 has it right. If you know the final size of the matrix you want to create, the easiest approach is to let HDF5 know the final size of the dataset at creation time, and then fill in the content incrementally.

Here's a slightly modified example from your data, which will loop 3 times and insert each 10 x 75,000 matrix into a pre-created 30 x 75,000 matrix. You can modify the strategy if I've got the append direction incorrect:

setwd(tempdir())
library(rhdf5)

h5createFile("AudioRecorder.h5")
h5createGroup("AudioRecorder.h5", "location1")

h5createDataset(file = "AudioRecorder.h5", dataset = "/location1/simAPSD", 
                ## set the dimensions to be our known final matrix size
                dims = c(3 * 10, 75000), 
                ## We should create the dataset with some chunks. This will make it
                ## much faster to access the data later. I've gone for chunks based
                ## on the input data size, but they can be anything if other sizes are more suitable
                chunk = c(10, 7500), 
                ## state we're going to write floating-point values 
                storage.mode = "double")

for(ii in 1:3){

  timeVals = (1:10)
  freqVals = 1:75000

  simAPSD<-matrix(rnorm(10*75000), ncol=75000)

  h5write(
    simAPSD,
    file = "AudioRecorder.h5",
    name = "location1/simAPSD", 
    ## I'm using the "start" and "count" arguments to indicate where the data should be inserted.
    ## Here "start" will be rows 1, 11 and 21 across the 3 iterations, always column 1.
    ## "count" is how many rows and columns should be written, which matches the size of our input.
    ## You can also use "index" instead of "start" + "count" for a more traditional R-style approach.
    start = c(((ii-1)*nrow(simAPSD))+1, 1),
    count = c(nrow(simAPSD), ncol(simAPSD))
  )

  print(ii)

}
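
To illustrate the "index" alternative mentioned in the comments, and how you might later pull out just a time/frequency window for summary stats, here is a minimal sketch. The file name IndexDemo.h5, the reduced column count (750 instead of 75,000, only to keep it quick) and the chosen window are assumptions made up for this example, not something from your data:

```r
library(rhdf5)

## Hypothetical file name and reduced dimensions, purely for illustration
h5file <- file.path(tempdir(), "IndexDemo.h5")
if (file.exists(h5file)) file.remove(h5file)
nMin  <- 10    # rows (minutes) per simulated recording
nFreq <- 750   # columns (frequency bins); assumption, smaller than the real data
nFile <- 3     # number of simulated recordings

h5createFile(h5file)
h5createGroup(h5file, "location1")
h5createDataset(h5file, "location1/simAPSD",
                dims = c(nFile * nMin, nFreq),
                chunk = c(nMin, nFreq),
                storage.mode = "double")

set.seed(1)
for (ii in seq_len(nFile)) {
  simAPSD <- matrix(rnorm(nMin * nFreq), ncol = nFreq)
  rows <- ((ii - 1) * nMin + 1):(ii * nMin)
  ## 'index' takes one vector of positions per dimension: rows, then columns
  h5write(simAPSD, file = h5file, name = "location1/simAPSD",
          index = list(rows, seq_len(nFreq)))
}

## Read only rows 11-20 (the second recording) and columns 100-200,
## then average each frequency bin over that time window
window <- h5read(h5file, "location1/simAPSD", index = list(11:20, 100:200))
dim(window)   # 10 101
meanPSD <- colMeans(window)
```

Only the requested block is read from disk, so this kind of subsetting should stay practical even when the full dataset is far too large to load at once.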

Having said that this is the most straightforward strategy, I'll add that it is also possible to grow HDF5 datasets dynamically if you don't know the final size when you begin writing. However, the code is more involved than what I've shown here. If that's the case with your example, please reply and I'll try to provide an example where we grow the dataset as needed.
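
For reference, a rough sketch of what growing a dataset could look like, assuming the "maxdims" argument of h5createDataset() and the h5set_extent() function in rhdf5. The file name GrowDemo.h5, the reduced column count and the upper bound of 10,000 rows are assumptions chosen for illustration; treat this as a starting point rather than a tested recipe for your data:

```r
library(rhdf5)

## Hypothetical file name and reduced column count, for illustration only
h5file <- file.path(tempdir(), "GrowDemo.h5")
if (file.exists(h5file)) file.remove(h5file)
dset  <- "location1/simAPSD"
nMin  <- 10    # rows per simulated recording
nFreq <- 750   # columns (frequency bins); assumption, smaller than the real data

h5createFile(h5file)
h5createGroup(h5file, "location1")
## Create the dataset at the size of the first block, but allow the row
## dimension to grow later. Chunking is required when dims != maxdims.
## The 10,000-row ceiling is an arbitrary assumption; the rhdf5 documentation
## also describes H5Sunlimited() for a dimension with no fixed upper bound.
h5createDataset(h5file, dset,
                dims = c(nMin, nFreq),
                maxdims = c(10000, nFreq),
                chunk = c(nMin, nFreq),
                storage.mode = "double")

set.seed(1)
currentRows <- nMin   # current extent of the dataset on disk
rowsWritten <- 0      # number of rows filled so far
for (ii in 1:3) {
  block <- matrix(rnorm(nMin * nFreq), ncol = nFreq)
  newTotal <- rowsWritten + nrow(block)
  ## Extend the row dimension only when the next block would not fit
  if (newTotal > currentRows) {
    h5set_extent(h5file, dset, c(newTotal, nFreq))
    currentRows <- newTotal
  }
  h5write(block, file = h5file, name = dset,
          start = c(rowsWritten + 1, 1),
          count = dim(block))
  rowsWritten <- newTotal
}

dim(h5read(h5file, dset))   # 30 750
```

Extending one block at a time keeps the code simple; if you append many small blocks, it may be worth growing the dataset in larger steps and tracking rowsWritten separately, as above.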