Eliminating repetitive calls to ExperimentHub

2

slinker ▴ 20

Background: I am creating a new package, let's call it AnalysisPackage, to be submitted to Bioconductor. AnalysisPackage plots maps of the human brain colored by the enrichment or depletion of gene sets. The maps for each brain region are pretty large. In total, if I were to save all of this in sysdata.rda, the file would be ~25MB. This is far over the 4MB limit for Bioconductor packages, so I have generated an additional ExperimentData package, let's call it DataPackage, which I now access through the ExperimentHub() function.

The problem: There are multiple functions in AnalysisPackage that require data that's stored in DataPackage. There are also nested functions in AnalysisPackage. This means that every time the end user runs a function there are 6-12 repetitive calls to ExperimentHub. This adds a frustrating amount of time to each process.

The question: Is there a way to automatically load the data from DataPackage into memory when the user runs library(AnalysisPackage) so that I don't have to continuously interact with ExperimentHub? Alternatively, has anyone found any different solution around this type of problem? The only thing that I can come up with is passing the data from one function to another, though that would create an unnecessarily large data object for the end user to have to deal with. This doesn't seem like the optimal strategy.

Thanks in advance, Sara

experimenthub annotationhub large dataset • 1.7k views

ADD COMMENT • link 6.9 years ago slinker ▴ 20

3

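I read your question as trying to avoid the cost of repeatedly reading the data from disk. One option is to 'memoize' the data, i.e., cache the result of a function call so that later calls with the same argument return the stored value. The CRAN package for this is spelled memoise. A simple example is

> library(memoise)
> f = function(i) { Sys.sleep(i); i }
> fm = memoise(f)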
> system.time(fm(1))
   user  system elapsed 
  0.004   0.000   1.002 
> system.time(fm(1))
   user  system elapsed 
  0.024   0.000   0.028 
> system.time(fm(2))
   user  system elapsed 
  0.000   0.000   2.005 
> system.time(fm(2))
   user  system elapsed 
      0       0       0

The idea would be to write an (internal) helper function such as

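.hub <- memoise::memoise(function() {
    ## create the hub on first use, not when the package is installed
    ExperimentHub::ExperimentHub()
})

.helper <- memoise::memoise(function(ehid) {
    ## retrieved once per session; later calls with the same id hit the cache
    .hub()[[ehid]]
})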

This loads data on first use, so not all users would pay the cost of loading data. One would want to take additional precautions if this were to be used in a parallel evaluation context.

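There are likely other approaches, e.g., using an .onLoad() function to load the data when the package is loaded:

.cache <- new.env(parent=emptyenv())
.onLoad <- function(libname, pkgname) {
    hub <- ExperimentHub::ExperimentHub()
    ## "EH123" is a placeholder for the real resource id
    .cache[["EH123"]] <- hub[["EH123"]]
    ## ... load any further resources in the same way
}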

And then reference .cache[["EH123"]] in your code.
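The two ideas can also be combined: keep the environment as the cache, but fill it on first use instead of in .onLoad(). The following is only a minimal sketch in base R. The names .get_resource(), .fetch_resource(), and n_fetches are hypothetical and used for illustration, and .fetch_resource() is a stand-in for the real hub lookup (something like hub[["EH123"]]) so that the example runs without network access.

```r
## Package-private cache; environments have reference semantics, so
## every function in the package sees the same stored objects.
.cache <- new.env(parent = emptyenv())

## Counter used only to show how often the expensive step runs.
n_fetches <- 0

## Hypothetical stand-in for the expensive ExperimentHub lookup.
.fetch_resource <- function(ehid) {
    n_fetches <<- n_fetches + 1
    paste("data for", ehid)
}

## Return the resource, fetching it only if it is not cached yet.
.get_resource <- function(ehid, fetch = .fetch_resource) {
    if (!exists(ehid, envir = .cache, inherits = FALSE)) {
        assign(ehid, fetch(ehid), envir = .cache)
    }
    get(ehid, envir = .cache, inherits = FALSE)
}

x1 <- .get_resource("EH123")   # first call runs the fetch
x2 <- .get_resource("EH123")   # second call is served from .cache
n_fetches                      # 1: fetched only once
```

Nested functions in the package would then call .get_resource() wherever they need the data, so nothing large has to be passed between functions and users who never touch a resource never pay for loading it.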

Questions about package development are better addressed to the bioc-devel mailing list.

ADD REPLY • link 6.9 years ago Martin Morgan 25k

0

`memoize` is super neat, thanks Martin.

ADD REPLY • link 6.9 years ago Levi Waldron ★ 1.1k

0

Yes, memoize is super neat!

ADD REPLY • link 6.9 years ago Lucas Schiffer ▴ 240

0

Thank you so much, Martin, this is a perfect answer! Also, I apologize for posting to the wrong list. I'll be sure to post to bioc-devel next time.

ADD REPLY • link 6.9 years ago slinker ▴ 20