BiocParallel on Windows: cannot find my functions when using SnowParam
@wolfgang-raffelsberger-5096

Dear list, dear guRus,

First of all, many thanks for all the wonderful packages!

While writing code with BiocParallel that should run parallel computations on both Linux and Windows, I noticed the following surprising behaviour, which ultimately produces an error message.

Note that at this point I'm using Windows! When changing BPPARAM from MulticoreParam() to SnowParam(), functions declared earlier in the session may no longer be available. This happens only when a new function is declared within the bplapply() call; an error message then appears.

In the end I'll switch BPPARAM according to the detected platform, to either MulticoreParam or SnowParam; the rest of the code should remain the same.

So the workaround I see so far consists in avoiding the declaration of new functions within bplapply().

However, I thought sharing this (to me quite unexpected) behaviour might be useful on this list.

Any comments/hints? Am I doing something wrong in the way I'm calling SnowParam()?

Best greetings,

Wolfgang Raffelsberger

## here an example to illustrate my observations on Windows
library("BiocParallel")
myFun1 <- function(x,val) val+sum(c(x,x^2,x^3))
testMu <- bplapply(1:3,myFun1,val=10,BPPARAM=MulticoreParam(workers=3))                           # OK
testSn <- bplapply(1:3,myFun1,val=10,BPPARAM=SnowParam(workers=3,type="SOCK"))                    # OK

## but
testMu <- bplapply(1:3,function(v) myFun1(v,val=10),BPPARAM=MulticoreParam(workers=3))            # OK
testSn <- bplapply(1:3,function(v) myFun1(v,val=10),BPPARAM=SnowParam(workers=3,type="SOCK"))     # error !

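## output of traceback (messages are in French because of the session locale;
## in English: Error: BiocParallel errors / first error: could not find function "myFun1")
> traceback(testSn <- bplapply(1:3,function(v) myFun1(v,val=10),BPPARAM=SnowParam(workers=3,type="SOCK")))
Erreur : BiocParallel errors
  element index: 1, 2, 3
  first error: impossible de trouver la fonction "myFun1"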

## for completeness - output of sessionInfo
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252   
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
[5] LC_TIME=French_France.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocParallel_1.8.1

loaded via a namespace (and not attached):
[1] snow_0.4-2     tools_3.3.2    parallel_3.3.2
@martin-morgan-1513

On Linux and macOS the default back-end is MulticoreParam(). Windows doesn't support MulticoreParam(), so it defaults to SnowParam().

MulticoreParam() uses a 'shared memory' model where the workers share the memory of the calling parent, and so automatically 'know' about functions that are defined in the manager R session.

SnowParam() starts separate processes that do not know about one another at all. It has rules for transferring objects from the manager environment to the worker environment. To understand the rules, one needs to know that every R symbol is defined in an environment, and that environments have 'parent' (possibly empty) environments. Working at the R prompt, one is in the .GlobalEnv environment. The rule is to NOT export symbols in the global environment to the workers. So

register(bpstart(SnowParam(2)))   # active snow cluster for the session
fun1 = function(x) x
result = bplapply(1:2, function(x) fun1(x))

fails -- fun1 is defined in the global environment, but not exported to the worker.

A simple solution is to make sure that the FUN argument to bplapply() references symbols that are either part of base R or are passed in as arguments, so

result = bplapply(1:2, function(x, doit_fun) doit_fun(x), doit_fun=fun1)

works.
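
Applied to the example from the question, this first solution might look like the sketch below. It assumes myFun1 only calls base R functions; the argument names fun and val of the anonymous function are arbitrary choices for illustration.

```r
library(BiocParallel)

myFun1 <- function(x, val) val + sum(c(x, x^2, x^3))

# Pass myFun1 (and val) to the worker function as arguments instead of
# relying on the worker finding them in its own global environment.
testSn <- bplapply(
    1:3,
    function(v, fun, val) fun(v, val = val),
    fun = myFun1, val = 10,
    BPPARAM = SnowParam(workers = 2, type = "SOCK")
)
unlist(testSn)   # 13 24 49
```

The anonymous function no longer refers to any symbol that is defined only in the manager's global environment, so the same call should also run unchanged with MulticoreParam() or SerialParam().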

A second solution is illustrated by

f = function() {
    fun1 = function(x) x
    bplapply(1:2, function(x) fun1(x))
}
result <- f()

This works because the rule is that symbols defined in the environment where bplapply() is invoked (other than the global environment) are forwarded to the worker. The body of each function, e.g., f(), represents an environment; the parent of that environment is the environment in which the function was defined, e.g., the parent environment of f() is the global environment.

The rule about exporting symbols includes parent environments, so

f = function() {
    fun1 = function(x) x
    g = function() {
        bplapply(1:2, function(x) fun1(x))
    }
    g()
}
f()

also works -- bplapply exports the environment g(), and the parent environment of g() (i.e., the environment f()), but not the parent environment of f() (the global environment).

The reason for 'stopping' at the global environment also illustrates a potential hazard. The global environment frequently contains many, sometimes large, objects irrelevant to the calculation, so it would be inefficient to export all of them. Note, though, that the same cost can arise inside a function. With

f <- function(n) {
    m <- integer(n)
    system.time(bplapply(1:2, function(x) x))
}

the evaluation times are

> f(1e6)
   user  system elapsed 
  0.016   0.000   0.093 
> f(1e8)
   user  system elapsed 
  1.052   0.096   1.466 

with the additional cost from sending the (unused) integer vector m to the workers.

The behavior is inherited from the snow and parallel packages, and is not an arbitrary decision of BiocParallel.
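
The same rule can be seen with base R's parallel package alone. The following is a minimal sketch, assuming it is run at the top level of an R session (so that fun1 lives in the global environment); clusterExport() is shown as one explicit way to copy a symbol to the workers.

```r
library(parallel)

cl <- makeCluster(2)          # two independent R worker processes
fun1 <- function(x) x

# At top level, fun1 is in the global environment, which is not shipped to
# the workers, so this call is expected to fail there; the error message
# (or the result, if it succeeds) is captured instead of stopping the script.
failed <- tryCatch(
    parLapply(cl, 1:2, function(x) fun1(x)),
    error = function(e) conditionMessage(e)
)

# Explicitly copy fun1 into the global environment of every worker.
clusterExport(cl, "fun1", envir = environment())
ok <- parLapply(cl, 1:2, function(x) fun1(x))

stopCluster(cl)
```

Exporting by hand works, but it ties the code to a particular cluster object; passing what the worker needs as arguments keeps the call independent of the back-end.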

The function bpvalidate() applied to the function used in bplapply() can help spot problematic code.
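
As a rough illustration, bpvalidate() can be called directly on the function one intends to hand to bplapply(). The names fun1 and worker_fun below are made up for the example, and the exact form of the report (and whether a warning is raised) depends on the BiocParallel version.

```r
library(BiocParallel)

fun1 <- function(x) x
worker_fun <- function(x) fun1(x)   # relies on fun1 from the global environment

# Inspect worker_fun for symbols that a separate worker process may not find.
report <- bpvalidate(worker_fun)
report
```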

Cross-platform package developers should test their code using SnowParam(), to ensure that their package works on Windows or on a cluster where nodes do not share memory.

The 'best practice' when implementing functions that use bplapply() is to do as above -- do NOT specify the default parameter BPPARAM, allowing the user to register() or provide their own back-end.
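
One common way to follow this practice is sketched below, assuming a hypothetical package function sum_powers(): the back-end is an argument that defaults to bpparam(), i.e., whatever the user has registered, rather than being hard-coded.

```r
library(BiocParallel)

# Hypothetical package-style function: the caller chooses the back-end,
# either via register() or by passing BPPARAM explicitly.
sum_powers <- function(xs, val, BPPARAM = bpparam()) {
    bplapply(
        xs,
        function(x, val) val + sum(c(x, x^2, x^3)),
        val = val,
        BPPARAM = BPPARAM
    )
}

# The same code runs with any back-end; SerialParam() is used here only to
# keep the example light.
res <- sum_powers(1:3, val = 10, BPPARAM = SerialParam())
unlist(res)   # 13 24 49
```

Switching to SnowParam() or MulticoreParam() then needs no change inside sum_powers(), which is the platform-independent behaviour the original question was after.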

 


I have been reading several posts about sending data (objects, functions, whatever) to workers, and I just can't seem to get it. It seems to me that the way it is always explained is just impenetrable... I have read about environments etc. I have a situation where a function uses parallel processing inside it, so obviously you want to pass data, arguments etc. from the function call to the workers. I have ended up writing temporary files in the "main" part of the function (with a defined file name) that are loaded by the workers, but surely this cannot be the optimal way...


Start your own question and include a SIMPLE example of what you are trying to do -- the description above isn't enough to understand how to help.


Excuse me, I think I get the same error when using SnowParam, but with MulticoreParam it is OK. I have read all the solutions above, but my code is a little complicated: it uses lapply twice, so I don't know how to change it to match the examples. Could you help me? Thank you very much!

My code is:

result<-BiocParallel::bplapply(1:length(peakgroup.raw), function(peakgroup.num){
    lapply(1:length(speclib), function(speclib.num){
      PKtoDP(peaktable = peakgroup.raw[[peakgroup.num]],
             peaktable.corrected = peakgroup.corrected[[peakgroup.num]],
             scantime = scantime.ms1,
             speclib.single = speclib[[speclib.num]],
             scan.ms1 = scan.ms1,
             scan.ms2 = scan.ms2,
             ms1ppm = ms1ppm,
             ms2ppm = ms2ppm,
             peakgroup.num = peakgroup.num,  # for plot
             massrange.ms1 = massrange.ms1,
             mcicutoff = cutoff,
             windows = file.windows)
    })
  })


'Forking' (the approach used for parallelism with MulticoreParam()) is not supported on Windows; there the code is evaluated serially, in a single process where all functions are known.

I guess you have a script that defines a function `foo()`, and another function `bar()` that uses `foo()`.

> foo = function() "foo"
> bar = function(i) foo()
> bplapply(1:2, bar, BPPARAM=SnowParam(2))
Error: BiocParallel errors
  element index: 1, 2
  first error: could not find function "foo"

Whereas this works with SerialParam() or (on Linux) MulticoreParam().

> res = bplapply(1:2, bar, BPPARAM=SerialParam())
>

The reason is that SnowParam() creates independent R processes where `foo` is not defined, whereas SerialParam() runs in the same R process and MulticoreParam() workers are forked copies of it. SnowParam() doesn't send the .GlobalEnv (the place where foo() is defined) to the workers, but it does send the environment of the function in which bplapply() is called (and its parent environments, up to but not including the global environment), so

baz <- function() {
    foo <- function() "hi"
    bar <- function(i) foo()
    bplapply(1:2, bar, BPPARAM=SnowParam(2))
}

works

> res <- baz()
>
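
For a bplapply() that wraps an inner lapply(), as in the code posted above, the same idea might be sketched as follows. Everything here is a hypothetical stand-in (score_pair, groups, libs and run_all are not from the original code); the point is only that whatever the workers need is handed over as arguments of an enclosing function rather than looked up in the global environment.

```r
library(BiocParallel)

# Hypothetical stand-ins for the real per-pair function and input lists.
score_pair <- function(group, lib) sum(group) * lib
groups <- list(1:3, 4:6)
libs <- list(10, 100)

# The outer loop is parallel, the inner loop is a plain lapply() on the worker.
# libs and score_fun are passed explicitly, so the worker does not need the
# manager's global environment.
run_all <- function(groups, libs, score_fun, BPPARAM = bpparam()) {
    bplapply(
        groups,
        function(group, libs, score_fun) {
            lapply(libs, function(lib) score_fun(group, lib))
        },
        libs = libs, score_fun = score_fun,
        BPPARAM = BPPARAM
    )
}

res <- run_all(groups, libs, score_pair, BPPARAM = SnowParam(2))
unlist(res)   # 60 600 150 1500
```

If the real per-pair function itself calls other helper functions defined at the prompt, those would need the same treatment (or to live in a package that the workers can load).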

Does that help?
