Question

trigger package fails at parallelization - Transcriptional Regulatory Inference from Genetics of ExpRession

affennacken • 0

Dear Bioconductor Community,

the reference manual (October 21, 2014) of the Bioconductor trigger package states that calculations are run in parallel, at least on large datasets (p. 11: trigger.mlink-methods; p. 12: trigger.net-method). That makes sense, because a large number of permutations may be involved. However, I cannot get parallel processing to run, neither on the minimal example provided below nor on larger datasets. As shown in the example below, I am using doMC to handle the parallelization. Should I install a parallelization package other than doMC? Am I misreading the reference manual? Or is the trigger package buggy in this respect?

Help is greatly appreciated,
Kind regards,

Jonas

No parallel processing is achieved using the following code:

library(doMC)
library(trigger)

## registering multiple cores
registerDoMC(cores = 4)

## loading trigger accompanied data:
data(yeast)
attach(yeast)

## sample gene indexes to idx
set.seed(666)
idx <- c(unique(sort(sample(1:nrow(exp), size = 150, replace = F)),383,590,5003,4949))

my_trigger <- trigger.build(exp = exp[idx,], exp.pos = exp.pos[idx,],
                            marker = marker, marker.pos = marker.pos)
my_loclink <- trigger.loclink(my_trigger, window.size = 30000)
my_mlink <- trigger.mlink(my_loclink, B = 100, seed = 666)

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] doMC_1.3.3 iterators_1.0.7 foreach_1.4.2 trigger_1.10.0
[5] qtl_1.33-7 corpcor_1.6.7

loaded via a namespace (and not attached):
[1] codetools_0.2-9 qvalue_1.38.0 sva_3.10.0 tcltk_3.1.1
[5] tools_3.1.1

Written 10.5 years ago by affennacken; updated 10.5 years ago by Valerie Obenchain

score 1 · Answer 1 · 2014-10-23

Hi Jonas,

Functions in the trigger package are not themselves run in parallel. I believe the authors intended that 'idx' would be used as the chunking argument to a parallel function outside the package. You can do this with doMC; you just need a foreach object and evaluation with %dopar%.

library(doMC)
cores <- 4
registerDoMC(cores = cores)
...
...
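
Before chunking the work, it can help to confirm that a parallel backend is actually registered. The following is a minimal sketch, assuming doMC and foreach are installed and the session runs on a Unix-alike; it does not touch the trigger package, and the variable name pids is only illustrative.

```r
library(doMC)      # also attaches foreach

registerDoMC(cores = 4)

# Backend name and number of workers that %dopar% will use.
getDoParName()
getDoParWorkers()

# Each task reports the ID of the process that ran it; more than one
# distinct PID indicates that tasks were spread over forked workers.
pids <- foreach(k = 1:8, .combine = c) %dopar% Sys.getpid()
length(unique(pids))
```

If getDoParWorkers() returns 1, or all PIDs are identical, the loop is most likely running sequentially.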

The gene index should be a list. For this example I'll split into approximately equal groups across the number of workers.

nrows <- nrow(my_loclink@exp)
idx <- split(seq_len(nrows), ceiling(seq_len(nrows)/(nrows/cores)))

> length(idx)
[1] 4
> elementLengths(idx)
 1  2  3  4 
37 38 37 38

Create a foreach object and R expression then evaluate them with %dopar%.

res <- foreach(i = idx) %dopar% {
    trigger.mlink(my_loclink, B=100, i=i, seed=666) }
> res <- foreach(i = idx) %dopar% {
+     trigger.mlink(my_loclink, B=100, i=i, seed=666) }
Error in { : 
  task 1 failed - "Please select at least 100 genes to compute multi-locus linkage for them"

Looks like we need at least 100 genes in each list element for a user-supplied 'idx'. This data set is small, only 150 genes, so we'll fake it just to demonstrate the parallel example.

idx <- list(1:100, 1:100)
res <- foreach(i = idx) %dopar% {
    trigger.mlink(my_loclink, B=100, i=i, seed=666) }

4 cores were specified, but the list has length 2, so we only see 2 workers working ...

> res <- foreach(i = idx) %dopar% {
+     trigger.mlink(my_loclink, B=100, i=i, seed=666) }
[1] Start to calculate multi-locus linkage statistics ...
[1] Start to calculate multi-locus linkage statistics ...
[1] 10% completed
[1] 10% completed
[1] 20% completed
[1] 20% completed
[1] 30% completed
[1] 30% completed
...

and the result -

> res
[[1]]
*** TRIGGER object *** 
Marker matrix with  3244 rows and  112 columns 
Expression matrix with  150 rows and  112 columns 

[[2]]
*** TRIGGER object *** 
Marker matrix with  3244 rows and  112 columns 
Expression matrix with  150 rows and  112 columns
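
On a larger data set, the index list has to respect the 100-gene minimum reported in the error message above. A minimal sketch of one way to build such a list in base R follows; chunk_indices() is a hypothetical helper, not part of trigger, and the default minimum of 100 is taken only from that error message.

```r
# Hypothetical helper (not part of trigger): split the indices 1..n into at
# most `workers` contiguous chunks, each holding at least `min_size` indices.
chunk_indices <- function(n, workers, min_size = 100L) {
    if (n < min_size)
        stop("need at least ", min_size, " indices, got ", n)
    # Largest number of chunks that still keeps every chunk >= min_size.
    n_chunks <- min(workers, n %/% min_size)
    # Assign each index to one of n_chunks near-equal contiguous groups.
    group <- ceiling(seq_len(n) * n_chunks / n)
    unname(split(seq_len(n), group))
}

idx_small <- chunk_indices(150, workers = 4)    # too few genes to split
idx_large <- chunk_indices(1000, workers = 4)   # enough genes for 4 chunks

vapply(idx_small, length, integer(1))
vapply(idx_large, length, integer(1))
```

With only 150 genes the helper returns a single chunk, which is why the example above had to be faked; with 1000 genes it returns four chunks that could be passed to foreach() in place of 'idx'.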

Another option for parallel work is the BiocParallel package.

library(BiocParallel)

Multicore, Snow and BatchJobs backends are supported. We'll use Multicore since you were using doMC.

Register a MulticoreParam with 4 workers.

register(MulticoreParam(workers = 4))

BiocParallel has a family of bp*apply functions that are based on lapply(), sapply(), mapply() etc. but are run in parallel. bplapply() is similar to lapply(); the first argument is a list and each element is passed to FUN.

Create the FUN to be run on each worker.

FUN <- function(i) 
    trigger.mlink(my_loclink, B=100, i=i, seed=666)

Execute bplapply():

res <- bplapply(idx, FUN=FUN)

and we get the same result -

> res
[[1]]
*** TRIGGER object *** 
Marker matrix with  3244 rows and  112 columns 
Expression matrix with  150 rows and  112 columns 

[[2]]
*** TRIGGER object *** 
Marker matrix with  3244 rows and  112 columns 
Expression matrix with  150 rows and  112 columns
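
The same list-in, list-out pattern can also be sketched with mclapply() from the base parallel package, which is already attached in the sessionInfo() above. This swaps in a different function from the ones used in this answer, and FUN_demo is only a placeholder for the trigger.mlink() call, so treat it as an illustration of the pattern rather than a tested trigger workflow.

```r
library(parallel)

# Placeholder for the per-chunk work; in this thread it would be the
# trigger.mlink() call. Here it only summarises the chunk it receives.
FUN_demo <- function(i) c(first = min(i), last = max(i), n = length(i))

# Two chunks of 100 indices each (non-overlapping, unlike the faked example).
idx_demo <- list(1:100, 101:200)

# mclapply() forks workers on Unix-alikes; Windows only supports mc.cores = 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
res_demo <- mclapply(idx_demo, FUN_demo, mc.cores = n_cores)
str(res_demo)
```

As with foreach() and bplapply(), the result is a list with one element per chunk, in the same order as the input list.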

Valerie