Question

SGE backend on Bioc AMI with Starcluster not recognising CPUs


n.huckle ▴ 30


Hello

I am working with a slightly customized Bioconductor AMI (version 3.1) on which I installed my own packages. I am trying to create a larger cluster of 50 spot instances with 32 CPUs each (c3.8xlarge) on Amazon AWS (region: EU Ireland), using the pre-installed StarCluster and the parallel backends described on the Bioconductor AMI help page. The problem is that it is not working.

Three backend options are described on the help page of the Bioconductor AMI, and I am having problems with all of them, most importantly the SGE backend, which is the one I intended to use.

All of the following problems can be reproduced by running the minimal examples described on the help page, but on instances that have more than one CPU.

MPI: Not working. I get "rstudio initialization error: unable to connect to service" after logging in on the master node's RStudio Server login page.
SSH: Returns a "system2" error when calling makeSSHWorker(nodename="nameofnode"), which I traced back to the function "runOScommandlinux".
SGE: It is working, yet apparently does not recognize the CPUs which I specify with
param <- BatchJobsParam(50, resources=list(ncpus=32))
I believe this because a) there is no performance increase from using 50*32=1600 parallel workers and b) the instance workload shown in the AWS console indicates that only a small part of each instance's CPU capacity is used.

Especially regarding the SGE backend, I would appreciate information or help. Have I reached a limit with this many instances and nodes? Does anyone have experience with this?

Thank you very much for any help in advance.

Kind regards,

Nikolai


ADD COMMENT • link 8.7 years ago n.huckle ▴ 30


Hi Nikolai,

I am looking into this. I started investigating issue 2 with the ssh clusters and have found the problem, but I am not sure yet what the solution is.

As for SGE, I am not sure that looking at the performance workload in the AWS console is the best way to determine whether all cores are being used.

What if you instead use the example on the AMI page (which calls system("hostname") on each node), but replace the configuration of param with the way you are already configuring it and change 1:100 to 1:1600?

If things are working correctly you should see a list of 50 nodes with 32 jobs run on each one.

This doesn't tell us precisely whether the jobs really used all cores on each node, I guess, but it at least tells us whether each node in the cluster was used. For the former you might need to know more about SGE than I do. Perhaps under SGE each worker (that is, each combination of node and CPU) has a unique ID that could be printed out? Anyone know?
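The hostname check suggested above can be made a little more systematic. The following is only a sketch: the helper name summarise_hosts and the simulated host names are made up for illustration, and on a real cluster the input would be unlist(xx) from the bplapply() call. An uneven count is not necessarily an error, since the scheduler decides where jobs land.

```r
# Hypothetical helper: tally jobs per host and list hosts whose job count
# differs from the expected number.
summarise_hosts <- function(hosts, expected_nodes = 50, jobs_per_node = 32) {
  counts <- table(hosts)
  list(
    nodes_seen = length(counts),
    all_nodes  = length(counts) == expected_nodes,
    uneven     = names(counts)[as.vector(counts) != jobs_per_node]
  )
}

# Simulated stand-in for unlist(xx): 50 made-up node names, 32 jobs each.
fake_hosts <- rep(sprintf("node%03d", 1:50), each = 32)
res <- summarise_hosts(fake_hosts)
```

If res$all_nodes is FALSE, some nodes never received a job; res$uneven names the hosts that ran more or fewer jobs than expected.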

ADD REPLY • link 8.7 years ago Dan Tenenbaum ★ 8.2k


OK, I have more info on the issue with ssh clusters. It has to do with the fact that the BatchJobs package is installed on the AMI in a non-standard library directory. BatchJobs tries to ssh to each node in the cluster and call R to determine the location of BatchJobs on that node so that it can run a helper script. However, when you run a command on a remote machine with ssh (in contrast to starting an interactive session), it does not read config files (such as ~/.bashrc) that set up your environment. So R can't find BatchJobs and everything fails.
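To see where a package is installed and whether that location is one of R's built-in library directories, something like the following could be run on a node. This is a minimal sketch using only base R; the helper name pkg_location is hypothetical, and "stats" is used only because it ships with every R installation.

```r
# Hypothetical helper: report the library a package was found in and whether
# that library is one of R's default locations (.Library / .Library.site).
pkg_location <- function(pkg) {
  path <- system.file(package = pkg)   # "" when the package cannot be found
  if (!nzchar(path)) {
    return(list(found = FALSE, lib = NA_character_, default = NA))
  }
  lib <- dirname(path)
  defaults <- normalizePath(c(.Library, .Library.site), mustWork = FALSE)
  list(
    found   = TRUE,
    lib     = lib,
    default = normalizePath(lib, mustWork = FALSE) %in% defaults
  )
}

loc <- pkg_location("stats")
```

Calling pkg_location("BatchJobs") in a session that was started without the usual environment setup would show found = FALSE if the package is only reachable through an extra library path.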

The fix is for me to generate the AMIs going forward with BatchJobs installed in the default library directory. I will do this for the BioC 3.2 AMI after 3.2 is released on October 14 and for all new AMIs after that. I won't do it for old ones. And it sounds like you have already customized the AMI for your own needs, so here is how you can work around this issue:

- Start your AMI outside of StarCluster, either with the AWS console or using the aws command line tool.

- ssh to the instance you have started (as the ubuntu user) and issue these commands:

sudo R --vanilla

And then, in R:

install.packages("BatchJobs", repos="http://cran.rstudio.com/")

That will install BatchJobs in the default library location.

Then you can stop the instance and create a new AMI from it (then terminate the instance). Note the AMI ID and replace the AMI ID in your StarCluster config file with the new AMI ID.

Then you should be able to use an ssh cluster. If you run into any issues, post them here.

ADD REPLY • link 8.7 years ago Dan Tenenbaum ★ 8.2k


Hi Nikolai,
Here are the steps I followed to get a function running via BiocParallel (using SGE). Please try to mimic this and tell us whether you're getting the expected output or what's failing:

# Make a directory to contain a Python virtual environment :
mkdir starcluster-experiment
cd starcluster-experiment
# Create a virtual environment for StarCluster
virtualenv venv
# Activate the virtualenv 
source venv/bin/activate

# Install StarCluster
sudo easy_install StarCluster

# Modify StarCluster configuration ( ~/.starcluster/config ) with the appropriate values.
vim ~/.starcluster/config

#
# I only needed to modify the following lines: 
#
# Format is <line number>:<text>

4:[global]
8:DEFAULT_TEMPLATE=smallcluster
17:
21:[aws info]
25:AWS_ACCESS_KEY_ID = << CHANGE TO YOUR ACCESS KEY ID >>
26:AWS_SECRET_ACCESS_KEY= << CHANGE TO YOUR SECRET ACCESS KEY >>
28:AWS_USER_ID= << CHANGE TO YOUR USER ID, SHOULD BE A NUMBER NOT A STRING OF CHARACTERS >>

# Note the name you gave your keypair from https://eu-west-1.console.aws.amazon.com/ec2/v2/home?region=eu-west-1#KeyPairs:sort=desc:keyName 
# You should be using your ssh key ( ~/.ssh/id_rsa ) and you should have uploaded your public key to AWS ( ~/.ssh/id_rsa.pub )

49:[key keypair-brian]
50:KEY_LOCATION=~/.ssh/id_rsa
51:
72:
73:[cluster smallcluster]
75:KEYNAME = keypair-brian
77:CLUSTER_SIZE = 2
79:CLUSTER_USER = ubuntu
82:CLUSTER_SHELL = bash
87:DNS_PREFIX = True

# This may be a discrepancy, I used the devel version of Bioconductor.  Via: http://www.bioconductor.org/help/bioconductor-cloud-ami/#ami_ids 
93:NODE_IMAGE_ID = ami-1f2fe074

96:NODE_INSTANCE_TYPE = m3.medium

# Ensure StarCluster can use HTTP
129:permissions = http
138:
206:
208:[permission http]
209:IP_PROTOCOL = tcp
210:FROM_PORT = 80
211:TO_PORT = 80
212:



# Once all of that is configured, run the starcluster:
starcluster start smallcluster	

# Prove that the cluster is properly configured 
starcluster listclusters

# Your output should be like this, but it should show 50 nodes 
(venv)blong@work:~/Documents/Work/REPOS__git/b-long/starcluster-experiment$ starcluster listclusters
StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

-----------------------------------------------
smallcluster (security group: @sc-smallcluster)
-----------------------------------------------
Launch time: 2015-10-06 14:37:10
Uptime: 0 days, 02:08:32
Zone: us-east-1e
Keypair: keypair-brian
EBS volumes: N/A
Cluster nodes:
    smallcluster-master running i-e738fc45 ec2-54-158-152-91.compute-1.amazonaws.com
    smallcluster-node001 running i-e838fc4a ec2-54-145-219-171.compute-1.amazonaws.com
Total nodes: 2


# Connect to the master node as the Ubuntu user
starcluster sshmaster --user=ubuntu smallcluster

# You should be able to start R and run the following (although be sure to use 32 CPUs rather than 1) : 
ubuntu@smallcluster-master:~$ R

> library(BatchJobs)
> library(BiocParallel)
> param <- BatchJobsParam(2, resources=list(ncpus=1))
> register(param)
> FUN <- function(i) system("hostname", intern=TRUE)
> xx <- bplapply(1:100, FUN)
> table(unlist(xx))


Output should be similar to:

> library(BatchJobs)
Loading required package: BBmisc
Sourcing configuration file: '/home/ubuntu/R-libs/BatchJobs/etc/BatchJobs_global_config.R'
Sourcing configuration file: '/home/ubuntu/.BatchJobs.R'
BatchJobs configuration:
  cluster functions: SGE
  mail.from: 
  mail.to: 
  mail.start: none
  mail.done: none
  mail.error: none
  default.resources: 
  debug: FALSE
  raise.warnings: FALSE
  staged.queries: TRUE
  max.concurrent.jobs: Inf
  fs.timeout: NA

> library(BiocParallel)
'BiocParallel' did not register default BiocParallelParams:
  invalid class “SnowParam” object: 'workers' must be integer(1) and >= 0
> param <- BatchJobsParam(2, resources=list(ncpus=1))
> register(param)
> FUN <- function(i) system("hostname", intern=TRUE)
> xx <- bplapply(1:100, FUN)
SubmitJobs |+++++++++++++++++++++++++++++++++++++++++++++++++| 100% (00:00:00)
Waiting [S:0 D:100 E:0 R:0] |++++++++++++++++++++++++++++++++++| 100% (00:00:00)

> table(unlist(xx))

 smallcluster-master smallcluster-node001 
                  50                   50

Except that your final table should include 50 entries rather than 2.

ADD REPLY • link 8.7 years ago brian.long ▴ 10


Hi Brian,

thank you very much for your reply. I will definitely try out your solution the next time I am working with my StarCluster+Bioconductor setup and report any issues.

When I wrote my post I was under a bit of time pressure, so that I had to implement a dirty workaround to get it to work with the SGE backend. This workaround is described in my reply to Dan's comment.

ADD REPLY • link 8.7 years ago n.huckle ▴ 30


Hi Dan,

thank you very much for your help. I will try out an SSH cluster the next time I am working with my StarCluster+Bioconductor setup and report any issues.

Regarding the SGE issue I reported above: I was in a great hurry to finish my simulations that day, as they were part of my now finished Bachelor thesis and I was already way behind schedule. So I implemented an ad hoc version in which the BatchJobs parameter requests only one core on each of the 50 instances. On each of those 50 cores (each on a separate instance) I then start a function that uses foreach (from the foreach package) to work on the 32 cores of that machine. Here is some example code to make this clearer:

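library(BatchJobs)
library(BiocParallel)
# One SGE job slot per instance; the cores are used from inside FUN
param <- BatchJobsParam(50, resources=list(ncpus=1))
FUN <- function(X) {
  library(foreach)
  library(doParallel)
  # Start one local worker per core on this machine
  cl <- makeCluster(detectCores())
  on.exit(stopCluster(cl))
  registerDoParallel(cl)
  answer2 <- foreach(i = 1:X) %dopar% {
     ### Actual Calculations
  }
  answer2
}
answer <- bplapply(1:1600, FUN, BPPARAM=param)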

Of course, I made sure that detectCores() actually detected all of the 32 cores on the 50 machines - which it did.

To my surprise it worked. Some hasty benchmarking showed that it was significantly faster than just using

param <- BatchJobsParam(50, resources=list(ncpus=32))

and the CPU workload was at 100%. All this information is a bit subjective and not 100% indicative that all 1600 cores were working, but it had to work for me and it did. One definite caveat lies in the behaviour of

BatchJobsParam(50, resources=list(ncpus=32))

If it does not work as intended in recognising the 32 cores on each of the 50 machines, I cannot guarantee that it works perfectly within my ad hoc setup either. Not that I advise anyone to use it.
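For readers who want to see how 1600 tasks could map onto a two-level layout like this, here is a serial sketch in base R. It is not the code that was run for the thesis: one_task and run_chunk are placeholder names, and both lapply-style loops stand in for the parallel calls named in the comments.

```r
# Split 1600 task indices into 50 node-level chunks of 32 tasks each.
n_nodes <- 50
cores_per_node <- 32
tasks <- seq_len(n_nodes * cores_per_node)
chunks <- split(tasks, rep(seq_len(n_nodes), each = cores_per_node))

# Placeholder for the real computation on a single task.
one_task <- function(i) i^2

# Inner level: on a node this loop would be the foreach/%dopar% loop
# spread over the node's cores.
run_chunk <- function(idx) vapply(idx, one_task, numeric(1))

# Outer level: on the cluster this lapply() would be
# bplapply(chunks, run_chunk, BPPARAM = param).
results <- unlist(lapply(chunks, run_chunk), use.names = FALSE)
```

With this layout each outer job receives exactly one chunk, so the number of outer jobs matches the number of instances and each inner loop has one task per core.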

ADD REPLY • link 8.7 years ago n.huckle ▴ 30