Usage

Introduction and quick start

This section walks you through a typical session on "Vesta"; for more general information regarding cluster usage, click here.

Logging in on a cluster

In order to log on to a cluster, you need to be "HPC" enabled. This requires your IT-Coordinator to enable the "HPC-flag" on your UZH account in ITIM. When the "HPC-flag" is enabled, we receive a request for approval. Please create a ticket in our issue tracker to let us know who is requesting access (including the research group you are in) and what you would like to do (a very short description of the task you want to run on the cluster). Once your request has been approved, you will need to set your HPC password in the Tivoli Identity Manager.

Now you can log in to the cluster using the SSH command from a terminal command line (with the password you just set):

ssh shortname@cluster.s3it.uzh.ch

Note that the ssh command comes preinstalled on Linux and macOS; Microsoft Windows users might want to use PuTTY.

Select a partition to work in

The cluster has multiple partitions. Once you are logged in, you should first select the partition you would like to work in. For the "vesta" cluster you would select "vesta", and normally also the "cuda" module. A partition is selected with the `module load` command:

module load vesta cuda
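If you are unsure which modules are available on the login node, the standard module tooling can list them (this assumes the usual Environment Modules / Lmod commands, which provide `module load`):

module avail    # list the modules (such as "vesta" and "cuda") that can be loaded
module list     # show the modules currently loaded in your session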

For more information regarding cluster partitions, click here.

Running computational workloads

There are two modes of running computational workloads: interactive and non-interactive.

Interactive sessions are meant for minor testing, development, and debugging of the job you want to run. For actual computing, please submit a non-interactive batch job. This puts your command in a queue and executes it once enough resources are available to satisfy the request.

First, load the correct environment for the specific cluster, using hydra as an example:

module load hydra

In order to run a non-interactive batch job, you will need to create a job script. Such a script for hydra may be written as follows:

Create a file named `test.job`, and add the following content:


#!/bin/sh
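# print the hostname of the node that runs the batch script itself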
/bin/hostname
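# launch job steps with srun; -l prefixes each output line with the task number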
srun -l /bin/hostname
srun -l /bin/pwd

Once the file is saved, run `sbatch test.job` on the command line to submit this job in non-interactive batch mode. You can check your job in the queue by running `squeue`.
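By default `squeue` lists all jobs on the cluster; to see only your own, you can filter by user name (a standard Slurm option):

squeue -u $USER    # show only the jobs submitted by your account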

With batch jobs, you can perform larger/heavier computations than are allowed with interactive jobs. Of course, since batch jobs are non-interactive, they will abort if they try to stop and wait for user input. Note that only scripts can be executed through the sbatch command; if you need to run a compiled binary, you will need to wrap it in a shell script, as in the sketch below. Detailed examples of job scripts can be found here.
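A minimal wrapper script might look like the following (the binary name and its arguments are placeholders; substitute your own program):

#!/bin/bash
# wrapper so that the binary can be submitted with sbatch
srun /path/to/my_program --input data.txt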

To get a shell on the cluster, you need to request an "interactive job". (Note: interactive jobs are reserved for testing, debugging, and development work.)
You can start an interactive session on hydra with the following sequence of commands:

module load hydra
srun --pty --time=1:0:0 bash -l

The above `srun` command will request an interactive bash session lasting at most 1 hour (--time=1:0:0).

Note: if you don't set the --time option, you will be kicked out of the interactive session after 1 minute, with a "security warning" message.
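If you need a GPU during an interactive session on Vesta, you would typically add the same --gres request used in the batch examples below (a sketch, assuming the Tesla-K80 GPU type shown later on this page):

module load vesta cuda
srun --pty --time=1:0:0 --gres=gpu:Tesla-K80:1 bash -l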

More advanced usage of batch jobs

Vesta can be accessed for computation through the SLURM scheduler. 

For day-to-day usage and normal interaction, however, you only need to know three basic commands:

  • sbatch - submit a batch script
  • squeue - check the status of jobs on the system
  • scancel - delete one of your jobs from the queue

All of the commands above provide useful usage information when invoked with the --help option. More information on these commands can be found here.
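A typical round trip with these three commands might look like this (the job ID 123456 is only a placeholder; use the ID that sbatch reports for your job):

sbatch job.sh      # prints something like "Submitted batch job 123456"
squeue -u $USER    # list your own jobs and their state
scancel 123456     # remove the job with this ID from the queue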

A detailed Slurm user guide can be found here.

Creating job scripts

Vesta is normally used in batch mode. For this you need to create a script with the commands you want to run. The first line specifies the command interpreter (normally bash). Programs should be run using the `srun` command. The following is a simple example.

#!/bin/bash
srun hostname

Use the `sbatch` command to submit the script for execution. If the above script was called job.sh, you would use:

sbatch job.sh

Options can be passed to the `sbatch` command to control job execution. At the very minimum you must specify how long the job will run and request a GPU device. For example,

sbatch --time=10:00:00 --gres gpu:Tesla-K80:1 job.sh

It is often more convenient to include these options directly in the batch file. This is done with an #SBATCH comment like the following.

#!/bin/bash
#SBATCH --time=10:00:00 --gres gpu:Tesla-K80:1
srun hostname

Now you can simply use:

sbatch job.sh

and the options from the batch file will be used. If the same option appears both on the command line and in the batch file, the command-line value is used.
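For example, to reuse the same script for a shorter test run, you could override the time limit on the command line (a sketch; the value is arbitrary):

# the --time given here takes precedence over the #SBATCH --time line in job.sh
sbatch --time=2:00:00 job.sh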

More detailed examples regarding job scripts can be found here.

Caveats

Please be aware of the following restrictions when submitting jobs through the Slurm scheduler.

  • You must request a GPU device, otherwise your job will not have access to one.
  • Default execution time is 1 second; your job will be terminated after that unless you specify its duration with the #SBATCH --time directive, e.g. #SBATCH --time=01:00:00 grants 1 hour of computation time to the job. Note that the maximum running time is normally 24 hours; please ask the systems administrators for an extension if you need one. Both mandatory options are combined in the sketch after this list.
  • For more advanced requirements (multiple GPU devices or extra memory), please contact us directly so we may advise you.
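
Putting the two mandatory options together, a minimal Vesta job script could start like this (a sketch; adjust the time limit and GPU type to your needs, and replace the placeholder program with your own):

#!/bin/bash
# wall-clock limit; without it the job is terminated almost immediately
#SBATCH --time=01:00:00
# request one GPU, otherwise none is visible to the job
#SBATCH --gres=gpu:Tesla-K80:1
srun ./my_gpu_program    # placeholder for your actual command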