Cluster usage

Cluster Manager

Within Science IT we prefer to use Slurm as our cluster workload manager. The Slurm Workload Manager is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job such as MPI) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending jobs.

Slurm uses a best fit algorithm based on Hilbert curve scheduling or fat tree network topology in order to optimize locality of task assignments on parallel computers.

Logging in on a cluster

In order to log in to a cluster you will need to be "HPC" enabled. This requires your IT coordinator to enable the "HPC flag" on your UZH account in ITIM. When the "HPC flag" is enabled, we receive a request for approval. Please create a ticket in our issue tracker to let us know who is requesting access (including the research group you are in) and what you would like to do (a very short description of the task you want to run on the cluster). Once the request is approved, you will need to set your HPC password in the Tivoli Identity Manager.

Now you can log in to the cluster using the ssh command from a terminal (with the password you just set):

ssh shortname@cluster.s3it.uzh.ch

Note that the ssh command is preinstalled on Linux and macOS; Microsoft Windows users might want to use PuTTY.

Partitions

Our clusters are partitioned according to the hardware that is available for the computation. There are currently three partitions you are allowed to use: hydra, vesta, and largemem.

  • hydra: workloads requiring significant memory (>128 GB RAM)
  • vesta: GPU related workloads
  • largemem: legacy workloads requiring significant memory (SGI)

For computational workloads requiring less memory and no GPUs, please request access to ScienceCloud.

To work in the correct partition, execute the corresponding command (a quick way to verify which environment is loaded is shown after the list):

  • for vesta: module load vesta
  • for hydra: module load hydra
  • for largemem: module load cluster/largemem
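
To check which cluster environment is currently loaded, you can use the standard module commands (a quick sketch, assuming the usual Environment Modules tooling is available on the login node):

module list      # show the modules currently loaded in your session
module avail     # list the modules available on this login node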

Running computational workloads

There are two modes of running computational workloads: interactive and non-interactive.

Interactive sessions are meant for minor testing, development and debugging of the job you will want to run. For actual computing, please submit a non-interactive batch job. This will put your command in a queue and execute it when there are enough resources to comply with the request. 

First, load the correct environment for the specific cluster, using hydra as an example:

module load hydra

In order to run a non-interactive batch job, you will need to create a job script. Such a script for hydra may be written as follows:

Create a file named `test.job`, and add the following content:

#!/bin/sh
# Print the hostname of the node on which the batch script itself runs
/bin/hostname
# Run hostname and pwd as job steps, prefixing each output line with the task number (-l)
srun -l /bin/hostname
srun -l /bin/pwd

Once the file is saved, run `sbatch test.job` on the command line to submit this job in non-interactive batch mode. You can check your job in the queue by running `squeue`.
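
By default `squeue` lists every job in the queue; a common refinement (a small sketch, using the standard Slurm user filter) is to restrict the output to your own jobs:

squeue -u $USER      # show only the jobs submitted from your account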

With batch jobs, you can perform larger/heavier computations than are allowed with interactive jobs. Of course, since batch jobs are non-interactive, they will abort if they try to stop and wait for user input. Note that only scripts can be executed through the sbatch command. If you need to run a binary command, you will need to wrap it into a shell script. Detailed examples of job scripts can be found here. 
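
As a minimal illustration of wrapping a binary in a job script, the sketch below combines the memory and time directives described in the Caveats section; `./my_program` is a hypothetical binary and the resource values are placeholders to adapt to your own job:

#!/bin/sh
# Maximum running time: 1 hour
#SBATCH --time=01:00:00
# Memory per core in MB (4 GB here)
#SBATCH --mem-per-cpu=4096
# Run the (hypothetical) binary as a job step through the scheduler
srun ./my_program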

To get a shell on the cluster, you need to request an "interactive job". (Note: interactive jobs are reserved for testing/debugging/development work.)
You can start an interactive session on hydra with the following sequence of commands:

module load hydra
srun --pty --time=1:0:0 bash -l

The above `srun` command will request an interactive bash session lasting at most 1 hour (--time=1:0:0).

Note: if you don't set the --time option, you will be kicked out of the interactive session after 1 minute, with a "security warning" message.
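
The same resource options can be passed to an interactive job. For example (a sketch with placeholder values, using the memory directive described in the Caveats section), to request one hour and 4 GB of memory per core:

module load hydra
srun --pty --time=1:0:0 --mem-per-cpu=4096 bash -l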

Maximum running time

There are four different settings for the maximum running time (QOS). Jobs are killed when they exceed their time limit. The limits are as follows:

  • normal: 24 hours
  • medium: 48 hours
  • long: 7 days
  • verylong: 28 days

By default, the normal QOS is assigned to all new users. If possible, you should run jobs on the cluster that require less than 24 hours: long jobs fill up the queue for other users, so they will not be able to run their (small) jobs, and jobs with a long requested running time may also take longer to start. To be able to use the longer running times, please request access via the issue tracker.
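
Once a longer QOS has been granted, you can request it in your job script. A minimal sketch, assuming the QOS names listed above are used directly as Slurm QOS names:

# Request the "long" QOS (one of: normal, medium, long, verylong)
#SBATCH --qos=long
# The requested time must fit within the chosen QOS (here 7 days)
#SBATCH --time=7-00:00:00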

Cluster Storage

Your home filesystem (/home/cluster/<shortname>) has a soft quota of 20GB and is backed up nightly.  You also have a directory in the `/data` filesystem (visible also as the `data` directory in your home), which has a 200GB soft quota but is not backed up.

Larger data should go on /scratch2, which should be thought of as temporary storage.
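
To keep an eye on your usage against these soft quotas, a simple check with standard tools is enough (a minimal sketch; paths as described above):

du -sh ~         # home directory usage (20GB soft quota)
du -sh ~/data    # data directory usage (200GB soft quota)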

Further reading

Useful Slurm commands can be found here.

Examples of Slurm batch scripts can be found here.

Caveats

Please be aware that the following restrictions apply to the Slurm scheduler.

  • Default memory is set to 1 MB. This will probably cause your job to fail unless you specify, with the #SBATCH --mem-per-cpu directive, the amount of memory (in MB) per core that your job needs. For example, #SBATCH --mem-per-cpu=4096 will reserve 4 GB per core.
  • Default execution time is 1 s. This will cause your job to fail after 1 second unless you specify the duration of your job with the #SBATCH --time directive. For example, #SBATCH --time=01:00:00 will grant 1 hour of computation time to the job. Note that the maximum running time is normally 24 hours; please ask the systems administrators for an extension if you need one.