SLURM-BASICS

Section: CCB Slurm Basics (7)
Last updated: Wed 10 Jan 16:53:42 GMT 2024
 

DEPRECATION NOTICE

This CCB documentation has been superseded by our new website: https://lumin.imm.ox.ac.uk  

OVERVIEW

Using Slurm allows you to run programs on the entire CCB cluster, not just the login nodes. This can be an advantage in three different ways:

1. You can process large data sets by gaining access to systems with close to 2TB of memory (see the queue limits below). This can help in situations where your program is running out of memory, or where you receive an e-mail from us to say that you've used too much memory on one of the head nodes.

2. You can run many independent programs at the same time. This might be the same program on lots of different data sets, multiple different programs, or even both.

3. You can sometimes get a single program to run faster. However, some programs don't scale well with additional CPU cores, so this isn't guaranteed.

In a nutshell, Slurm runs jobs in a queue. A job is a single piece of work you ask Slurm to run on your behalf. Everyone's jobs go into a queue and wait to run. Slurm decides where each job goes in the queue based on the available resources and a fair-share approach. When a job gets to the top of the queue it runs on one of many different servers, called nodes. You get the resources on the node you asked for (no more, no less), and it runs your job as if it were you, in the directory you submitted the job from. Since there are lots of nodes, lots of jobs can run at the same time. When a job finishes, a new job is picked from the queue to run.
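
To make this concrete, the typical workflow looks like the sketch below. The script name and the job ID are illustrative; each command is described in the sections that follow:

#Submit a job script; Slurm replies with a job ID
$ sbatch ./jobscript.sh
Submitted batch job 2345678

#Check where the job is in the queue (PD = pending, R = running)
$ squeue --me

#When it finishes, the job's output appears in slurm-2345678.out
#in the directory you submitted from
$ less slurm-2345678.out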

THE SLURM QUEUES

In order to help ensure that jobs of different sizes all get a fair chance to run on the cluster, JADE has 4 primary queues. The resources assigned to each queue are:

- test: up to 8 cores and 50GB memory, default time and max time 10 minutes

- short: up to 120 cores and 1850GB memory, default time 1 hour, max time 1 day

- long: up to 128 cores and 1900GB memory, default time 1 hour, max time 1 week

- gpu: one node with 4x 24GB GPUs (96GB total), default time 1 hour, max time 1 day

If you have a job which needs more than the maximum allowed time, please contact us directly to discuss your requirements so that we can balance your request against the needs of other users.
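
When you come to write a job script (see CREATING A JOB SCRIPT below), the queue and time limit are requested with #SBATCH directives. As a sketch, a job expected to take about three days exceeds the short queue's one-day limit and so belongs on the long queue (the three-day figure is just an example):

#SBATCH --partition=long
#Format of --time is DAYS-HOURS:MINUTES:SECONDS
#SBATCH --time=3-00:00:00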

 

HOW WE PRIORITISE JOBS

We aim to apply a fair-use policy to the Slurm queue, in which everyone gets a share of the available cluster time. This is implemented by a script which prioritises jobs based on the following criteria:

- If you currently have no running jobs, your first pending job will have the highest possible priority. Subsequent jobs in the queue will each have their priority reduced by one.

- If you have jobs running, your first pending job will have a priority equal to the highest possible priority minus the number of jobs you currently have running. Subsequent jobs in the queue will each have their priority reduced by one.

The overall outcome of this is that those who have fewer jobs running will have their priority boosted over those who have more. There are, however, two important caveats:

- Jobs which request higher numbers of CPU cores and/or greater amounts of memory are significantly harder to schedule, as the queue must wait for a sufficiently large amount of resources to become available. As a result, we highly recommend submitting jobs which use fewer CPU cores for longer amounts of time and which only request as much memory as reasonably necessary.

- We can only offer best efforts to schedule jobs, and cannot guarantee how long they will take to start.
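
If you'd like to see the priority values this policy has assigned to waiting jobs, one way is to add the priority field to the squeue command described under CHECKING THE QUEUE below (the format string is a standard set of squeue fields):

$ squeue --state=PENDING --format="%.10i %.9u %.10Q %.20r"

Higher numbers run sooner; the final column gives the reason each job is still pending.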

 

CHECKING THE QUEUE

You can check the current status of the queue with the squeue(1) command. Its output will look something like this:

  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
2339119     batch  SNV_pop adibabdu PD       0:00      1 (Resources)
2335042     batch bamcover terenova PD       0:00      1 (Priority)
2330696     batch star_ali herrmann  R   17:37:04      1 imm-wn5
2286818     batch average_  hyunlee  R 3-22:54:02      1 imm-wn4
2308243     batch E14_20_m oudelaar  R 2-19:00:14      1 imm-wn1
2308244     batch E14_21_m oudelaar  R 2-18:54:11      1 imm-wn5
2308246     batch E14_23_m oudelaar  R 2-17:59:43      1 imm-wn6
2332744     batch ENCLB555  shusain  R   10:56:05      1 imm-wn7
2332745     batch ENCLB555  shusain  R   10:54:33      1 imm-wn6
2332743     batch ENCLB555  shusain  R   11:00:26      1 imm-wn2

Interesting job states are R for running and PD for pending. The last column shows either the node the job is running on, or the reason it's not running yet. In this example, Resources and Priority are nothing to worry about - the job is just waiting either for enough free resources to run, or behind a higher-priority job.

To see just your own jobs (and not the whole queue), you can use the command:

$ squeue --me
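
For pending jobs you can also ask Slurm for an estimated start time. This is a best guess only, and will change as other jobs finish early or new jobs are submitted:

$ squeue --me --start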
 

CREATING A JOB SCRIPT

Each job that you want to run is described in a job script. This says what resources (time, CPU, memory, etc.) you think you will need reserving, and what you want to run when the job gets to the top of the queue. To give an analogy, the job script is a bit like an online order with two parts:

I want: to run STAR on this data

Deliver it to: 4 CPUs and 10GB memory for 2 hours

For which the job script might look like this:

#!/bin/sh

#Format of --time is DAYS-HOURS:MINUTES:SECONDS
#SBATCH --time=0-02:00:00
#SBATCH --ntasks=4
#SBATCH --mem=10G
#SBATCH --partition=short

module load rna-star

STAR --genomeDir /databank/igenomes/Drosophila_melanogaster/UCSC/dm3/Sequence/STAR/ \
     --outSAMtype BAM SortedByCoordinate \
     --readFilesIn C1_R1_1.fq.gz C1_R1_2.fq.gz \
     --readFilesCommand zcat \
     --outSAMattributes All \
     --outFileNamePrefix C1_R1_ \
     --limitBAMsortRAM 7000000000 \
     --runThreadN 4

Here, we have 4 lines which tell Slurm how much time we think the job will need, how many CPU cores we want, how much memory we want and which queue we want it to run on. If you miss any of these out, you'll get 1 hour, 1 CPU, 15GB memory and the short queue respectively.

After that, it's just the list of commands we want to run when the job starts. Note that the backslash (\) character is Bash's line-continuation character and is used for legibility so that lines don't run off the edge of the screen.
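
By default the job's output (anything your commands print) goes to a file called slurm-<jobid>.out in the directory you submitted from. If you run lots of jobs it helps to give them names; a sketch using two further standard Slurm directives (the name star_C1 is just an example) is:

#SBATCH --job-name=star_C1
#Write the output to <jobname>_<jobid>.out instead of slurm-<jobid>.out
#SBATCH --output=%x_%j.out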

For more information about estimating the amount of time, CPU and memory your job will need, please refer to profiling(7).  

SUBMITTING A JOB

Basic job submission is with the command sbatch(1), so a simple minimal job submission could be just:

$ sbatch ./jobscript.sh

but you can also specify a partition (queue), number of tasks (CPU cores) and amount of memory - if you have not already done so inside the script itself - like so:

$ sbatch --partition=short --ntasks=1 --mem=10G ./jobscript.sh

The standard nodes have 128 cores and 2TB memory (though you can only request up to ~1950GB for a single job).  
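
If you want to run the same analysis over many data sets (point 2 in the OVERVIEW), you can either call sbatch in a loop or submit a Slurm job array. Below is a minimal sketch of the array approach, in which the input file names and my_analysis_command are hypothetical placeholders:

#!/bin/sh
#SBATCH --time=0-02:00:00
#SBATCH --ntasks=1
#SBATCH --mem=10G
#SBATCH --partition=short
#Run ten copies of this script, numbered 0 to 9
#SBATCH --array=0-9

#Each copy sees its own number in SLURM_ARRAY_TASK_ID
my_analysis_command sample${SLURM_ARRAY_TASK_ID}.fq.gz

Submit the script once with sbatch and Slurm queues the ten copies as separate jobs, each subject to the usual priority rules.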

SUBMITTING AN INTERACTIVE JOB

There may be times when you really need to interactively run some code with access to more CPU and memory, for example when testing a new pipeline. To do so, you request an interactive session on one of the compute nodes using srun(1):

$ srun --partition=short --cpus-per-task=4 --mem=32G --pty bash -i

In doing so, please be aware that you are taking up space on the cluster which is wasted if you leave it unused. As such, we ask that you only request what you'll use, and note that interactive CPU jobs are only permitted on the 'test' and 'short' queues. Sessions left idle for extended periods will be terminated without notice.  
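
Once the session starts, your prompt is running on a compute node rather than a login node. A quick sketch of checking this and finishing cleanly (the node name is illustrative):

$ hostname
imm-wn3
...run your interactive work...
$ exit

Typing exit ends the session and releases the CPUs and memory for other users.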

SUBMITTING A GPU JOB

JADE has a single node containing 4x Nvidia Titan RTX 24GB GPUs (Turing architecture, supporting CUDA compute capability 7.5). To submit a job to the GPU node you need to specify two additional parameters: the gpu partition and the number of GPUs you need, using the --gpus argument. We appreciate that GPU workloads are distinct from CPU workloads, so we permit both batch and interactive jobs, but we are monitoring use and may impose tighter restrictions on interactive GPU jobs (e.g. a reduced maximum time).

Batch job example:

#SBATCH --partition=gpu
#SBATCH --gpus=2
...

Interactive job example:

$ srun --partition=gpu --gpus=2 ...

You can confirm GPU acquisition by running nvidia-smi from within the GPU node. It will list the accessible GPUs and their utilisation - in the example below, 2 GPUs were requested.

$ nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA TITAN RTX               Off | 00000000:1E:00.0 Off |                  N/A |
| 35%   44C    P0              66W / 280W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN RTX               Off | 00000000:1F:00.0 Off |                  N/A |
| 21%   42C    P0              39W / 280W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Please note that there are only 4 GPUs, so please be considerate in how many you request; typically this is determined by how much total GPU memory you need. Please also confirm that your code is running on the GPU rather than the CPU, using nvidia-smi.

CUDA driver 12.2 is preinstalled. To load a CUDA compiler simply load one of our CUDA modules:

$ module load cuda
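
Putting the pieces together, a GPU batch script might look like the sketch below. The time, memory and the final (commented-out) command are illustrative placeholders; replace them with your own requirements:

#!/bin/sh
#SBATCH --partition=gpu
#SBATCH --gpus=2
#SBATCH --time=0-04:00:00
#SBATCH --mem=32G

module load cuda

#List the GPUs Slurm has allocated to this job
nvidia-smi -L

#Run your GPU program here, for example:
#python train_model.py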

 

CANCELLING A JOB

To stop a job, use scancel(1) and the relevant JOBID. For example:

$ scancel 342552
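
To cancel all of your own jobs at once (for example, a misfiring job array), you can pass your username instead of a job ID. Use this with care, as it cancels pending and running jobs alike:

$ scancel --user=$USER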
 

JOB PROFILING

Once you get the hang of running your jobs, you'll notice that the output contains statistics and ASCII charts. This is job profiling information and it's a key tool for optimising your workflow. For more information on job profiling, see profiling(7).  
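
Independently of the CCB profiling output, Slurm's own accounting records can be queried with sacct(1) once a job has finished. This is a quick way to compare what you requested with what the job actually used (a sketch; the job ID is illustrative and the field names are standard sacct format options):

$ sacct -j 2345678 --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,State

Here MaxRSS is the peak memory the job used, which helps when deciding how much to request next time.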

GETTING HELP

You can email the CCB team using the email address help@imm.ox.ac.uk. Using this address ensures your email is logged and assigned a tracking number, and will go to all the core team, which means the appropriate person or people will be able to pick it up.  

COPYRIGHT

This text is copyright University of Oxford and MRC and may not be reproduced or redistributed without permission.  

AUTHOR

Duncan Tooke <duncan.tooke@imm.ox.ac.uk>  

SEE ALSO

sbatch(1), scancel(1), sinfo(1), squeue(1), srun(1), profiling(7)


 
