SLURM-BASICS

Section: CCB Slurm Basics (7)

OVERVIEW

Using Slurm allows you to run programs on the entire CCB cluster, not just the login nodes. This can be an advantage in three different ways:

1. You can process larger data sets by gaining access to systems with up to 1TB of memory. This can help if your program is running out of memory, or if you have received an e-mail from us saying that you've used too much memory on one of the head nodes.

2. You can run many independent programs at the same time. This might be the same program on lots of different data sets, multiple different programs, or even both.

3. You can sometimes get a single program to run faster. However, some programs don't scale well with additional CPU cores, so this isn't guaranteed.

In a nutshell, Slurm runs jobs in a queue. A job is a single piece of work you ask Slurm to run on your behalf. Everyone's jobs go into a queue and wait to run. Slurm decides where each job sits in the queue based on the available resources and a fair-share approach. When a job gets to the top of the queue it runs on any one of many different servers, called nodes. You get the resources on the node that you asked for (no more, no less), and Slurm runs your job as if it were you, in the directory you submitted the job from. Since there are lots of nodes, lots of jobs can run at the same time. When a job finishes, a new job is picked from the queue to run.

HOW WE SCHEDULE JOBS

We aim to apply a fair-use policy to the Slurm queue, in which everyone gets a share of the available cluster time. This is implemented by a script which prioritises jobs based on the following criteria:

- If you currently have no running jobs, your first pending job will have the highest possible priority. Subsequent jobs in the queue will each have their priority reduced by one.

- If you have jobs running, your first pending job will have a priority equal to the highest possible priority minus the number of jobs you currently have running. Subsequent jobs in the queue will each have their priority reduced by one.

The overall outcome of this is that those who have fewer jobs running will have their priority boosted over those who have more. There are, however, two important caveats:

- Jobs which request higher numbers of CPU cores and/or greater amounts of memory are significantly harder to schedule, as the queue must wait for a sufficiently large amount of resources to become available. As a result, we highly recommend submitting jobs which use fewer CPU cores for longer amounts of time and which only request as much memory as reasonably necessary.

- We can only make a best effort to schedule jobs, and cannot guarantee how long they will take to start.
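
If you are curious about the priority values your pending jobs have been given, squeue(1) can display them. For example (the format string here is only an illustration; %Q is the priority field):

$ squeue --me --states=PENDING --format="%.10i %.9P %.20j %.10Q %.20R"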

 

CHECKING THE QUEUE

You can check the current status of the queue with the squeue(1) command. Its output will look something like this:

  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
2339119     batch  SNV_pop adibabdu PD       0:00      1 (Resources)
2335042     batch bamcover terenova PD       0:00      1 (Priority)
2330696     batch star_ali herrmann  R   17:37:04      1 cbrgwn015p
2286818     batch average_  hyunlee  R 3-22:54:02      1 cbrgwn004p
2308243     batch E14_20_m oudelaar  R 2-19:00:14      1 cbrgwn019p
2308244     batch E14_21_m oudelaar  R 2-18:54:11      1 cbrgwn015p
2308246     batch E14_23_m oudelaar  R 2-17:59:43      1 cbrgwn016p
2332744     batch ENCLB555  shusain  R   10:56:05      1 cbrgwn017p
2332745     batch ENCLB555  shusain  R   10:54:33      1 cbrgwn006p
2332743     batch ENCLB555  shusain  R   11:00:26      1 cbrgwn002p

Interesting job states are R for running and PD for pending. The last column shows either the node the job is running on, or the reason it's not running (yet). In this example, Resources and Priority are OK - the job is just waiting for either free space to run, or waiting behind a higher priority job.

To see just your own jobs (and not the whole queue), you can use the command:

$ squeue --me
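
If you would like an estimate of when your pending jobs might start, squeue can also report Slurm's predicted start times. These are only estimates and may be blank for some jobs:

$ squeue --me --start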
 

CREATING A JOB SCRIPT

Each job that you want to run is described in a job script. This describes what resources (time, CPU, memory, etc.) you think you will need reserved for your use, and also what you want to do when the job gets to the top of the queue and runs. To give an analogy, the job script is a bit like an online order with two parts:

I want: to run STAR on this data

Deliver it to: 4 CPUs and 10GB memory for 2 hours

For which the job script might look like this:

#!/bin/sh

#Format of --time is DAYS-HOURS:MINUTES:SECONDS
#SBATCH --time=0-02:00:00
#SBATCH --ntasks=4
#SBATCH --mem=10G

module load rna-star

STAR --genomeDir /databank/igenomes/Drosophila_melanogaster/UCSC/dm3/Sequence/STAR/ \
     --outSAMtype BAM SortedByCoordinate \
     --readFilesIn C1_R1_1.fq.gz C1_R1_2.fq.gz \
     --readFilesCommand zcat \
     --outSAMattributes All \
     --outFileNamePrefix C1_R1_ \
     --limitBAMsortRAM 7000000000 \
     --runThreadN 4

Here, we have three lines which tell Slurm how much time we think the job will need, how many CPU cores we want, and how much memory we want. After that, it's just the list of commands we want to run when the job starts. Note that the backslash (\) character is Bash's line continuation character and is used for legibility so that lines do not run off the end of the screen.
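
There are many other #SBATCH directives you can add if you find them useful. Two common ones, shown here purely as an illustration (the names are made up for this example), give the job a recognisable name in the queue and send its output to a file of your choosing:

#SBATCH --job-name=star_C1_R1        # name shown in the squeue output
#SBATCH --output=star_C1_R1_%j.log   # file for the job's output; %j expands to the job ID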

For more information about estimating the amount of time, CPU and memory your job will need, please refer to profiling(7).  

SUBMITTING A JOB

Basic job submission is done with the sbatch(1) command, so a simple minimal job submission could be just:

$ sbatch ./jobscript.sh

but you can also specify a partition (queue), the number of CPU cores, and the amount of memory on the command line - if you have not already done so inside the script itself - like so:

$ sbatch -p batch --ntasks=1 --mem=10G ./jobscript.sh
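
In either case, sbatch prints the ID of the newly submitted job, which is what you will need later for commands such as scancel(1). The output looks something like this (the job number is just an example):

$ sbatch ./jobscript.sh
Submitted batch job 2345678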

The standard nodes have ~240GB of memory and 24 cores, so asking for 120GB of memory means that at most two such jobs can run on a node at the same time.
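
If you are unsure how much memory a previous, similar job actually used, and job accounting is enabled, sacct(1) can report it. The job ID and the choice of fields below are just an example:

$ sacct -j 2345678 --format=JobID,JobName,MaxRSS,Elapsed,State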

If you'd like to submit a job on the GPU nodes, you need to specify two things: the gpu partition, and the number of GPUs you need using the --gpus argument. For example:

#SBATCH -p gpu
#SBATCH --gpus=2

in your job script would launch a job with access to 2 GPUs. Please note that most of the GPU nodes only have 2 GPUs, that the most any single node has is 4, and that the GPUs can be quite heavily used. As such, it's best to ask for only 1 or 2 GPUs if you need your job to run in the foreseeable future.
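
Putting this together, a minimal GPU job script might look something like the sketch below. The resource requests are only an example, and the final command is a placeholder for whatever GPU software you actually want to run:

#!/bin/sh

#SBATCH -p gpu
#SBATCH --gpus=1
#SBATCH --time=0-04:00:00
#SBATCH --mem=20G

# Report which GPU(s) the job was given, then run your own GPU program here.
nvidia-smi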

CANCELLING A JOB

To stop a job, use scancel(1) and the relevant JOBID. For example:

$ scancel 342552
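
scancel can also act on more than one job at once. For example, to cancel all of your own jobs in one go:

$ scancel --user=$USER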
 

JOB PROFILING

Once you get the hang of running your jobs, you'll notice that the output contains statistics and ASCII charts. This is job profiling information and it's a key tool for optimising your workflow. For more information on job profiling, see profiling(7).  

GETTING HELP

You can email the CCB team using the email address help@imm.ox.ac.uk. Using this address ensures your email is logged and assigned a tracking number, and will go to all the core team, which means the appropriate person or people will be able to pick it up.  

COPYRIGHT

This text is copyright University of Oxford and MRC and may not be reproduced or redistributed without permission.  

AUTHOR

Duncan Tooke <duncan.tooke@imm.ox.ac.uk>  

SEE ALSO

sbatch(1), scancel(1), sinfo(1), squeue(1), profiling(7)


 
