SLURM Job Management (Queueing) System

The new system that will manage user jobs on CARL and EDDY is SLURM (formerly known as the Simple Linux Utility for Resource Management).

SLURM is a free and open-source job scheduler for Linux and Unix-like kernels and is used on about 60% of the world's supercomputers and computer clusters. If you have used the job scheduler of FLOW and HERO (Sun Grid Engine or SGE), it will be easy to get used to SLURM because its concepts are quite similar.

SLURM provides three key functions:

  • it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work
  • it provides a framework for starting, executing and monitoring work (typically a parallel job) on a set of allocated nodes
  • it arbitrates contention for resources by managing a queue of pending work


Submitting Jobs

The following lines of code are an example job script. All it does is generate random numbers, save them in random_numbers.txt and sort them afterwards.

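#!/bin/bash

#SBATCH --nodes=1                     # run on a single node
#SBATCH --ntasks=1                    # one task
#SBATCH --mem=2G                      # 2 GB of memory per node
#SBATCH --time=0-2:00                 # time limit: 0 days, 2 hours, 0 minutes
#SBATCH --output=slurm.%j.out         # file for standard output (%j = job ID)
#SBATCH --error=slurm.%j.err          # file for standard error
#SBATCH --mail-type=END,FAIL          # send mail when the job ends or fails
#SBATCH --mail-user=your.name@uol.de  # recipient of the mail

# append 100000 random numbers to the file
for i in {1..100000}; do
    echo $RANDOM >> random_numbers.txt
done

# print the numbers in ascending numerical order
sort -n random_numbers.txt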

This sbatch script (or "job script") sets the general options for sbatch. The options must be given on lines starting with "#SBATCH", and these lines must come before any executable command.
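A quick way to catch a misplaced option is to scan the script before submitting it. The following is only a minimal sketch: the function name check_sbatch_order is hypothetical and not part of SLURM, and it assumes that every line which is neither blank nor a comment counts as an executable command.

```shell
# check_sbatch_order FILE
# Report "#SBATCH" lines that appear after the first executable command.
# Returns 0 if all directives are placed correctly, 1 otherwise.
check_sbatch_order() {
    awk '
        /^#SBATCH/ {
            if (seen_cmd) {
                printf "line %d: ignored directive: %s\n", NR, $0
                bad = 1
            }
            next
        }
        /^[ \t]*#/ || /^[ \t]*$/ { next }   # other comments (incl. shebang) and blank lines
        { seen_cmd = 1 }                    # anything else is treated as a command
        END { exit bad + 0 }
    ' "$1"
}
```

It could be called as "check_sbatch_order myscript.job" right before the sbatch command.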

To submit your job you have to use the following command:

sbatch -p carl.p myscript.job (if your script is named "myscript.job", of course)

You have to add a partition to the sbatch command (with "-p"). For tutorial purposes we are using the "carl.p" partition. If you are submitting real jobs, you should always specify a partition that fits your needs. You can see all possible partitions with the command

sinfo

Further information about the command "sinfo" can be found on the sinfo page.
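On success, sbatch prints a line of the form "Submitted batch job <jobid>". If you submit from a wrapper script, it can be handy to keep that job ID for later use. The sketch below is one way to do it; the helper name job_id_from_sbatch is hypothetical, and the submission itself is only attempted when sbatch and the example script myscript.job are actually present.

```shell
# job_id_from_sbatch: read sbatch output on stdin and print only the job ID.
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Submit only if sbatch is available and the example script exists.
if command -v sbatch >/dev/null 2>&1 && [ -f myscript.job ]; then
    job_id=$(sbatch -p carl.p myscript.job | job_id_from_sbatch)
    echo "submitted job ${job_id}"
fi
```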

Information on sbatch-options

The options in the example script shown above are common and should be used in all of your scripts (except the mail option).

--nodes=<minnodes[-maxnodes]> or -N
With this option you request the nodes needed for your job. You can specify a minimum and a maximum number of nodes. If you specify only one number, it is used as both the minimum and the maximum node count. If the node limit defined in the job script is outside the range permitted for the associated partition, the job is left in a PENDING state. If it exceeds the number of nodes actually configured in the partition, the job is rejected.
--ntasks=<number> or -n
sbatch does not launch tasks itself; it requests an allocation of resources and submits a batch script. This option advises the Slurm controller that job steps run within the allocation will launch at most <number> tasks, so that sufficient resources are provided. The default value for ntasks is 1.
--mem=<MB>
You can specify the real memory required per node in megabytes. To keep the numbers small you can use the suffixes K (KB), M (MB), G (GB) and T (TB), e.g. --mem=2G for 2 GB of memory.
Important note: it is no longer possible to use floating-point numbers such as --mem=6.4GB in your job script. Convert the value to MB instead: --mem=6400M.
--mem-per-cpu=<size[units]>
With this parameter you can specify the minimum memory required per allocated CPU. The default unit is megabytes.
--time=<time> or -t
Use this to set a limit on the total runtime of the job allocation. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
--output=<filename pattern> or -o and --error=<filename pattern> or -e
By default, both standard output and standard error are directed to the same file. With these options you instruct Slurm to connect the batch script's standard output and standard error directly to the files named by the respective "filename pattern". The default file name is "slurm-%j.out", where %j is replaced by the job ID.
--mail-type=<type>
It is possible to inform the user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL, STAGE_OUT, TIME_LIMIT, TIME_LIMIT_90 (reached 90 percent of time limit), TIME_LIMIT_80 (reached 80 percent of time limit), TIME_LIMIT_50 (reached 50 percent of time limit) and ARRAY_TASKS (send emails for each array task). Multiple type values may be specified by separating them with commas.
--mail-user=<user>
Define a user to receive email notification of state changes as defined by --mail-type.
--gres=<list>
Specifies a comma delimited list of generic consumable resources. The format of each entry on the list is "name[[:type]:count]". The specified resources will be allocated to the job on each node.

For example:

--gres=gpu:1 - This will request one GPU per node.
--gres=gpu:2 - This will request two GPUs per node.


This is just a small part of all possible options. A complete list with explanations can be found on the SLURM homepage.
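The time formats accepted by --time are easy to mix up, because a plain number means minutes while "a:b" means minutes:seconds. The following sketch normalises a time string to seconds so that different notations can be compared. The function name slurm_time_to_seconds is hypothetical (it is not a SLURM command), and the input is assumed to be well-formed.

```shell
# slurm_time_to_seconds SPEC
# Convert a --time value in one of the documented formats to seconds.
slurm_time_to_seconds() {
    echo "$1" | awk '{
        days = 0; rest = $0
        if (index(rest, "-") > 0) {
            # days-hours[:minutes[:seconds]]
            split(rest, d, "-"); days = d[1]; rest = d[2]
            n = split(rest, f, ":")
            h = f[1]; m = (n >= 2) ? f[2] : 0; s = (n >= 3) ? f[3] : 0
        } else {
            n = split(rest, f, ":")
            if (n == 3)      { h = f[1]; m = f[2]; s = f[3] }   # hours:minutes:seconds
            else if (n == 2) { h = 0;    m = f[1]; s = f[2] }   # minutes:seconds
            else             { h = 0;    m = f[1]; s = 0 }      # minutes
        }
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}

slurm_time_to_seconds 0-2:00    # the limit from the example script: 7200
slurm_time_to_seconds 120       # the same limit given in minutes: 7200
```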

Job Arrays

SLURM also allows the use of job arrays, i.e. jobs that are executed multiple times and, if resources are available, in parallel. Each array task works on an independent piece of work; the typical use case is a parameter study. A job array is created with the SLURM option

#SBATCH --array=1-100:2%10    # alternative -a <array-specs> for short

where in this example the job creates 50 array tasks with task ids 1, 3, ..., 97, 99. A maximum of 10 tasks is executed in parallel here.

More generally, the array is specified with the <array-specs>, which have the format

<range>:<step>%<tasklimit>

where <range> is e.g. 1-100, 100-199 or 4 (short for 4-4), <step> is the increment and <tasklimit> is the limit on the number of tasks running in parallel. The latter can be used to avoid blocking too many resources with a single job array. Be kind to others! Multiple ranges can be combined in a comma-separated list (e.g. 1-9:2,2-10:2 instead of 1-10).
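To check which task IDs a given specification produces before submitting, the format above can be expanded locally. This is a minimal sketch, not a SLURM tool: the function name expand_array_spec is hypothetical, the input is assumed to be well-formed with a positive step, and the %<tasklimit> part is simply dropped because it does not change the IDs.

```shell
# expand_array_spec SPEC
# Print the task IDs described by an array specification, one per line.
expand_array_spec() {
    echo "$1" | awk '{
        sub(/%.*/, "")                      # drop the %<tasklimit> suffix
        n = split($0, parts, ",")           # comma-separated list of ranges
        for (i = 1; i <= n; i++) {
            step = 1
            if (split(parts[i], rs, ":") == 2) step = rs[2]
            if (split(rs[1], lh, "-") == 2) { lo = lh[1]; hi = lh[2] }
            else                            { lo = lh[1]; hi = lh[1] }
            for (id = lo + 0; id <= hi + 0; id += step) print id
        }
    }'
}

expand_array_spec '1-100:2%10' | wc -l    # number of tasks in the example above
```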

Note: SLURM restricts the size of job arrays and also the maximum value in a range to 1000. This is a rather severe limit for some users, but there are some workarounds.

When working with an array, the variable $SLURM_ARRAY_TASK_ID is set for each individual task. This gives you a value that differs from task to task, which you can use within the script to select the parameters of your calculation.
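The task ID can also be used to stay below the size limit mentioned in the note: each array task then processes a block of several work items instead of a single one. This is only one possible workaround and an assumption of this example, not necessarily the one recommended for CARL and EDDY. The function name items_for_task is hypothetical.

```shell
# items_for_task TASK_ID ITEMS_PER_TASK
# Print the work item numbers that belong to one array task (IDs start at 1).
items_for_task() {
    first=$(( ($1 - 1) * $2 + 1 ))
    last=$(( $1 * $2 ))
    i=$first
    while [ "$i" -le "$last" ]; do
        echo "$i"
        i=$(( i + 1 ))
    done
}

# Inside a job script SLURM sets SLURM_ARRAY_TASK_ID; default to 1 for a local test.
items_for_task "${SLURM_ARRAY_TASK_ID:-1}" 10
```

With 10 items per task, an array of 1000 tasks would cover 10000 work items.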

An array script could look like this:

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --time=0-2:00
#SBATCH --output=slurm.%A_%a.out   # %A becomes the job ID of the array, %a becomes the array index
#SBATCH --error=slurm.%A_%a.err
#SBATCH --array=1-20%5   # run 20 tasks, at most 5 of them in parallel

# Your calculations
# for each task

sleep 1   # give the scheduler a bit of time to work
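
For a parameter study, the "calculations for each task" part often starts by looking up the parameter that belongs to the current task. A minimal sketch of this, assuming a hypothetical file params.txt with one parameter per line (line 1 for task 1, and so on), could look like this; the function name param_for_task is likewise made up for this example.

```shell
# param_for_task TASK_ID FILE
# Print the line of FILE whose number equals the task ID.
param_for_task() {
    sed -n "${1}p" "$2"
}

# Inside a job script SLURM sets SLURM_ARRAY_TASK_ID; default to 1 for a local test.
task_id=${SLURM_ARRAY_TASK_ID:-1}
if [ -f params.txt ]; then
    param=$(param_for_task "$task_id" params.txt)
    echo "task ${task_id} uses parameter ${param}"
fi
```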

Documentation

If you want to learn more about the SLURM Management System you can visit the documentation page on the official homepage of SLURM.