SLURM Job Management (Queueing) System

The new system that will manage the user jobs on CARL and EDDY will be SLURM (formerly known as Simple Linux Utility for Resource Management).

SLURM is a free and open-source job scheduler for Linux and Unix-like kernels and is used on about 60% of the world's supercomputers and computer clusters. If you used the job scheduler of FLOW and HERO (Sun Grid Engine or SGE), it will be easy to get used to SLURM because the concept of SLURM is quite similar.

SLURM provides three key functions:

  • it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work
  • it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes
  • it arbitrates contention for resources by managing a queue of pending work


Submitting Jobs

The following lines of code are an example job script. It generates random numbers, saves them in random_numbers.txt, and sorts them afterwards.

#!/bin/bash

#SBATCH --nodes=1                    
#SBATCH --ntasks=1                  
#SBATCH --mem=2G                  
#SBATCH --time=0-2:00                
#SBATCH --output=slurm.%j.out        
#SBATCH --error=slurm.%j.err          
#SBATCH --mail-type=END,FAIL         
#SBATCH --mail-user=your.name@uol.de 

for i in {1..100000}; do
echo $RANDOM >> random_numbers.txt
done

sort random_numbers.txt

This sbatch script (or "job script") sets general options for sbatch. All options must be given on lines starting with "#SBATCH" before any executable command.
Please note that no command may appear before an #SBATCH option! sbatch stops processing #SBATCH lines at the first executable command, so any options after that point are ignored, which can lead to different time limits or other unwanted node allocations.

To submit your job, use the following command:

sbatch -p carl.p myscript.job (if your script is named "myscript.job", of course)

You have to add a partition to the sbatch command (with "-p"). For tutorial purposes we are using the "carl.p" partition; if you are submitting real jobs, you should always specify a partition that fits your needs. You can see all available partitions with the command

sinfo

Further information about the command "sinfo" can be found here: sinfo
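
If you submit jobs from another script, it can be handy to capture the job ID. The following is only a minimal sketch: submit_job is a hypothetical helper name, and it relies on the sbatch option --parsable, which prints just the job ID (followed by ";cluster" on multi-cluster setups).

```shell
#!/bin/bash

# Hypothetical helper (not part of Slurm): submit a job script to a partition
# and print only the job ID.
submit_job() {
    local partition=$1 script=$2 output
    if ! command -v sbatch > /dev/null 2>&1; then
        echo "sbatch not found (are you on a cluster login node?)" >&2
        return 1
    fi
    # --parsable makes sbatch print "jobid" or "jobid;cluster"
    output=$(sbatch --parsable -p "$partition" "$script") || return 1
    echo "${output%%;*}"   # strip an optional ";cluster" suffix
}

# Usage on the cluster:
#   job_id=$(submit_job carl.p myscript.job) && echo "submitted job $job_id"
```

The function only wraps the sbatch call shown above; partition and script name are passed in unchanged.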

Information on sbatch-options

The options in the example script shown above are common and should be used in all of your scripts (except the mail option).

--nodes=<minnodes[-maxnodes]> or -N
With this option you request the nodes needed for your job. You can specify a minimum and a maximum number of nodes. If you specify only one number, it is used as both the minimum and the maximum node count. If the node limit defined in the job script is outside of the range permitted for its associated partition, the job will be left in a PENDING state. If it exceeds the actual number of nodes configured in the partition, the job will be rejected.
--ntasks=<number> or -n
Instead of launching tasks, sbatch requests an allocation of resources and submits a batch script. This option advises the Slurm controller that job steps run within the allocation will launch at most <number> tasks, so that sufficient resources can be provided. The default value for ntasks is 1.
--mem=<MB>
You can specify the real memory required per node in megabytes. To keep the numbers small, you can use the suffixes K (KB), M (MB), G (GB) and T (TB), e.g. --mem=2G for 2 GB of memory.
Important note: it is no longer possible to use floating-point numbers such as --mem=6.4G in your job script. Convert the value to MB instead: --mem=6400M.
--mem-per-cpu=<size[units]>
With this parameter you can specify the minimum memory required per allocated CPU. Default units are megabytes.
--time=<time> or -t
Use this to set a limit on the total runtime of the job allocation. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
--output=<filename pattern> or -o and --error=<filename pattern> or -e
By default, both standard output and standard error are directed to the same file, named "slurm-%j.out", where %j is replaced by the job ID. With these options you instruct Slurm to connect the batch script's standard output (--output) and standard error (--error) directly to the files named by the respective "filename pattern".
--mail-type=<type>
It is possible to inform the user by email if certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL, STAGE_OUT, TIME_LIMIT, TIME_LIMIT_90 (reached 90 percent of time limit), TIME_LIMIT_80 (reached 80 percent of time limit), TIME_LIMIT_50 (reached 50 percent of time limit) and ARRAY_TASKS (send emails for each array task). Multiple type values may be specified by separating them with commas.
--mail-user=<user>
Define a user to receive email notification of state changes as defined by --mail-type.
--gres=<list>
Specifies a comma delimited list of generic consumable resources. The format of each entry on the list is "name[[:type]:count]". The specified resources will be allocated to the job on each node.

For example:

--gres=gpu:1 - This will request one GPU per Node.
--gres=gpu:2 - This will request two GPUs per Node.


This is just a small part of all possible options. A complete list with explanations can be found on the Slurm homepage.
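
To illustrate the time formats accepted by --time, the following sketch converts such a value to seconds. slurm_time_to_seconds is a hypothetical helper written for this page, not a Slurm command, and it does not validate its input.

```shell
#!/bin/bash

# Hypothetical helper (not part of Slurm): convert a --time value to seconds.
# Handles: minutes, minutes:seconds, hours:minutes:seconds,
#          days-hours, days-hours:minutes, days-hours:minutes:seconds
slurm_time_to_seconds() {
    local spec=$1 days=0 hours=0 minutes=0 seconds=0
    local -a f
    if [[ $spec == *-* ]]; then
        days=${spec%%-*}
        # after the dash the fields are always hours[:minutes[:seconds]]
        IFS=: read -r hours minutes seconds <<< "${spec#*-}"
    else
        IFS=: read -r -a f <<< "$spec"
        case ${#f[@]} in
            1) minutes=${f[0]} ;;
            2) minutes=${f[0]}; seconds=${f[1]} ;;
            3) hours=${f[0]}; minutes=${f[1]}; seconds=${f[2]} ;;
            *) return 1 ;;
        esac
    fi
    # 10# forces base 10 so that values like "08" are not read as octal
    echo $(( 10#$days * 86400 + 10#${hours:-0} * 3600 \
           + 10#${minutes:-0} * 60 + 10#${seconds:-0} ))
}

slurm_time_to_seconds 0-2:00   # the limit from the example job script: 7200 seconds
```

Note that a single number without a dash means minutes, while the number after a dash means hours, so --time=2 and --time=0-2 are very different limits.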

Job Arrays

SLURM also allows the use of job arrays, i.e. jobs that are executed multiple times and, if resources are available, in parallel. Each array task works on an independent piece of work; the typical use case is a parameter study. A job array is created with the SLURM option

#SBATCH --array=1-100:2%10    # alternative -a <array-specs> for short

where in this example the job creates 50 array tasks with task ids 1, 3, ..., 97, 99. A maximum of 10 tasks is executed in parallel here.

More generally, the array is specified with <array-specs>, which has the format

<range>:<step>%<tasklimit>

where <range> is e.g. 1-100, 100-199 or 4 (short for 4-4), <step> is the increment and <tasklimit> is the limit of tasks running in parallel. The latter can be used to avoid blocking too many resources with a single job array. Be kind to others! Multiple ranges can be combined in a comma-separated list (e.g. 1-9:2,2-10:2 instead of 1-10).
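
To check which task IDs a given specification produces before submitting, a small sketch like the following can help. expand_array_spec is a hypothetical helper, not a Slurm command; it ignores the %<tasklimit> part because that only limits parallelism and does not change the IDs.

```shell
#!/bin/bash

# Hypothetical helper (not part of Slurm): print the task IDs described by an
# array specification such as "1-100:2%10" or "1-9:2,2-10:2", one per line.
expand_array_spec() {
    local spec=${1%%\%*}          # drop the optional %<tasklimit> suffix
    local part range step first last id
    local -a parts
    IFS=, read -r -a parts <<< "$spec"
    for part in "${parts[@]}"; do
        step=1
        range=$part
        if [[ $part == *:* ]]; then
            step=${part#*:}       # text after the colon is the increment
            range=${part%%:*}
        fi
        first=${range%%-*}
        last=${range##*-}         # equals first for a single number like "4"
        for (( id = first; id <= last; id += step )); do
            echo "$id"
        done
    done
}

# The two combined ranges together yield the IDs 1 to 10
expand_array_spec '1-9:2,2-10:2' | sort -n | tr '\n' ' '
echo
```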

Note: SLURM restricts the size of job arrays and also the maximum value in a range to 1000. This is a rather severe limit for some users, but there are some workarounds.
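
One possible workaround (an assumption on our part, not necessarily the one meant above) is to let each array task process a block of several work items, so that fewer array tasks are needed. The helper chunk_range and the numbers below are hypothetical and only sketch the index arithmetic.

```shell
#!/bin/bash

# Hypothetical helper: print "first last", the work-item indices that the
# given array task should process when every task handles chunk_size items.
chunk_range() {
    local task_id=$1 chunk_size=$2 total=$3
    local first=$(( (task_id - 1) * chunk_size + 1 ))
    local last=$(( task_id * chunk_size ))
    (( last > total )) && last=$total    # the final chunk may be shorter
    echo "$first $last"
}

# Example: 2500 work items with 10 items per task need only --array=1-250.
# Outside of a job array the task ID falls back to 1.
read -r first last <<< "$(chunk_range "${SLURM_ARRAY_TASK_ID:-1}" 10 2500)"
for (( item = first; item <= last; item++ )); do
    :   # placeholder: process work item number $item here
done
```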

When working with an array, the variable $SLURM_ARRAY_TASK_ID will be set for each single task. This way you have a changing variable for each task, which you can use to influence the parameters you use to perform calculations within the script.

An array script could look like this:

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --time=0-2:00   
#SBATCH --output=slurm.%A_%a.out   # %A becomes the job ID, %a becomes the array index
#SBATCH --error=slurm.%A_%a.err
#SBATCH --array=1-20%5   # run 20 array tasks, at most 5 of them in parallel

# Your calculations 
# for each task

sleep $SLURM_ARRAY_TASK_ID   # the task's work is to sleep for a number of seconds based on the task's ID


In the script above, a very simple example is used for the actual work done in each task. Here are a few examples of how $SLURM_ARRAY_TASK_ID can be used in more complex ways:

  • If you have to run your program process_file on a large number of input files (e.g. with names like filename011.inp), you can create these numbered filenames based on the task id with
input_file=$(printf "filename%03d.inp" "$SLURM_ARRAY_TASK_ID")   # e.g. filename011.inp for task 11
process_file < "$input_file"


  • Integer operations can be done natively in bash, e.g. if you want to set an input parameter to 5, 10, 15, ..., 100, you can simply use
parameter=$((SLURM_ARRAY_TASK_ID*5)) # with task ids 1,2,..,20
mycode -p $parameter                 # run mycode with parameter 


  • Floating-point operations can be done with a trick using awk (there are other ways, too), so if the parameter should be 0.05, 0.10, 0.15, ..., 1.00, use
fltpar=$(echo $SLURM_ARRAY_TASK_ID $SLURM_ARRAY_TASK_MAX | awk '{printf "%.2f", $1/$2}') # task id becomes a fraction of the maximum task id
mycode -f $fltpar                                                                        # run mycode with parameter


  • If you have a file that contains one line of parameters for each run you want to do, you can also use awk
parameter=$(awk -v id="$SLURM_ARRAY_TASK_ID" 'NR==id {print $2}' parameter.dat)
mycode -p $parameter

This takes the parameter from the second column in parameter.dat; if you want the whole line, print $0 instead. The awk program is single-quoted so that the shell does not expand $2 itself before awk sees it.
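
The parameter-file approach can be tried outside of a job array with a sketch like the following. The file contents, the helper name get_param and the fallback task ID are made up for illustration.

```shell
#!/bin/bash

# Create a small example parameter file (hypothetical contents):
# column 1 is a run label, column 2 the parameter value.
param_file=$(mktemp)
printf 'run1 0.5\nrun2 1.0\nrun3 1.5\n' > "$param_file"

# Hypothetical helper: print column $3 of line $2 in file $1.
# Line and column are passed with -v, so the awk program can stay single-quoted.
get_param() {
    awk -v line="$2" -v col="$3" 'NR == line { print $col }' "$1"
}

task_id=${SLURM_ARRAY_TASK_ID:-2}   # fall back to 2 when run outside of a job array
parameter=$(get_param "$param_file" "$task_id" 2)
echo "task $task_id uses parameter $parameter"
```

Setting a fallback value for the task ID makes it possible to test the script on a login node before submitting the whole array.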

Possible source of error

There is one particular error, not just in job scripts but in scripts in general, which our support team encounters every once in a while: scripts that fail with errors when passed to bash. This mostly happens when scripts are written on Windows systems and then transferred to the cluster. In these cases, the most probable cause is that the file uses Windows line endings (CRLF), which bash on Linux cannot process. Should you encounter an error like the following, the file very likely has this format:
slurmstepd: error: execve():bad interpreter(/bin/bash): no such file or directory

To solve this, run the command dos2unix on the corresponding script file. With file you can check whether the script uses DOS line endings and whether that changed after the conversion:

$ file test_win.sh
test_win.sh: Bourne-Again shell script, ASCII text executable, with CRLF line terminators   # "with CRLF line terminators" indicates the wrong file format
$ dos2unix test_win.sh
dos2unix: converting file test_win.sh to Unix format...
$ file test_win.sh
test_win.sh: Bourne-Again shell script, ASCII text executable

Now you should be able to run the script with bash / slurm.
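
If dos2unix is not at hand, the same check and conversion can be sketched with grep and sed. has_crlf and strip_crlf are hypothetical helper names, and the sketch assumes GNU grep and GNU sed as found on typical Linux systems.

```shell
#!/bin/bash

# Hypothetical helpers; they assume GNU grep and GNU sed.

# Succeeds if at least one line of the file ends with a carriage return (CRLF).
has_crlf() {
    grep -q $'\r$' "$1"
}

# Removes a trailing carriage return from every line, editing the file in place.
strip_crlf() {
    sed -i 's/\r$//' "$1"
}

# Demo with a temporary script that was "written on Windows"
demo=$(mktemp)
printf '#!/bin/bash\r\necho hello\r\n' > "$demo"
if has_crlf "$demo"; then
    echo "CRLF line endings found, converting"
    strip_crlf "$demo"
fi
bash "$demo"   # runs cleanly after the conversion
```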

Documentation

If you want to learn more about the SLURM Management System you can visit the documentation page on the official homepage of SLURM.