How to Manage Many Jobs

From HPC users

Introduction

Often, you may need to run many nearly identical jobs. There are different approaches to achieving this with minimal effort, and some of them are described below. Which approach best suits your needs depends on the nature of your problem, but some hints are given to help you make the choice.

In the examples below, a simple program that decides whether a number is prime will be used. The example program can be found here; in order to use it, download it and then run

$ tar -zxvf ManyTasks.tgz             # to unpack
$ cd ManyTasks                        # go to directory
$ make                                # build executable

After that, you can run the program, e.g. with

$ ./isprime 73
yes

to see whether a number, in this case 73, is prime (yes, it is). The idea of the example is to run isprime on every number in parameter.dat. If we want to run this as a job, we can therefore think of it as many jobs that are identical except for one parameter. In this context, a single such job is also called a task.

Managing many Tasks in a Single Job

The first approach to run all the required tasks of the example is a single job script. In the job script, we can use a loop to run all the tasks:

#!/bin/bash

### SLURM options (others can be added, e.g. for memory)
#SBATCH --partition carl.p

# loop over tasks (reads parameter.dat line by line into variable p)
cat parameter.dat | while read p
do
  echo -n "Testing if $p is prime? "
  ./isprime $p
done

This approach has the disadvantage that only one job is running on the cluster and the tasks are executed serially. However, if the individual tasks are very short (less than a few minutes, say) and the number of tasks is not too large (less than 100), this approach might be useful. Note that in a bash script you can also use for-loops, e.g.

...
for ((i=1; i<=100; i++))
do
  # use $i to e.g. read a line from parameter.dat
  ... 
done 
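A complete variant of this for-loop could look like the following sketch; the use of sed to extract line i is an illustration, and the script assumes the parameter.dat file and isprime executable from the example are present in the working directory:

```shell
#!/bin/bash
# Sketch: loop over line numbers and extract each line with sed.
# Assumes parameter.dat and ./isprime from the example exist here.
for ((i=1; i<=100; i++))
do
  p=$(sed -n "${i}p" parameter.dat)   # print only line i of the file
  echo -n "Testing if $p is prime? "
  ./isprime "$p"
done
```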

Managing many Tasks in a Job Array

As an alternative to a loop in the job script, you could use a loop to submit many individual jobs, one for each task. However, this would put a lot of strain on the job scheduler (which can be reduced with a sleep 1 after each submission), and in fact SLURM provides job arrays as a better alternative.

To run our example in the form of a job array, we can use the job script:

#!/bin/bash

### SLURM options (others can be added, e.g. for memory)
#SBATCH --partition carl.p
#SBATCH --array 1-100

# get parameter from file for each task
p=$(awk "NR==$SLURM_ARRAY_TASK_ID {print \$1}" parameter.dat)

# run task
echo -n "Testing if $p is prime? "
./isprime $p

Note how the SLURM environment variable SLURM_ARRAY_TASK_ID can be used (in combination with awk) to read a certain line from the parameter file. Also, no loop is needed, as SLURM automatically creates an individual job for each task defined by the --array option.

This approach is much cleaner than the loop-based approach before and is recommended for most problems of this nature. However, it should be noted that each job creates a small overhead for scheduling, starting, and completing the job. Therefore, individual tasks should run for more than a few minutes (unlike the example); if needed, you can combine the job array and loop approaches by bundling several tasks into a single array task. Furthermore, you should always make sure not to run too many array tasks at the same time on the cluster, e.g. by limiting your array with the %-separator (--array 1-100%10 allows at most ten array tasks to run simultaneously).
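The combination of job array and loop mentioned above can be sketched as follows; the chunk size of ten lines per array task is an assumption for illustration:

```shell
#!/bin/bash

### SLURM options (others can be added, e.g. for memory)
#SBATCH --partition carl.p
#SBATCH --array 1-10             # 10 array tasks x 10 lines each = 100 tasks

# each array task processes a block of CHUNK consecutive lines
CHUNK=10
START=$(( (SLURM_ARRAY_TASK_ID - 1) * CHUNK + 1 ))
END=$(( START + CHUNK - 1 ))

for i in $(seq $START $END)
do
  p=$(awk -v n="$i" 'NR==n {print $1}' parameter.dat)
  echo -n "Testing if $p is prime? "
  ./isprime "$p"
done
```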

More details about job arrays are described here.

Managing Many Tasks with the Parallel Command

The third approach uses the shell tool GNU Parallel which is available as a module on the cluster, e.g.:

$ module load hpc-env/6.4
$ module load parallel

The parallel command can be used to execute tasks in parallel, with the tool managing the available resources. Within a SLURM job, it can be combined with srun to execute the tasks on a subset of the resources allocated for the job. To achieve this, we should first define a bash script for a single task that takes one or more arguments. For our example, we can write prime_task.sh as follows:

#!/bin/bash
#
# script to run a single task to test number given as arg $1 for being prime
# additional random wait time to mimic different run times

# wait for a random time (10-20s)
twait=$((($RANDOM % 11) + 10))
sleep $twait

# get parameter from file based on argument $1
p=$(awk "NR==$1 {print \$1}" parameter.dat)

# test p for being prime
echo -n "Testing if $p is prime? "
./isprime $p

We have expanded the example a little by adding a random wait time of 10 to 20 seconds to mimic different run times for each task, but otherwise it is the same script as in the job array approach. We can run this script, e.g.

$ chmod a+x prime_task.sh
$ ./prime_task.sh 80
Testing if 73 is prime? yes

The output appears after a few seconds due to the extra sleep command. Now we can use parallel to execute e.g. a total of 20 tasks (given as the range {1..20}) with 10 tasks running concurrently (option -j <ntasks>) with the command:

$ parallel -N 1 -j 10 --joblog parallel.log ./prime_task.sh {1} ::: {1..20}
Testing if 70 is prime? no
Testing if 7 is prime? yes
Testing if 69 is prime? no
...

In this command, the number of tasks is determined by the range given after the :::-separator. Each value from the given range is passed as argument {1} to the shell script (the option -N 1 tells parallel to pass one argument to each task). With the option --joblog you will get a log file with information about the execution of each task. Please refer to the parallel documentation (e.g. the man pages) for more examples on how to use the tool.
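The joblog written by --joblog is a small tab-separated table with one row per task, including an exit-value column. Assuming the default joblog layout (with Exitval in column 7 and Command in column 9), failed tasks can be listed with a short awk one-liner, e.g. to decide which tasks to rerun:

```shell
# List the commands of all tasks that exited with a non-zero status.
# Assumes the default joblog layout: tab-separated columns, header in
# line 1, Exitval in column 7, Command in column 9.
awk -F '\t' 'NR > 1 && $7 != 0 {print "failed:", $9}' parallel.log
```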

To use parallel within a job script, we can combine it with srun to distribute the tasks over the allocated resources. The following job script gives you a basic example of how to do this:

#!/bin/bash

# SLURM options
#SBATCH --partition carl.p
#SBATCH --nodes 2                # specify the number of nodes
#SBATCH --ntasks-per-node 24     # and use all cores on each node

# Define srun arguments:
srun="srun -n1 -N1 --exclusive"
# --exclusive     ensures srun uses distinct CPUs for each job step
# -N1 -n1         allocates a single core to each task

# Define parallel arguments:
parallel="parallel -N 1 --delay .2 -j $SLURM_NTASKS --joblog parallel_job.log"
# --delay .2        prevents overloading the controlling node on short jobs
# --resume          add if needed to use joblog to continue an interrupted run (job resubmitted)

# Run the tasks for 100 lines in parameter.dat
$parallel "$srun ./prime_task.sh {1}" ::: {1..100}

The advantage of this approach is that you can run your tasks in a single job with less overhead compared to a job array (important if the tasks are short-running) and you can make efficient use of the allocated resources (as parallel can handle different run times of the tasks). It can also be more friendly to other users in certain situations (large job arrays sometimes tend to fill up the cluster).

This guide uses material from [1]. Note also that the use of parallel should be cited; see

$ parallel --citation

Memory Allocation

As you may have noticed, no memory request was included in the job script above. The reason is simply that by requesting --ntasks-per-node 24 (the maximum number of tasks/CPU cores) you automatically get all the available memory on that node as well, because the default memory per CPU multiplied by the number of cores covers the whole node. However, in case you explicitly need or want to request memory, you must use e.g.

#SBATCH --mem-per-cpu 2G

and not the --mem option. The reason is that the memory request is also passed to the srun command starting the tasks, and with --mem every task would use all of the allocated memory on a node. As a result, only a single task per node would be executed at any given time.

If you use the --mem-per-cpu option, you also need to make sure that the total requested memory is available on a single node. For example, if you ask for 20G per CPU in the partition mpcs.p, you can only have --ntasks-per-node 12. Luckily, sbatch should return an error when the requested node configuration is not available.
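A job header following this rule could look like the following sketch; the combination of 12 tasks per node at 20G each is taken from the mpcs.p example above:

```shell
### SLURM options: per-CPU memory request for use with parallel/srun
#SBATCH --partition mpcs.p
#SBATCH --nodes 2
#SBATCH --ntasks-per-node 12     # 12 tasks x 20G must fit into one node
#SBATCH --mem-per-cpu 20G        # use --mem-per-cpu, not --mem
```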

Long Run Times

In some situations, the total run time of a job using the parallel command can become quite long. For example, assume a single task takes one day to complete and you want to run 1000 tasks using

 #SBATCH --ntasks 40

then the total run time of this job would be 25 days. The problem here is that the time limit on the cluster is only 21 days, so this job could never start. Of course, a simple solution is to increase ntasks to 50, which reduces the run time to 20 days. If you want to stick with ntasks=40, you can instead split the job into two jobs, one using the range {1..500} and the other using {501..1000}. This has the advantage that both jobs finish in about 13 days each, and, if enough resources are available, they might even run in parallel.

The splitting of a single job with the parallel command into two or more jobs can be automated to some extent by using a job array again. This might be confusing at first, as the main idea of the parallel command is to avoid very big job arrays, but in this case the job array will be much smaller and the individual array jobs will be larger. Assuming you want to split the job above into two jobs, you can first add the line

 #SBATCH --array 0-1              # if you want N array jobs, range should be 0-(N-1)

Here, it is important that the array starts with the ID 0 and that the final ID is N-1. Next, you can add a few lines that take care of calculating the range for the parallel command:

 # Define range for array task
 MINTASK=1                     # beginning of original range
 MAXTASK=1000                  # end of original range
 TASKS_PER_ARRAY_JOB=$(((MAXTASK-MINTASK+1)/(SLURM_ARRAY_TASK_MAX+1)))
 PARALLEL_BEGIN=$((SLURM_ARRAY_TASK_ID*TASKS_PER_ARRAY_JOB+1))
 PARALLEL_END=$((PARALLEL_BEGIN+TASKS_PER_ARRAY_JOB-1))
 PARALLEL_RANGE=$(seq $PARALLEL_BEGIN $PARALLEL_END)

This computes how many tasks are done in each of the array jobs and sets the range based on the ID of the array job. The calculation assumes that the division leaves no remainder. For more complicated situations, a different approach, e.g. reading the ranges from a file, might be better. Finally, we can adjust the parallel command to

 parallel="parallel -N 1 --delay .2 -j $SLURM_NTASKS --joblog parallel_job_${SLURM_ARRAY_TASK_ID}.log"

to make sure each array job uses a different log file, and then execute the tasks in parallel:

 $parallel "$srun ./task.sh {1}" ::: $PARALLEL_RANGE

With this approach, a very large job array can be turned into a much smaller one that utilizes a fixed number of nodes and whose run time can be adjusted easily.
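Putting the pieces together, a complete job script for this approach might look like the following sketch; the partition, node counts, and the generic task.sh script are assumptions based on the examples above, and the range calculation matches the 1000-task example:

```shell
#!/bin/bash

### SLURM options
#SBATCH --partition carl.p
#SBATCH --nodes 2
#SBATCH --ntasks-per-node 24
#SBATCH --array 0-1              # for N array jobs, the range is 0-(N-1)

# Define the range of tasks for this array job
# (assumes the division leaves no remainder)
MINTASK=1
MAXTASK=1000
TASKS_PER_ARRAY_JOB=$(((MAXTASK-MINTASK+1)/(SLURM_ARRAY_TASK_MAX+1)))
PARALLEL_BEGIN=$((SLURM_ARRAY_TASK_ID*TASKS_PER_ARRAY_JOB+1))
PARALLEL_END=$((PARALLEL_BEGIN+TASKS_PER_ARRAY_JOB-1))
PARALLEL_RANGE=$(seq $PARALLEL_BEGIN $PARALLEL_END)

# srun and parallel commands, with a separate joblog per array job
srun="srun -n1 -N1 --exclusive"
parallel="parallel -N 1 --delay .2 -j $SLURM_NTASKS --joblog parallel_job_${SLURM_ARRAY_TASK_ID}.log"

# run the tasks of this array job
$parallel "$srun ./task.sh {1}" ::: $PARALLEL_RANGE
```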