How to Manage Many Jobs

From HPC users
Revision as of 14:10, 16 August 2019

Introduction

Often you need to run many nearly identical jobs. There are several ways to do this with minimal effort, and some of them are described below. Which approach suits you best depends on the nature of your problem, but some hints are given to help with the choice.

The examples below use a simple program that decides whether a number is prime. The example program can be found [media:ManyTasks.tgz here]. To use it, download the archive and then run:

$ tar -zxvf ManyTasks.tgz             # to unpack
$ cd ManyTasks                        # go to directory
$ make                                # build executable

After that, you can run the program, e.g. with

$ ./isprime 73
yes

to see whether a number, in this case 73, is prime (it is). The idea of the example is to run isprime on every number in parameter.dat, which lists one number per line. If we want to run this on the cluster, we can think of it as many jobs that are identical except for one parameter. In this context, a single such job is also called a task.

Managing many Tasks in a Single Job

The first approach to run all the required tasks of the example is a single job script. In the job script, we can use a loop to run all the tasks:

#!/bin/bash

### SLURM options (others can be added, e.g. for memory)
#SBATCH --partition carl.p

# loop over tasks (reads parameter.dat line by line into variable p)
while read -r p
do
  echo -n "Testing if $p is prime? "
  ./isprime "$p"
done < parameter.dat

This approach has the disadvantage that only one job runs on the cluster and the tasks are executed one after another. However, if the individual tasks are very short (less than a few minutes each) and the number of tasks is not too large (less than 100), this approach can be useful.

Managing many Tasks in a Job Array

As an alternative to a loop in the job script, you could use a loop to submit many individual jobs, one for each task. However, this would put a lot of strain on the job scheduler (which can be reduced with a sleep 1 after each submission). SLURM provides job arrays as a better alternative.
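For comparison, the submission loop mentioned above could look roughly like the following sketch. It is a dry run by default: the helper function submit (a hypothetical name) only prints the sbatch command it would issue, and demo_parameters.dat is a small stand-in for parameter.dat created just for this illustration. Using the --wrap option of sbatch to avoid a separate job script per task is an assumption made for brevity, not necessarily how you would set this up.

```shell
#!/bin/bash
# Sketch only: submit one job per parameter (job arrays are the better choice).

# Hypothetical helper: prints the command instead of submitting it (dry run).
# To really submit, replace the body with: sbatch "$@"
submit() {
  echo sbatch "$@"
}

# Small stand-in for parameter.dat, created only for this illustration.
printf '%s\n' 2 15 73 > demo_parameters.dat

count=0
while read -r p
do
  submit --partition carl.p --wrap="./isprime $p"
  count=$((count + 1))
  sleep 1   # short pause to reduce the load on the scheduler
done < demo_parameters.dat

echo "Prepared $count submissions"
```

Even with the pause, every task is a separate submission that the scheduler has to handle on its own, which is the strain that job arrays avoid.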

To run our example as a job array, we can use the following job script:

#!/bin/bash

### SLURM options (others can be added, e.g. for memory)
#SBATCH --partition carl.p
#SBATCH --array 1-100

# get parameter from file for each task
# (the task ID is passed with -v so that the shell does not expand $1)
p=$(awk -v n="$SLURM_ARRAY_TASK_ID" 'NR == n {print $1}' parameter.dat)

# run task
echo -n "Testing if $p is prime? "
./isprime "$p"

Note how the SLURM environment variable SLURM_ARRAY_TASK_ID is used (in combination with awk) to read a specific line from the parameter file. No loop is needed because SLURM automatically creates an individual job for each task defined by the array option.
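The line lookup can be tried outside of SLURM by setting the variable by hand. The following is a minimal sketch under that assumption: demo_parameters.dat is a hypothetical stand-in for parameter.dat, and the task ID is passed to awk with -v so that the shell does not touch the $1 inside the awk program.

```shell
#!/bin/bash
# Stand-in parameter file, created only for this illustration.
printf '%s\n' 2 15 73 > demo_parameters.dat

# Inside a job array SLURM sets this variable; here it is set by hand.
SLURM_ARRAY_TASK_ID=3

# Pass the task ID to awk with -v and print the first field of that line.
p=$(awk -v n="$SLURM_ARRAY_TASK_ID" 'NR == n {print $1}' demo_parameters.dat)

echo "Task $SLURM_ARRAY_TASK_ID would test the number $p"
```

Here task 3 picks up the third line of the file, so p is set to 73.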

This approach is much cleaner than the loop-based approach above and is recommended for most problems of this nature. Note, however, that each job incurs a small overhead for scheduling, starting and completing. Individual tasks should therefore run for more than a few minutes (unlike in the example). Furthermore, you should always make sure not to run too many tasks at the same time on the cluster, e.g. by limiting the number of simultaneously running tasks of your array.
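One way to limit the array is the % separator of the array option, which caps how many array tasks run at the same time. The header below is a sketch that assumes 100 tasks, as in the example; the limit of 10 is an arbitrary illustrative value, not a recommendation for this cluster.

```shell
#!/bin/bash

### SLURM options (others can be added, e.g. for memory)
#SBATCH --partition carl.p
# 100 array tasks in total, at most 10 of them running at the same time
#SBATCH --array 1-100%10
```

With this setting, further array tasks only start as running ones finish. The rest of the job script stays the same.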

More details about job arrays are described [[SLURM_Job_Management_(Queueing)_System#Job_Arrays|here]].