How to Share Fair

Introduction

The HPC cluster is a shared resource used by many researchers (more than 150 per year). Unfortunately, it is also a limited resource, so it is important that the sharing is done in a fair way. Fair sharing is one of the tasks of the job scheduler Slurm; however, you can also do your part as a user of the cluster.

The Scheduler

Slurm has a rather complex fair-share mechanism built in, which takes many factors into account. On CARL and EDDY, we have enabled two main factors: fair-share and wait time. The first one, the fair-share factor, is based on how much computing a user's jobs have done in the past couple of weeks. The second one is based on the wait time in the queue. If two users have used the same amount of computing time, then the job that was submitted earlier will also start to run earlier. But if one user has not used the cluster for a while, his or her jobs can start before jobs that were submitted earlier. The scheduler also considers the job size and makes sure that resources are kept free until a large job can start. And last but not least, the scheduler uses back-filling, which means that small and short-running jobs are started earlier to fill the gaps between larger and long-running jobs.
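
You can check how these factors come out for your own jobs with standard Slurm tools. A minimal sketch (account names, partitions, and the exact columns shown depend on the local configuration):

  # show your recent usage and the resulting fair-share factor
  sshare -u $USER

  # show the priority factors (fair-share, age/wait time, job size) of your pending jobs
  sprio -u $USER -l

  # ask the scheduler for an estimated start time of your pending jobs
  squeue --start -u $USER

A low fair-share value means that you have used more than your share recently, so your pending jobs will be ranked behind those of less active users.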

Unfortunately, the scheduler struggles if jobs differ widely in their resource requirements (single-core vs. many-core, low vs. high memory) and if jobs are allowed to run for very long times. This is, of course, exactly the situation on CARL. It is also the reason why other supercomputing centers (e.g. HLRN) have much more restrictive scheduling policies (allocation of full nodes only, runtimes limited to 12-24 h). However, there are good reasons to keep scheduling on CARL more flexible.

Fair Job Submission

As a user, you can help to improve the fair sharing of the limited HPC resources. Here are some guidelines that might help:

  • Only request the resources you really need: Typically, you will request one or more cores (using the options --ntasks and/or --cpus-per-task), and you should make sure that your job then uses all the requested cores. Jobs containing different steps of a workflow should be split into separate jobs if the steps need different numbers of cores; you can use job dependencies to chain them (see the sketch below).
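
As a sketch of the last point, a workflow with a serial pre-processing step followed by a parallel analysis step could be split into two job scripts that are chained with a dependency, so that the unused cores do not sit idle during the serial step. The file names and program calls (preprocess.job, analysis.job, prepare_input.sh, parallel_analysis) are only placeholders:

  #!/bin/bash
  # preprocess.job -- serial step, needs only a single core
  #SBATCH --ntasks=1
  #SBATCH --time=0-02:00
  ./prepare_input.sh        # placeholder for the actual serial step

  #!/bin/bash
  # analysis.job -- parallel step, uses all 16 requested cores
  #SBATCH --ntasks=16
  #SBATCH --time=1-00:00
  srun ./parallel_analysis  # placeholder for the actual parallel step

  # submit the chain: the second job starts only after the first one finished successfully
  jobid=$(sbatch --parsable preprocess.job)
  sbatch --dependency=afterok:$jobid analysis.job

This way each step only blocks the cores it can actually use, instead of one job holding all 16 cores for the whole workflow.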