How to Share Fair

From HPC users
Revision as of 12:30, 3 February 2021 by Harfst

Introduction

The HPC cluster is a shared resource used by many researchers (more than 150 per year). Unfortunately, it is also a limited resource, so it is important that it is shared in a fair way. Fair sharing is one of the tasks of the job scheduler Slurm. However, you can also do your part as a user of the cluster.

The Scheduler

Slurm has a rather complex built-in fair-share mechanism, which takes many factors into account. On CARL and EDDY, we have enabled two main factors: fair-share and wait time. The first one, fair-share, is based on the amount of computing time a user's jobs have consumed in the past couple of weeks. The second one is based on the time a job has been waiting in the queue. If two users have used the same amount of computing time, then the job that was submitted earlier will also start earlier. But if one user has not used the cluster for a while, their jobs can start before jobs that were submitted earlier. The scheduler also considers the job size and makes sure that resources are kept free until a large job can start. Last but not least, the scheduler uses back-filling, which means that small and short-running jobs are started earlier to fill the gaps between larger and long-running jobs.
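To make the interplay of the two factors more concrete, here is a minimal sketch of a two-factor priority. It is only loosely modeled on Slurm's multifactor priority and is not the configuration used on CARL and EDDY: the weights, the 168-hour saturation time, the halving formula for the fair-share factor, and all function names are assumptions chosen for illustration.

```python
# Toy model of a two-factor job priority.
# All weights, limits and formulas here are hypothetical illustration values.

def age_factor(wait_hours, max_age_hours=168.0):
    """Grows linearly with the wait time and saturates at 1.0."""
    return min(wait_hours / max_age_hours, 1.0)


def fairshare_factor(usage_fraction, share_fraction):
    """1.0 for no recent usage, 0.5 when recent usage equals the user's share."""
    return 2.0 ** (-usage_fraction / share_fraction)


def priority(wait_hours, usage_fraction, share_fraction,
             w_age=1000, w_fairshare=10000):
    """Weighted sum of the two factors; a higher value starts earlier."""
    return (w_age * age_factor(wait_hours)
            + w_fairshare * fairshare_factor(usage_fraction, share_fraction))


# Two users with the same recent usage: the job submitted earlier ranks higher.
early = priority(wait_hours=48, usage_fraction=0.10, share_fraction=0.10)
late = priority(wait_hours=2, usage_fraction=0.10, share_fraction=0.10)

# A user without recent usage overtakes both, despite submitting last.
idle_user = priority(wait_hours=2, usage_fraction=0.0, share_fraction=0.10)

print(idle_user > early > late)
```

With these assumed weights, fair-share dominates the wait time, which mirrors the behavior described above: a user who has not run jobs recently can overtake jobs that have been queued for longer.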

Unfortunately, the scheduler struggles when jobs differ widely in their resource requirements (single-core vs. many-core, low vs. high memory) and when jobs are allowed to run for a very long time. This is, of course, exactly the situation on CARL. It is also the reason why scheduling policies at other supercomputing centers (e.g. HLRN) are much more restrictive (allocation of full nodes only, runtimes limited to 12-24 h). However, there are good reasons to keep scheduling more flexible on CARL.
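One way to see why very long runtimes are a problem is to look at back-filling as a simple fit test. The sketch below is a strongly simplified assumption and not Slurm's actual algorithm: a waiting job may jump ahead only if it fits into the currently idle cores and its requested time limit ends before the reserved large job is due to start. The function and parameter names are hypothetical.

```python
# Simplified back-fill check; not Slurm's real implementation.

def can_backfill(job_cores, job_time_limit_h, free_cores, hours_until_reserved_start):
    """True if the job fits into the idle cores and, according to its
    requested time limit, finishes before the reserved large job starts."""
    fits_in_space = job_cores <= free_cores
    fits_in_time = job_time_limit_h <= hours_until_reserved_start
    return fits_in_space and fits_in_time


# 8 cores are idle for the next 6 hours until a large job is due to start.
short_job = can_backfill(job_cores=4, job_time_limit_h=2,
                         free_cores=8, hours_until_reserved_start=6)
long_job = can_backfill(job_cores=4, job_time_limit_h=240,
                        free_cores=8, hours_until_reserved_start=6)

print(short_job, long_job)
```

In this toy model the job with a 2-hour limit can fill the gap, while the same job with a 240-hour limit cannot, even though both need the same number of cores.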

Fair Job Submission