How to Share Fair

Introduction

The HPC cluster is a shared resource used by many researchers (more than 150 per year). Unfortunately, it is also a limited resource, so it is important that it is shared in a fair way. Fair sharing is one of the tasks of the job scheduler Slurm, but you can also do your part as a user of the cluster.

The Scheduler

Slurm has a rather complex fair-share mechanism built in, which takes many factors into account. On CARL and EDDY, we have enabled two main factors: fair-share and wait time. The first one, fair-share, is based on how much a user has been running on the cluster in the past couple of weeks. The second one is based on the wait time in the queue: if two users have used the same amount of computing time, the job that was submitted earlier will also start earlier. But if one user has not used the cluster for a while, their jobs can start before jobs that were submitted earlier. The scheduler also considers the job size and makes sure that resources are kept free until a large job can start. Last but not least, the scheduler uses back-filling, which means that small, short-running jobs are started earlier to fill the gaps between larger, long-running jobs.
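
To see how these factors affect your own jobs, you can ask Slurm directly. A minimal sketch, assuming the sprio and sshare commands are available to regular users on the cluster:

  # show the factors (fair-share, age, job size) behind the priority of your pending jobs
  sprio -u $USER -l

  # show your recorded usage and the resulting fair-share value
  sshare -U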

Unfortunately, the scheduler struggles when jobs differ widely in their resource requirements (single-core vs. many-core, low vs. high memory) and when jobs are allowed to run for very long times. This is, of course, exactly the situation on CARL. It is also the reason why the scheduling policies at other supercomputing centers (e.g. HLRN) are much more restrictive (allocation of full nodes only, runtimes limited to 12-24 h). However, there are good reasons to have more flexible scheduling on CARL.

Fair Job Submission

As a user, you can help to improve the fair sharing of the limited HPC resources. The following guidelines may be useful:

  • Only submit jobs that you need to run: The use of the HPC cluster is essentially free of charge for its users, but one should not forget that providing the resources costs money. Accordingly, you should use the available resources sparingly and only start computations that are useful for your own research. This also includes smaller test calculations. If, for example, you vary a value a hundred times in a parameter study, consider beforehand whether 50 variations might already be enough to achieve the same result.
  • Only request the resources you really need: Typically, you will request one or more cores (using the options --ntasks and/or --cpus-per-task), and you should make sure that your job actually uses all the requested cores. Jobs containing different steps of a workflow should be split up if the steps use different numbers of cores; you can use job dependencies for this (see the first sketch after this list). In addition to cores, you may also request memory with --mem (per node) or --mem-per-cpu (per core). If you request more than the default memory of a partition (e.g. 5000M per core in carl.p), you should check afterwards whether your job really used that much memory.
  • Try to limit the number of nodes you use: For MPI-parallel jobs, the recommended settings are --nodes and --ntasks-per-node=24, so if possible the total number of tasks should be a multiple of 24. Single-core jobs, on the other hand, can increase wait times for everyone if too many of them are scattered across the system. If you have to submit many of them at the same time (which is often the case), they should therefore be packed so that they utilize complete nodes. This can be achieved with the parallel command as explained here in section 4 (see the second sketch after this list).
  • Make sure your job runs at optimal performance: Whether an application runs with optimal performance on the cluster is often difficult to judge, because a number of factors play a role. First of all, the application should be built with the most recent compiler available and with compiler optimizations enabled. Wherever possible, the available numerical libraries such as LAPACK, FFTW, or MKL should be used; the applications provided in the modules fulfill these requirements. For parallel applications, benchmarks should be used to find out how well the application scales on the system and up to which number of cores parallel computation remains efficient. For I/O, $WORK should preferably be used.
  • Limit the resources you use: If you have a large project with many big simulations, be kind to others and monitor how many of your jobs are running. About 150 users are active on the cluster, which means that on average each user can use about two compute nodes. Of course, not all users are active at the same time, so using 10 or 20 nodes is not a problem, but using 100 nodes can be. And using 20 nodes for one day is better than blocking 20 nodes for three weeks. To limit yourself, you can submit jobs with dependencies or as job chains; job arrays have an extra setting to limit the number of actively running tasks (see the third sketch after this list).
  • Use external HPC resources: More HPC resources are available at other German (or even European) supercomputing centers. The next step, so to speak, for HPC users from Oldenburg is the HLRN, where you can request an account for testing and for preparing a project proposal. The proposals do not need to be very long, and if needed you can contact Scientific Computing for support.
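
To illustrate the point on requesting resources, here is a minimal sketch of a job script that asks only for the cores and memory it needs; the partition name is taken from above, while the script names and the program call are placeholders:

  #!/bin/bash
  #SBATCH --partition=carl.p
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=4        # request only the cores this step really uses
  #SBATCH --mem-per-cpu=5000M      # default memory per core in carl.p
  #SBATCH --time=0-06:00

  ./my_step --threads $SLURM_CPUS_PER_TASK   # placeholder for the actual program

Two such steps with different core counts can then be chained with a dependency, and the memory usage can be checked afterwards:

  # start the second step only after the first one has finished successfully
  jobid=$(sbatch --parsable step1.sh)
  sbatch --dependency=afterok:$jobid step2.sh

  # after the job has finished, compare the requested and the actually used memory
  sacct -j $jobid --format=JobID,Elapsed,ReqMem,MaxRSS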
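
The packing of many single-core tasks onto complete nodes can look roughly like the following sketch, which uses GNU parallel together with srun; the module name, the program my_task, and the parameter file are placeholders, and the exact options may differ from those shown in the referenced section 4:

  #!/bin/bash
  #SBATCH --partition=carl.p
  #SBATCH --nodes=1
  #SBATCH --ntasks=24              # one complete node
  #SBATCH --time=0-12:00

  module load parallel             # assuming GNU parallel is provided as a module

  # run 24 single-core tasks at a time until all parameter sets are processed
  parallel -j $SLURM_NTASKS "srun --ntasks=1 --exclusive ./my_task {}" :::: parameters.txt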
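
For the self-limiting mentioned above, job arrays accept a throttle after a percent sign, and job chains can be built with dependencies; a short sketch with placeholder script names and numbers:

  # run a 500-task array, but never more than 20 tasks at the same time
  sbatch --array=1-500%20 my_array_job.sh

  # or give the jobs a common name so that only one of them runs at a time
  sbatch --dependency=singleton --job-name=mychain my_job.sh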

Of course, these are only guidelines and not everything can be applied in every situation. But the more computing time you are using or planning to use, the more you should think about how to organize your jobs on the cluster. In case you need support, please contact Scientific Computing.

Maximum Number of Running Jobs per Group

In addition to the scheduling policies explained above, we are now enforcing a limit on the maximum number of jobs that can be run by the members of a research group (identified by the Unix group agname). Most users will not notice this limitation, except when running large job arrays with more than 250 tasks. In that case, at most 250 tasks will be running at the same time, while the remaining tasks of the array wait in the queue with the reason AssociationJobLimit.
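
Whether pending array tasks are being held back by this limit can be checked with squeue; the output format below is just one possible choice:

  # list your jobs together with the reason why they are still pending
  squeue -u $USER -o "%.18i %.12j %.2t %.20r"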

A problem can occur if too many users from the same group run large job arrays at the same time. In that case, it is recommended to use the parallel command, which allows all tasks to be run within a single job. Alternatively, the users of a group can throttle their job arrays so that everyone in the group is able to run at least a few simulations.