Latest revision as of 13:57, 5 July 2021
Introduction
The basic resource in Slurm for computations is a compute node. Compute nodes are organized into partitions, which are simply logical sets of compute nodes. Partitions may overlap, and they typically, but not always, correspond to specific node configurations.
Partitions define limitations that restrict the resources that can be requested for a job submitted to that partition. The limitations affect the maximum run time, the amount of memory, and the number of available CPU cores (which are called CPUs in Slurm). In addition, partitions may also define default resources that are automatically allocated for jobs if nothing has been specified.
Jobs should be submitted to the partition that best matches the required resources. For example, a job that requires 50G of RAM is better submitted to a partition with a RAM limit of 117G than to a partition with a RAM limit of 495G. That way, as few resources as possible are blocked, and another user who needs more RAM can run a job earlier. Of course, other considerations may also influence the choice of a partition.
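As a rough illustration of this rule, the following sketch picks the CARL partition with the smallest per-node memory limit that still fits a job. The function name pick_partition is hypothetical, and the limits are simply copied from the summary table on this page; memory is only one of several criteria you may want to consider.

```shell
#!/bin/bash
# Hypothetical helper: choose the CARL partition with the smallest
# per-node memory limit (in GB) that still fits the requested memory.
# The limits are taken from the summary table on this page.
pick_partition() {
    local need_gb=$1
    local entry name limit
    # Entries are ordered from the smallest to the largest memory limit.
    for entry in mpcl.p:117 mpcs.p:243 mpcb.p:495 mpcp.p:1975; do
        name=${entry%%:*}
        limit=${entry##*:}
        if [ "$need_gb" -le "$limit" ]; then
            echo "$name"
            return 0
        fi
    done
    echo "no single node offers ${need_gb}G" >&2
    return 1
}

pick_partition 50     # prints mpcl.p
pick_partition 300    # prints mpcb.p
```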
Summary of Available Partitions
CARL and EDDY have several partitions available, and most of them are directly connected to a node type (which may differ in the number of CPU cores, the amount of RAM, and so on). Exceptions are the partitions carl.p and eddy.p, which serve as default partitions and combine nodes of different types. The following table gives an overview of the available partitions and their resource limits.
Partition | Node Type | Node Count | CPU Cores | Default RunTime | Default Memory per Core | Max Memory per Node | Misc |
---|---|---|---|---|---|---|---|
CARL | | | | | | | |
mpcs.p | MPC-STD | 158 | 24 | 2h | 10 375M | 243G | |
mpcl.p | MPC-LOM | 128 | 24 | 2h | 5 000M | 117G | |
mpcb.p | MPC-BIG | 30 | 16 | 2h | 30G | 495G | GTX 1080 (4 nodes with 2 GPUs each) |
mpcp.p | MPC-PP | 2 | 40 | 2h | 50G | 1975G | |
mpcg.p | MPC-GPU | 9 | 24 | 2h | 10 375M | 243G | 1-2x Tesla P100 GPU |
carl.p | Combines mpcl.p and mpcs.p; defaults are as for mpcl.p | | | | | | |
EDDY | | | | | | | |
cfdl.p | CFD-LOM | 160 | 24 | 2h | 2 333M | 56G | |
cfdh.p | CFD-HIM | 81 | 24 | 2h | 5 000M | 117G | |
cfdg.p | CFD-GPU | 3 | 24 | 2h | 10G | 243G | 1x Tesla P100 GPU |
eddy.p | Combines cfdl.p and cfdh.p; defaults are as for cfdl.p | | | | | | |
The default time is used if a job is submitted without specifying a maximum runtime with the option --time. Likewise, a job submitted without one of the options --mem or --mem-per-cpu is automatically allocated the default memory per requested core.
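To make the memory default concrete, here is a small sketch of the arithmetic: the memory a job receives without --mem or --mem-per-cpu is the number of requested cores multiplied by the partition's default memory per core. The function name default_mem_mb is hypothetical; the value 5000 is the mpcl.p default from the table above.

```shell
#!/bin/bash
# Hypothetical helper: memory (in MB) that a job gets by default when
# neither --mem nor --mem-per-cpu is given.
# $1 = number of requested cores, $2 = default memory per core in MB.
default_mem_mb() {
    local cores=$1 per_core_mb=$2
    echo $(( cores * per_core_mb ))
}

# A job with 4 cores on mpcl.p (default 5000M per core):
default_mem_mb 4 5000    # prints 20000

# To avoid relying on the defaults, request time and memory explicitly,
# for example (illustrative values):
#   sbatch --time=0-04:00 --mem-per-cpu=2G jobscript.sh
```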
Getting Information about the Partitions on CARL and EDDY
To get more detailed information about a partition and its resource limits, you can use the command scontrol:
$ scontrol show part carl.p
PartitionName=carl.p
   AllowGroups=carl,hrz AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=02:00:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=21-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=24
   Nodes=mpcl[001-128],mpcs[001-158]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO PreemptMode=OFF
   State=UP TotalCPUs=6864 TotalNodes=286 SelectTypeParameters=NONE
   DefMemPerCPU=5000 MaxMemPerNode=120000
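If you only need a single limit in a script, the key=value pairs of this output can be filtered with standard tools. The following is a minimal sketch; the function name part_field is hypothetical, and the sample text is an excerpt of the output shown above so that the example also works without access to the cluster.

```shell
#!/bin/bash
# Hypothetical helper: print the value of one key from
# "scontrol show part" output read on standard input.
part_field() {
    # Put every key=value pair on its own line, then select the key.
    tr -s ' ' '\n' | awk -F= -v key="$1" '$1 == key { print $2 }'
}

# Excerpt of the carl.p output shown above.
sample='   MaxNodes=UNLIMITED MaxTime=21-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=24
   DefMemPerCPU=5000 MaxMemPerNode=120000'

printf '%s\n' "$sample" | part_field MaxMemPerNode    # prints 120000
printf '%s\n' "$sample" | part_field MaxTime          # prints 21-00:00:00
```

On the cluster, the same function can read the live output, e.g. scontrol show part carl.p | part_field DefMemPerCPU.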
The SLURM job scheduler can also provide information about the current status of the cluster and the partitions using the command sinfo. Since we have quite a few partitions, it is a good idea to add an option to view only the information about a specific partition:
$ sinfo -p carl.p
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
carl.p       up 21-00:00:0      2   mix$ mpcl032,mpcs112
carl.p       up 21-00:00:0      4 drain* mpcl[076,106,113],mpcs019
carl.p       up 21-00:00:0    132    mix mpcl[006-009,013,016,018,020,022-029,033,038-040,043,046,063-070,074-075,087-105,107,112,114-126],mpcs[001,003,005,008,018,044,055-059,063-105,113,136-139,152-158]
carl.p       up 21-00:00:0    143  alloc mpcl[001-005,010-012,014-015,017,019,021,030-031,034-037,041-042,044-045,047-062,071-073,077-086,108-111,127-128],mpcs[002,004,006-007,009-017,020-043,045-054,060,062,106-111,114-135,144-151]
carl.p       up 21-00:00:0      5   idle mpcs[061,140-143]
In the output, you can see the STATE of the nodes, which can be idle if the node is free, alloc if the node is busy, and mix if the node is busy but has free resources available. Other states can be drain or down, if the node is not available.
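For a quick overview, the node counts can be summed per state. The sketch below assumes input lines of the form "<node count> <state>", which on the cluster could be produced with sinfo -h -p carl.p -o '%D %t'; the function name count_by_state is hypothetical, and the sample numbers are taken from the output above. Flag characters appended to a state (such as $ or *) are stripped so that, for example, mix$ is counted as mix.

```shell
#!/bin/bash
# Hypothetical helper: sum node counts per state.
# Reads lines of the form "<node count> <state>" on standard input.
count_by_state() {
    awk '{
        gsub(/[^a-z]/, "", $2)    # drop flag suffixes such as "$" or "*"
        nodes[$2] += $1
    }
    END {
        for (state in nodes) print state, nodes[state]
    }' | sort
}

# Node counts and states copied from the sinfo output above.
printf '%s\n' '2 mix$' '4 drain*' '132 mix' '143 alloc' '5 idle' | count_by_state
# prints:
#   alloc 143
#   drain 4
#   idle 5
#   mix 134
```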
The command sinfo has many additional options to modify the output. Some important ones are:
- -a, --all
- Display information about all partitions. You will even see partitions that are not available for your group and hidden partitions.
- -l, --long
- Display more detailed information about the available partitions.
- -N, --Node
- Display a list of every available node.
- -n <nodes>, --nodes=<nodes>
- Display information about specific nodes. Multiple nodes may be given as a comma-separated list. You can also specify a range of nodes, e.g. mpcs[100-120].
- -O <output_format>, --Format=<output_format>
- Specify the information you want to be displayed.
- If you want to, for example, display the node hostname, the number of CPUs, the CPU load, the amount of free memory, the size of temporary disk, the size of memory per node (in megabytes) you could use the following command:
$ sinfo -O nodehost,cpus,cpusload,freemem,disk,memory
HOSTNAMES  CPUS  CPU_LOAD  FREE_MEM  TMP_DISK  MEMORY
cfdh076    24    1.01      97568     115658    128509
.
.
.
- The size of each field can be modified (syntax: type[:[.]size]) to match your needs, for example like this:
$ sinfo -O nodehost:8,cpus:5,cpusload:8,freemem:10,disk:10,memory:8
HOSTNAMECPUS CPU_LOADFREE_MEM  TMP_DISK  MEMORY
cfdh076 24   1.01    97568     115658    128509
.
.
.
The full list of options and further information about the command sinfo can be found in the sinfo documentation: https://slurm.schedmd.com/sinfo.html
Usage of the Partitions on CARL/EDDY
To select a partition for your job, you can either pass an option to the sbatch command, e.g.
sbatch --partition carl.p jobscript.sh
or you add a corresponding line to your job script, e.g.
#SBATCH --partition carl.p
You can override a setting in a job script by giving the same option on the command line. You should always specify a partition when submitting a job to the cluster.
To simplify job submission, you can always specify either the carl.p or the eddy.p partition if your job has no special resource requirements. Do not use the all_nodes.p partition unless you have a good reason for it and your job does not run for more than one day.
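Putting these pieces together, a minimal job script might look like the sketch below. The time and memory values are illustrative, and the function name report_partition is hypothetical; SLURM_JOB_PARTITION is the environment variable that Slurm sets inside a running job, so outside of a job the script falls back to a placeholder.

```shell
#!/bin/bash
#SBATCH --partition carl.p
#SBATCH --time 0-01:00
#SBATCH --mem-per-cpu 2G

# Hypothetical helper: report the partition the job is running in.
report_partition() {
    echo "Running in partition: ${SLURM_JOB_PARTITION:-unknown}"
}

report_partition
```

Submitting this script with sbatch --partition mpcs.p jobscript.sh would run it in mpcs.p, because the command-line option takes precedence over the #SBATCH line.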
Using GPU Partitions
When using GPU partitions, you must also request the GPUs explicitly with an additional option, because GPUs are not allocated simply by selecting the partition. On the command line, it would look like this:
$ sbatch --partition mpcg.p --gres=gpu:1 job_with_gpu.sh
Alternatively, you can add the options to the job script:
#SBATCH --partition mpcg.p
#SBATCH --gres=gpu:1
The option --gres allocates a Generic RESource in the form <type>:<count>. The <count> is defined per compute node, so in the case of GPUs the <count> can be 1 or 2 (because we have GPU nodes with up to two GPUs).
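To check inside a job how many GPUs were actually allocated, a job script can inspect the environment. The sketch below assumes that Slurm exports the allocated devices in CUDA_VISIBLE_DEVICES as a comma-separated list, which is the usual behaviour for GPU resources but depends on the cluster configuration; the function name gpu_count is hypothetical.

```shell
#!/bin/bash
#SBATCH --partition mpcg.p
#SBATCH --gres=gpu:2

# Hypothetical helper: count the GPUs listed in CUDA_VISIBLE_DEVICES.
# Assumption: the variable holds a comma-separated list of device
# indices; if it is unset or empty, no GPU is counted.
gpu_count() {
    local list=${CUDA_VISIBLE_DEVICES:-}
    if [ -z "$list" ]; then
        echo 0
    else
        printf '%s\n' "$list" | tr ',' '\n' | wc -l | tr -d ' '
    fi
}

echo "GPUs visible to this job: $(gpu_count)"
```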
If you are running non-GPU jobs in one of the GPU-partitions, make sure that there are always a few cores available so that a GPU job is not blocked from using that node.
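To learn more about submitting jobs, you might want to take a look at the information on sbatch options: https://wiki.hpcuser.uni-oldenburg.de/index.php?title=SLURM_Job_Management_(Queueing)_System#Information_on_sbatch-options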