Queues and resource allocation
The thing about queues is that, in general, you don't have to worry about them. Ideally, you only specify resources for the job you are about to submit. In doing so you provide enough information for the scheduler to decide which queue the job belongs in. Hence, you explicitly allocate resources and implicitly choose a queue. However, in some cases, e.g. when a job has to run on particular hardware components of the cluster, it is useful to know which resources need to be allocated in order to address a queue running on that component.
Although you (as a user) should worry about specifying resources rather than targeting queues, it is useful to disentangle the relationship between the queues implemented on the HPC system and the resources that need to be specified for the scheduler to address a given queue. Also, some of you might be familiar with the concept of queues and prefer to think in terms of them.
Listing all possible queues
Thinking in terms of queues, you might be interested to see which queues exist on the HPC system. Logged in to your HPC account, you obtain a full list of all possible queues a job might be placed in by typing the command `qconf -sql`. `qconf` is a grid engine configuration tool which, among other things, allows you to list existing queues and queue configurations. In casual terms, the sequence of options `-sql` demands: show (`s`) queue (`q`) list (`l`).
As a result you might find the following list of queues:
```
cfd_him_long.q
cfd_him_shrt.q
cfd_lom_long.q
cfd_lom_serl.q
cfd_lom_shrt.q
cfd_xtr_expr.q
cfd_xtr_iact.q
glm_dlc_long.q
glm_dlc_shrt.q
glm_qdc_long.q
glm_qdc_shrt.q
mpc_big_long.q
mpc_big_shrt.q
mpc_std_long.q
mpc_std_shrt.q
mpc_xtr_ctrl.q
mpc_xtr_iact.q
mpc_xtr_subq.q
uv100_smp_long.q
uv100_smp_shrt.q
```
Obtaining elaborate information for a particular queue
To obtain more details about the configuration of a particular queue you just need to specify that queue. E.g., to get elaborate information on the queue `mpc_std_shrt.q`, just type `qconf -sq mpc_std_shrt.q`, which yields
```
qname                 mpc_std_shrt.q
hostlist              @mpcs
seq_no                10000,[mpcs001.mpinet.cluster=10001], \
                      [mpcs002.mpinet.cluster=10002], \
                      ...
                      [mpcs123.mpinet.cluster=10123], \
                      [mpcs124.mpinet.cluster=10124]
load_thresholds       np_load_avg=1.75,slots=0
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH
ckpt_list             NONE
pe_list               impi impi41 linda molcas mpich mpich2_mpd mpich2_smpd \
                      openmpi smp mdcs
rerun                 FALSE
slots                 12
tmpdir                /scratch
shell                 /bin/bash
prolog                root@/cm/shared/apps/sge/scripts/prolog_mpc.sh
epilog                root@/cm/shared/apps/sge/scripts/epilog_mpc.sh
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            herousers
xuser_lists           NONE
subordinate_list      NONE
complex_values        h_vmem=23G,h_fsize=800G,cluster=hero
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  192:0:0
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY
```
Among the listed resource attributes some stand out:

- `pe_list`: specifies the list of parallel environments available for the queue.
- `hostlist`: specifies the list of hosts on which the respective queue is implemented. Here, the name of the hostlist is `@mpcs`. You can view the hosts contained in this group by means of the command `qconf -shgrp @mpcs`, where `-shgrp` stands for show (`s`) host group (`hgrp`).
- `complex_values`: a list of complex resource attributes a user might allocate for his jobs using the `qsub -l` option. E.g., the queue configuration value `h_vmem` is used for the virtual memory size, limiting the amount of total memory a job might consume. An entry in the `complex_values` list of the queue configuration defines the total available amount of virtual memory on a host or a queue.
- `slots`: number of slots available on the host. They might be shared among all the queues that run on the host.
- `h_rt`: specifies a requestable resource of type time. A submitted job is only eligible to run in this queue if the specified maximal value of `h_rt` = 192 h is not exceeded.
- `user_lists`: list of users that are eligible to place jobs in the queue.
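As a small aside, the `complex_values` entry is a plain comma-separated list of `attribute=value` pairs, so it can be inspected with standard shell tools. A minimal sketch (the value string is copied verbatim from the listing above):

```shell
# Split a complex_values string (copied from the qconf output above) into
# one attribute=value pair per line:
echo 'h_vmem=23G,h_fsize=800G,cluster=hero' | tr ',' '\n'
```

This prints `h_vmem=23G`, `h_fsize=800G` and `cluster=hero` on separate lines, which is handy when piping `qconf -sq` output into further filters.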
Requestable resources
The type and amount of requestable resources differ from queue to queue. To build intuition, compare, e.g., the main resources for `mpc_std_shrt.q` and `mpc_std_long.q`:
```
$ qconf -sq mpc_std_shrt.q | grep "qname\|hostlist\|complex_values\|h_rt"
qname                 mpc_std_shrt.q
hostlist              @mpcs
complex_values        h_vmem=23G,h_fsize=800G,cluster=hero
h_rt                  192:0:0
```
and
```
$ qconf -sq mpc_std_long.q | grep "qname\|hostlist\|complex_values\|h_rt"
qname                 mpc_std_long.q
hostlist              @mpcs
complex_values        h_vmem=23G,h_fsize=800G,cluster=hero,longrun=true
h_rt                  INFINITY
```
Note that both queues run on the same hosts, i.e. both have identical hostlists. However, the requestable resource `h_rt` and the list of complex values associated with the two queues differ. At this point, details on the resource `h_rt` can once more be obtained using the `qconf` command:
```
$ qconf -sc | grep "h_rt\|#"
#name     shortcut  type  relop  requestable  consumable  default  urgency
#-------------------------------------------------------------------------
h_rt      h_rt      TIME  <=     YES          NO          0:0:0    0
```
As can be seen, the relational operator associated with `h_rt` reads less than or equal. I.e., to be eligible for the short queue, a job is not allowed to request more than 192 h of running time. Regarding the long queue, there is no upper bound on the running time, and a job with properly allocated resources might be put in this queue.
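The `h_rt` limit is given in `hours:minutes:seconds`. A quick sanity check using plain shell arithmetic (no grid engine involved) confirms that the short-queue bound of `192:0:0` corresponds to 8 days:

```shell
# h_rt=192:0:0 means 192 hours of wall-clock time; convert to days:
h_rt_hours=192
echo "$((h_rt_hours / 24)) days"   # prints: 8 days
```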
Further, note that the long queue features one more complex value than the short queue, namely `longrun`. Details about this resource are:
```
$ qconf -sc | grep "longrun\|#"
#name     shortcut  type  relop  requestable  consumable  default  urgency
#-------------------------------------------------------------------------
longrun   lr        BOOL  ==     FORCED       NO          FALSE    0
```
So, `longrun` is of type BOOL and has the default value FALSE. In order to place a job in the long queue one has to explicitly request `longrun=true`, see here.
As a further detail, consider the requestable resource `h_vmem`. Details about this resource are:
```
$ qconf -sc | grep "h_vmem\|#"
#name     shortcut  type    relop  requestable  consumable  default  urgency
#---------------------------------------------------------------------------
h_vmem    h_vmem    MEMORY  <=     YES          YES         1200M    0
```
I.e., it is specified as a consumable resource. Say you submit a single-slot job to the short queue (which, by default, offers 23G per host), requesting `h_vmem=4G`. Then this amount of memory is consumed, leaving 19G for further usage.
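The consumable bookkeeping sketched above is simple subtraction. As an illustration (values taken from the text, not queried from the scheduler):

```shell
# A standard host offers h_vmem=23G in total; a single-slot job requesting
# h_vmem=4G consumes that amount, leaving the rest for further jobs:
total_gb=23
request_gb=4
echo "remaining h_vmem: $((total_gb - request_gb))G"   # prints: remaining h_vmem: 19G
```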
Finally, consider the resource `h_fsize`, used on HERO to specify the required scratch disk size. The resource details read:
```
$ qconf -sc | grep "h_fsize\|#"
#name     shortcut  type    relop  requestable  consumable  default  urgency
#---------------------------------------------------------------------------
h_fsize   h_fsize   MEMORY  <=     YES          JOB         10G      0
```
Hence, a job is eligible to run on either queue only if the requested amount of scratch disk space is less than 800G. In effect, if a job needs more than that, it cannot run on a standard HERO node at all. In such a case one has to request a big node, see here.
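Note the `JOB` entry in the consumable column: in the grid engine, a consumable marked `YES` (like `h_vmem`) is charged once per slot, whereas a `JOB` consumable (like `h_fsize`) is charged once per job regardless of its slot count. A small sketch of the resulting accounting for a hypothetical 4-slot parallel job (the numbers are illustrative, not from the source):

```shell
# Hypothetical 4-slot parallel job requesting h_vmem=4G and h_fsize=100G:
slots=4
h_vmem_per_slot_gb=4
h_fsize_gb=100

# h_vmem is a per-slot (YES) consumable: charged once per slot
echo "h_vmem charged: $((slots * h_vmem_per_slot_gb))G"   # 16G in total
# h_fsize is a per-job (JOB) consumable: charged once, regardless of slots
echo "h_fsize charged: ${h_fsize_gb}G"                    # 100G in total
```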
Listing slots occupied via a particular queue
To list how many jobs are running through a particular queue on the hosts within its associated hostlist you can simply use the command `qstat`. E.g., to list the jobs that run via the queue `mpc_std_shrt.q` on the hosts in its hostlist (i.e. the hosts contained in `@mpcs`) simply type:
```
$ qstat -f -l qname=mpc_std_shrt.q
queuename                  qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
mpc_std_shrt.q@mpcs001     BP    0/9/12         9.27     lx26-amd64
---------------------------------------------------------------------------------
mpc_std_shrt.q@mpcs002     BP    0/9/12         9.11     lx26-amd64
---------------------------------------------------------------------------------
mpc_std_shrt.q@mpcs003     BP    0/8/12         12.19    lx26-amd64
...
---------------------------------------------------------------------------------
mpc_std_shrt.q@mpcs017     BP    0/6/12         6.31     lx26-amd64
 873601 0.50500 ksp_L1024  alxo9476     r     08/25/2013 01:37:51     1 31
...
```
Information on how many jobs are running via the specified queue is given by the three-tuple of numbers in the third column of the list, specifying the number of reserved/used/total slots on the respective host. Your own jobs are listed underneath their respective host entry, as you can see for one of my jobs running on host `mpcs017`. E.g., as can be seen from the first entry in the list, the host `mpcs001` has 9 out of 12 possible slots occupied with jobs supplied via the `mpc_std_shrt.q`.
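If you want to extract the free-slot count from such listings programmatically, the `resv/used/tot.` column can simply be split on `/`. A small `awk` sketch over sample lines copied from the output above:

```shell
# Count free slots per queue instance from (sample) `qstat -f` output lines;
# the sample is copied from the listing above rather than queried live:
qstat_sample='mpc_std_shrt.q@mpcs001 BP 0/9/12 9.27 lx26-amd64
mpc_std_shrt.q@mpcs002 BP 0/9/12 9.11 lx26-amd64
mpc_std_shrt.q@mpcs003 BP 0/8/12 12.19 lx26-amd64'

echo "$qstat_sample" | awk '{
    split($3, s, "/")                       # s[1]=resv, s[2]=used, s[3]=tot
    printf "%s: %d free\n", $1, s[3] - s[2]
}'
# prints, e.g.: mpc_std_shrt.q@mpcs001: 3 free
```

On a live system you would pipe `qstat -f -l qname=mpc_std_shrt.q` into the same `awk` program, after filtering out the separator and job lines.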
In principle there are 3 slots left, which might be occupied by jobs from other queues that run on that host (if, e.g., enough memory resources are available to do so). As a detail, in order to list the overall number of jobs on a particular host you might use `qstat` in conjunction with the `hostname` keyword to filter for that host. E.g., to see in detail what is going on at host `mpcs001` you might type
```
$ qstat -f -l hostname=mpcs001
queuename                  qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
mpc_std_shrt.q@mpcs001     BP    0/9/12         10.21    lx26-amd64
---------------------------------------------------------------------------------
mpc_std_long.q@mpcs001     BP    0/0/12         10.21    lx26-amd64
```
Apparently, instances of the queues `mpc_std_shrt.q` and `mpc_std_long.q` are running on that host (this does not come as a surprise, since both queues have identical hostlists).
However, only 9 out of 12 slots are occupied. In principle, the scheduler follows a fill-up rule wherein jobs are assigned to a host until it is filled up before the next host is considered. According to the above list, host `mpcs002` already has 9 slots filled while host `mpcs001` still has 3 vacant slots. Why is this? There are many possible reasons; in 90 percent of the cases the reason is that, although host `mpcs001` offers further slots, it cannot offer further memory for a job. That this is also the case here you can check by monitoring the current value of the consumable resource `h_vmem` for that host. You simply have to type:
```
$ qstat -F -l hostname=mpcs001 | grep "qname\|h_vmem"
	hc:h_vmem=360.000M
	qf:qname=mpc_std_shrt.q
	hc:h_vmem=360.000M
	qf:qname=mpc_std_long.q
```
This shows that for both queues only `h_vmem=360M` is available. Usually there is no job requesting less than that amount of memory, and consequently the slots will not be occupied!
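Whether a further job fits on such a host boils down to comparing its `h_vmem` request against the reported `hc:h_vmem` value. A sketch with the numbers from the sample output above:

```shell
# mpcs001 reports hc:h_vmem=360.000M; compare against a job's request:
available_mb=360
request_mb=1200   # the default request of h_vmem=1200M

if [ "$request_mb" -le "$available_mb" ]; then
    echo "job fits on this host"
else
    echo "job does not fit (${request_mb}M requested, ${available_mb}M free)"
fi
# prints: job does not fit (1200M requested, 360M free)
```

So even a job relying on the default memory request cannot be placed on this host, which explains the vacant slots.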
Listing the overall number of free slots
A great tool that allows you to view the current cluster status in terms of occupied/free slots and nodes is `qfreenodes`. Called on the command line, it lists the details of occupied/free slots and nodes for all available host types:
```
$ qfreenodes
Free nodes:
  cfdl:   0 (   0 cores) of 122 (1464 cores)
  cfdh:  25 ( 300 cores) of  64 ( 768 cores)
  cfdx:   5 (  40 cores) of   7 (  56 cores)
  glmd:   8 (  32 cores) of  42 ( 168 cores)
  mpcb:   0 (   0 cores) of  20 ( 240 cores)
  mpcs:   6 (  60 cores) of 130 (1548 cores)
  mpcx:   0 (   0 cores) of   0 (   0 cores)
Flagged free nodes (not included above):
(potentially not usable due to errors, use -c for more information)
  cfdx: 1
  glmd: 2
  mpcb: 1
Cores:
  cfdl:   12 free, 1452 used, 0 reserved
  cfdh:  300 free,  468 used, 0 reserved
  cfdx:   55 free,    1 used, 0 reserved
  glmd:  136 free,   32 used, 0 reserved
  mpcb:   69 free,  171 used, 0 reserved
  mpcs:  502 free, 1046 used, 0 reserved
  mpcx:    0 free,    0 used, 0 reserved
```
However, note that the free vs. used numbers have to be taken with a grain of salt: due to a lack of other resources (e.g. memory), a free slot might actually not be occupiable by a job.
Examples: addressing particular hardware components/queues
Subsequently, a few examples of the resource specifications required to address particular hardware components/queues are detailed. For completeness, a list relating the different hosts and the queue instances running thereon (separately for both HPC components) is given below:
| Hosts | Host group list | List of queue instances |
|---|---|---|
| cfdh001 … cfdh064 | @cfdh | cfd_him_shrt.q, cfd_him_long.q |
| cfdl001 … cfdl122 | @cfdl | cfd_lom_shrt.q, cfd_lom_long.q, cfd_lom_serl.q |
| cfdx003 … cfdx007 | | cfd_xtr_expr.q |
| cfdx001 … cfdx002 | | cfd_xtr_iact.q |
| Hosts | Host group list | List of queue instances |
|---|---|---|
| glmd001 … glmd042 | @glmd | glm_dlc_shrt.q, glm_dlc_long.q |
| glmq001 … glmq015 | @glmq | glm_qdc_shrt.q, glm_qdc_long.q |
| mpcb001 … mpcb020 | @mpcb | mpc_big_shrt.q, mpc_big_long.q |
| mpcs001 … mpcs124 | @mpcs | mpc_std_shrt.q, mpc_std_long.q |
| mpcs125 … mpcs130 | @mpcx | mpc_xtr_subq.q, mpc_xtr_iact.q, mpc_xtr_ctrl.q |
| uv100 | | uv100_smp_shrt.q, uv100_smp_long.q |
As pointed out earlier, in some cases it might be useful to know which resource allocation statements in your job submission script allow to choose a particular queue instance (or node type) for a job.
Resource allocation statements required for mpc_std_shrt.q
To set up a job submission script that contains all the resource allocation statements needed to place a job in an instance of the `mpc_std_shrt.q` queue, let's first list the characteristic properties of that queue using a call to the `qconf` tool:
```
$ qconf -sq mpc_std_shrt.q | grep "qname\|hostlist\|complex_values\|h_rt"
qname                 mpc_std_shrt.q
hostlist              @mpcs
complex_values        h_vmem=23G,h_fsize=800G,cluster=hero
h_rt                  192:0:0
```
As can be seen, a job placed in this queue will run on one of the hosts contained in the hostlist `@mpcs`, i.e. one of the execution hosts `mpcs001 … mpcs124`. To eventually (!; see discussion below) place a job in a queue instance of this type (e.g. the queue instance `mpc_std_shrt.q@mpcs001`), your job submission script should look similar to the following:
```
#!/bin/bash
#$ -S /bin/bash
#$ -cwd

## resource allocation statements:
#$ -l h_rt=0:10:0
#$ -l h_vmem=200M
#$ -l h_fsize=100M

#$ -N test

./myExample
```
Although you should provide estimates for the requested resources, you are not forced to do so. Most of the resources you might want to request have meaningful default values:

- default scratch space requirement: `h_fsize=10G`,
- default memory requirement: `h_vmem=1200M`,
- default running time: `h_rt=0:0:0`.

Thus, at least for the requested running time you should provide a nonzero estimate.
NOTE:

- If your job needs more than 800G scratch space, you need to request a big node.
- If your job needs to run longer than 192h (i.e. 8d), you need to request an instance of a long queue.
- The same resource allocation statements suffice for the queue `mpc_xtr_subq.q`, with queue instances on the hosts in `@mpcx` (i.e. nodes `mpcs125 … mpcs130`). If nodes `mpcs001 … mpcs124` are already filled up, the scheduler resorts to one of these queue instances as an alternative. The difference is: on hosts `mpcs125 … mpcs130`, 2 slots, 4G memory and 200G scratch space are reserved for other, interactive tasks. One way to ensure this is to set up a separate queue with queue instances on these hosts. Since, in principle, the concept of a queue is hidden from the user (who has to worry about resources only), this is a convenient approach.
Resource allocation statements required for mpc_std_long.q
To set up a job submission script that contains all the resource allocation statements needed to place a job in an instance of the `mpc_std_long.q` queue, let's first list the characteristic properties of that queue using a call to the `qconf` tool:
```
$ qconf -sq mpc_std_long.q | grep "qname\|hostlist\|complex_values\|h_rt"
qname                 mpc_std_long.q
hostlist              @mpcs
complex_values        h_vmem=23G,h_fsize=800G,cluster=hero,longrun=true
h_rt                  INFINITY
```
As can be seen, a job placed in this queue will run on one of the hosts contained in the hostlist `@mpcs`, i.e. one of the execution hosts `mpcs001 … mpcs124`. To place a job in a queue instance of this type (e.g. the queue instance `mpc_std_long.q@mpcs001`), your job submission script should look similar to the following:
```
#!/bin/bash
#$ -S /bin/bash
#$ -cwd

## resource allocation statements:
#$ -l longrun=True
#$ -l h_rt=200:0:0
#$ -l h_vmem=200M
#$ -l h_fsize=100M

#$ -N test

./myExample
```
Although you should provide estimates for the requested resources, you are not forced to do so. Most of the resources you might want to request have meaningful default values:

- default scratch space requirement: `h_fsize=10G`,
- default memory requirement: `h_vmem=1200M`,
- default running time: `h_rt=0:0:0`,
- default `longrun` flag: `False`.

Thus, at least for the requested running time you should provide a nonzero estimate (ideally larger than 192h), and you need to force the `longrun` flag to take the boolean value `True`.
NOTE:

- This queue is appropriate when your job needs less than 800G scratch space and more than 192h to finish.
- If your job needs more than 800G scratch space, you need to request a big node.
Resource allocation statements required for mpc_big_shrt.q
To set up a job submission script that contains all the resource allocation statements needed to place a job in an instance of the `mpc_big_shrt.q` queue, let's first list the characteristic properties of that queue using a call to the `qconf` tool:
```
$ qconf -sq mpc_big_shrt.q | grep "qname\|hostlist\|complex_values\|h_rt"
qname                 mpc_big_shrt.q
hostlist              @mpcb
complex_values        h_vmem=46G,h_fsize=2100G,cluster=hero,bignode=true
h_rt                  192:0:0
```
As can be seen, a job placed in this queue will run on one of the hosts contained in the hostlist `@mpcb`, i.e. one of the execution hosts `mpcb001 … mpcb020`. To place a job in a queue instance of this type (e.g. the queue instance `mpc_big_shrt.q@mpcb001`), your job submission script should look similar to the following:
```
#!/bin/bash
#$ -S /bin/bash
#$ -cwd

## resource allocation statements:
#$ -l bignode=True
#$ -l h_rt=01:0:0
#$ -l h_vmem=30G
#$ -l h_fsize=100M

#$ -N test

./myExample
```
Although you should provide estimates for the requested resources, you are not forced to do so. Most of the resources you might want to request have meaningful default values:

- default scratch space requirement: `h_fsize=10G`,
- default memory requirement: `h_vmem=1200M`,
- default running time: `h_rt=0:0:0`,
- default `bignode` flag: `False`.

Thus, at least for the requested running time you should provide a nonzero estimate, and you need to force the `bignode` flag to take the boolean value `True`.
NOTE:

- This queue is appropriate when your job needs less than 192h to finish and needs more than 800G scratch space or more memory than a standard node is able to offer (i.e. 23G).
- There is also a long version of this queue, which might be addressed by setting the `longrun` flag to the boolean value `True`, i.e. by adding a line with content `#$ -l longrun=True` to your job submission script.
Resource allocation statements required for glm_dlc_shrt.q
To set up a job submission script that contains all the resource allocation statements needed to place a job in an instance of the `glm_dlc_shrt.q` queue, let's first list the characteristic properties of that queue using a call to the `qconf` tool:
```
$ qconf -sq glm_dlc_shrt.q | grep "qname\|hostlist\|complex_values\|h_rt\|slots"
qname                 glm_dlc_shrt.q
hostlist              @glmd
slots                 4
complex_values        h_vmem=7G,h_fsize=250G,cluster=hero,golem=true
h_rt                  INFINITY
```
As can be seen, a job placed in this queue will run on one of the hosts contained in the hostlist `@glmd`, i.e. one of the execution hosts `glmd001 … glmd042`. To place a job in a queue instance of this type (e.g. the queue instance `glm_dlc_shrt.q@glmd001`), your job submission script should look similar to the following:
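A submission script for this queue might then look similar to the following sketch. Note that, in analogy to the `longrun` and `bignode` flags discussed above, the `golem` flag listed in the `complex_values` presumably has to be requested explicitly; this is an assumption based on that pattern, so verify it against your local setup:

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -cwd

## resource allocation statements (sketch; the golem flag is assumed to be
## a forced complex in analogy to longrun/bignode above):
#$ -l golem=True
#$ -l h_rt=0:10:0
#$ -l h_vmem=1G
#$ -l h_fsize=100M

#$ -N test

./myExample
```

Keep in mind that these hosts offer only `h_vmem=7G` in total across their 4 slots, so memory requests should be sized accordingly.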
NOTE:

- The hosts in the hostlist `@glmd` stem from GOLEM (acronym for "Grossrechner Oldenburg für explizit multidisziplinäre Forschung").
- In contrast to the 12 slots a standard node on HERO offers, these nodes feature only 4 slots.
Resource allocation statements required for glm_qdc_shrt.q
To set up a job submission script that contains all the resource allocation statements needed to place a job in an instance of the `glm_qdc_shrt.q` queue, let's first list the characteristic properties of that queue using a call to the `qconf` tool:
```
$ qconf -sq glm_qdc_shrt.q | grep "qname\|hostlist\|complex_values\|h_rt"
qname                 glm_qdc_shrt.q
hostlist              @glmq
complex_values        h_vmem=15G,h_fsize=800G,cluster=hero,golem_quad=true
h_rt                  INFINITY
```
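Following the pattern of the preceding sections, a corresponding submission script might look similar to the sketch below. As before, the assumption that the `golem_quad` flag must be requested explicitly (in analogy to `longrun`/`bignode`) should be verified against your local setup:

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -cwd

## resource allocation statements (sketch; the golem_quad flag is assumed to
## be a forced complex in analogy to longrun/bignode above):
#$ -l golem_quad=True
#$ -l h_rt=0:10:0
#$ -l h_vmem=1G
#$ -l h_fsize=100M

#$ -N test

./myExample
```

A job placed in this queue will run on one of the hosts in the hostlist `@glmq`, i.e. one of the execution hosts `glmq001 … glmq015`.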