Difference between revisions of "Queues and resource allocation"

From HPC users

Revision as of 17:14, 26 August 2013

In general, you do not have to worry about queues. Ideally, you only specify the resources for the job you are about to submit. In doing so, you give the scheduler enough information to decide which queue the job belongs in. Hence, you explicitly allocate resources and implicitly choose a queue. However, in some cases, namely when a job should run on particular hardware components of the cluster, it is beneficial to know which resources need to be allocated in order to reach a queue running on that component.

Although you (as a user) should worry more about specifying resources than about targeting queues, it is useful to disentangle the relationship between the queues implemented on the HPC system and the resources that need to be specified for the scheduler to address a given queue. Also, some of you might be familiar with the concept of queues and prefer to think in terms of them.

Listing all possible queues

Now, thinking in terms of queues, you might be interested to see which queues there are on the HPC system. Logged in to your HPC account, you obtain a full list of all possible queues a job might be placed in by typing the command qconf -sql. qconf is a grid engine configuration tool which, among other things, allows you to list existing queues and queue configurations. In casual terms, the sequence of options -sql demands: show (s) queue (q) list (l). As a result you might find the following list of queues:

 
cfd_him_long.q
cfd_him_shrt.q
cfd_lom_long.q
cfd_lom_serl.q
cfd_lom_shrt.q
cfd_xtr_expr.q
cfd_xtr_iact.q
glm_dlc_long.q
glm_dlc_shrt.q
glm_qdc_long.q
glm_qdc_shrt.q
mpc_big_long.q
mpc_big_shrt.q
mpc_std_long.q
mpc_std_shrt.q
mpc_xtr_ctrl.q
mpc_xtr_iact.q
mpc_xtr_subq.q
uv100_smp_long.q
uv100_smp_shrt.q
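
The queue list can also be processed in a script. The following is a minimal sketch, not part of the cluster's tooling: the function name show_runtime_limits is made up for illustration, and it only combines qconf -sql with qconf -sq to print the h_rt limit of every queue.

```shell
# Print each queue name together with its h_rt (maximal run time) limit.
# Hypothetical helper; it relies only on "qconf -sql" and "qconf -sq <queue>".
show_runtime_limits() {
    local q
    for q in $(qconf -sql); do
        # Pick the h_rt line from the queue configuration and print its value.
        printf '%-20s %s\n' "$q" "$(qconf -sq "$q" | awk '$1 == "h_rt" { print $2 }')"
    done
}

# Usage on a login node:
#   show_runtime_limits
```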
  

Obtaining elaborate information for a particular queue

To obtain more details about the configuration of a particular queue, you just need to specify that queue. E.g., to get detailed information on the queue mpc_std_shrt.q, type qconf -sq mpc_std_shrt.q, which yields

 
qname                 mpc_std_shrt.q
hostlist              @mpcs
seq_no                10000,[mpcs001.mpinet.cluster=10001], \
                      [mpcs002.mpinet.cluster=10002], \
                      ...
                      [mpcs123.mpinet.cluster=10123], \
                      [mpcs124.mpinet.cluster=10124]
load_thresholds       np_load_avg=1.75,slots=0
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH
ckpt_list             NONE
pe_list               impi impi41 linda molcas mpich mpich2_mpd mpich2_smpd \
                      openmpi smp mdcs
rerun                 FALSE
slots                 12
tmpdir                /scratch
shell                 /bin/bash
prolog                root@/cm/shared/apps/sge/scripts/prolog_mpc.sh
epilog                root@/cm/shared/apps/sge/scripts/epilog_mpc.sh
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            herousers
xuser_lists           NONE
subordinate_list      NONE
complex_values        h_vmem=23G,h_fsize=800G,cluster=hero
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  192:0:0
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY
  

Among the listed resource attributes some stand out:

  • pe_list: specifies the list of parallel environments available for the queue.
  • hostlist: specifies the list of hosts on which the respective queue is implemented.

    Here, the name of the hostlist is @mpcs. You can view the full list of its hosts by means of the command qconf -shgrp @mpcs, where -shgrp stands for show (s) host group (hgrp). The related option -shgrpl, by contrast, lists the names of all host groups.

  • complex_values: A list of complex resource attributes a user might allocate for their jobs using the qsub -l option.

    E.g., the queue configuration value h_vmem is used for the virtual memory size, limiting the amount of total memory a job might consume. An entry in the complex_values list of the queue configuration defines the total available amount of virtual memory on a host or a queue.

  • slots: number of slots available on the host. They might be shared among all the queues that run on the host.
  • h_rt: specifies a requestable resource of type time. A submitted job is only eligible to run in this queue, if the specified maximal value of h_rt=192h is not exceeded.
  • user_lists: access lists of users that are eligible to place jobs in the queue.

Requestable resources

The type and amount of requestable resources differ from queue to queue. To get a feel for this, compare, e.g., the resources for mpc_std_shrt.q and mpc_std_long.q:

 
$ qconf -sq mpc_std_shrt.q | grep "qname\|hostlist\|complex_values\|h_rt"
qname                 mpc_std_shrt.q
hostlist              @mpcs
complex_values        h_vmem=23G,h_fsize=800G,cluster=hero
h_rt                  192:0:0

$ qconf -sq mpc_std_long.q | grep "qname\|hostlist\|complex_values\|h_rt"
qname                 mpc_std_long.q
hostlist              @mpcs
complex_values        h_vmem=23G,h_fsize=800G,cluster=hero,longrun=true
h_rt                  INFINITY
   

Note that both queues run on the same hosts, i.e. both have identical hostlists. However, the requestable resource h_rt and the list of complex values associated with the two queues differ. At this point, details on the resource h_rt can once more be obtained using the qconf command:

 
$ qconf -sc | grep "h_rt\|#"
#name                   shortcut       type        relop   requestable consumable default  urgency   
#----------------------------------------------------------------------------------------------------
h_rt                    h_rt           TIME        <=      YES         NO         0:0:0    0  

As can be seen, the relation operator associated with h_rt reads less than or equal. I.e., to be eligible for the short queue, a job is not allowed to request more than 192h of running time. Regarding the long queue, there is no upper bound on the running time, and a job with properly allocated resources might be put in this queue.
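
The comparison behind this eligibility rule can be written out as a short sketch. The helper names hrt_to_seconds and fits_queue are hypothetical. They only mimic the less-than-or-equal check on h_rt and ignore every other resource the scheduler takes into account (such as longrun or h_vmem).

```shell
# Convert a run time of the form H:M:S into seconds.
hrt_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Succeed (exit status 0) if the requested h_rt does not exceed the queue limit.
# $1: requested h_rt, $2: h_rt limit of the queue (H:M:S or INFINITY)
fits_queue() {
    [ "$2" = "INFINITY" ] && return 0    # no upper bound on the run time
    [ "$1" = "INFINITY" ] && return 1    # unbounded request, bounded queue
    [ "$(hrt_to_seconds "$1")" -le "$(hrt_to_seconds "$2")" ]
}

# Example: a 24h request fits the 192h limit of the short queue.
if fits_queue 24:0:0 192:0:0; then
    echo "request fits"
fi
```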

Further, note that the long queue features one complex value more than the short queue, namely longrun. Details about this resource are:

 
$ qconf -sc | grep "longrun\|#"
#name                   shortcut       type        relop   requestable consumable default  urgency   
#----------------------------------------------------------------------------------------------------
longrun                 lr             BOOL        ==      FORCED      NO         FALSE    0
  

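So, longrun is of type BOOL and has the default value FALSE. Since the resource is marked FORCED in the requestable column, a queue that lists it only accepts jobs that request it explicitly. Hence, in order to place a job in the long queue you have to request longrun=true, see here.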
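
A request for this resource might look as follows. This is a minimal sketch: the wrapper name submit_long and the script name my_job.sh are placeholders, and the only thing it does is add -l longrun=true to an ordinary qsub call.

```shell
# Submit a job script with the longrun resource requested explicitly.
# $1: path to the job script; any further arguments are passed on to qsub.
submit_long() {
    local script=$1
    shift
    qsub -l longrun=true "$@" "$script"
}

# Usage (my_job.sh is a placeholder name):
#   submit_long my_job.sh -l h_vmem=2G
```

The same request can be given directly on the command line, i.e. qsub -l longrun=true followed by the name of the job script.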

As a detail, consider the requestable resource h_vmem. Details about this resource are:

 
$ qconf -sc | grep "h_vmem\|#"
#name                   shortcut       type        relop   requestable consumable default  urgency   
#----------------------------------------------------------------------------------------------------
h_vmem                  h_vmem         MEMORY      <=      YES         YES        1200M    0
  

I.e. it is specified as being a consumable resource. Say, you submit a single slot job to the short queue (which, by default, offers 23G per host), requesting h_vmem=4G. Then, this amount of memory is consumed, leaving 19G for further usage.
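
The bookkeeping for such a consumable resource is plain arithmetic. As a rough sketch (the function name max_jobs_on_host is made up, and memory is given in whole gigabytes for simplicity), the number of identical single-slot jobs that fit on one host is limited by whichever runs out first, memory or slots:

```shell
# Estimate how many identical single-slot jobs fit on one host.
# $1: memory offered per host in G, $2: h_vmem requested per job in G,
# $3: number of slots on the host
max_jobs_on_host() {
    local by_mem=$(( $1 / $2 ))    # integer division: jobs that fit memory-wise
    if [ "$by_mem" -lt "$3" ]; then
        echo "$by_mem"             # memory runs out first
    else
        echo "$3"                  # slots run out first
    fi
}

max_jobs_on_host 23 4 12    # prints 5
```

With 23G per host and 4G per job, memory is exhausted after 5 jobs (20G consumed, 3G left), although the host offers 12 slots.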

Listing jobs for a particular queue

To list how many jobs are running via a particular queue on the hosts within the associated hostlist, you can use the command qstat. E.g., to list the jobs that run via the queue mpc_std_shrt.q on the hosts in its hostlist (i.e. the hosts contained in @mpcs), type:

 
$ qstat -f -l qname=mpc_std_shrt.q 

queuename              qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
mpc_std_shrt.q@mpcs001 BP    0/9/12         9.27     lx26-amd64    
---------------------------------------------------------------------------------
mpc_std_shrt.q@mpcs002 BP    0/9/12         9.11     lx26-amd64    
---------------------------------------------------------------------------------
mpc_std_shrt.q@mpcs003 BP    0/8/12         12.19    lx26-amd64    
...
mpc_std_shrt.q@mpcs017 BP    0/6/12         6.31     lx26-amd64    
 873601 0.50500 ksp_L1024  alxo9476     r     08/25/2013 01:37:51     1 31
...
  

Information on how many jobs are running via the specified queue is given by the three-tuple of numbers in the third column of the list. These specify the number of reserved/used/total slots on the respective host. E.g., as can be seen from the first entry in the list, the host mpcs001 has 9 out of 12 possible slots occupied with jobs supplied via the queue mpc_std_shrt.q. In principle there are 3 slots left which might be occupied by jobs from other queues that run on that host (if, e.g., enough memory resources are available to do so). As a detail, in order to list the overall number of jobs on a particular host you might use qstat in conjunction with the hostname keyword to filter for that host. E.g., to see in detail what is going on at host mpcs001 you might type

 
$ qstat -f -l hostname=mpcs001

queuename              qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
mpc_std_shrt.q@mpcs001 BP    0/9/12         10.21    lx26-amd64    
---------------------------------------------------------------------------------
mpc_std_long.q@mpcs001 BP    0/0/12         10.21    lx26-amd64      
  

Apparently, instances of the queues mpc_std_shrt.q and mpc_std_long.q are running on that host (but we already knew this, since both queues have identical hostlists). However, only 9 out of 12 slots are occupied. In principle the scheduler follows a fill-up rule, wherein jobs are assigned to a host until it is filled up before the next host is considered. Yet, according to the listing for mpc_std_shrt.q above, host mpcs002 already has 9 slots filled as well. How can this be? There are many possible reasons. In 90 percent of the cases the reason is that, although host mpcs001 offers further slots, it cannot offer further memory for a job. You can check that this is also the case here by monitoring the current value of the consumable resource h_vmem for that host. To do so, type:

 
$ qstat -F -l hostname=mpcs001 | grep "qname\|h_vmem"

	hc:h_vmem=360.000M
	qf:qname=mpc_std_shrt.q
	hc:h_vmem=360.000M
	qf:qname=mpc_std_long.q 
  

This shows that for both queue instances only 360M of h_vmem is still available on the host. Usually there is no job requesting less than that amount of memory (the default request alone is h_vmem=1200M)!
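
The slot counts in the qstat -f listings above can also be tallied per host with a small filter. The following is only a sketch: the function name slots_per_host is hypothetical, and it assumes that all queue instances on a host share the same pool of slots, so that the used slots are summed over the instances while the total is taken once.

```shell
# Read "qstat -f" output on stdin and print "<host> <used>/<total>" per host.
slots_per_host() {
    awk '
        # Queue-instance lines look like: mpc_std_shrt.q@mpcs001 BP 0/9/12 ...
        $1 ~ /@/ && $3 ~ /^[0-9]+\/[0-9]+\/[0-9]+$/ {
            split($1, name, "@")    # name[2] holds the host name
            split($3, s, "/")       # s[1]=reserved, s[2]=used, s[3]=total
            used[name[2]] += s[2]
            if ((s[3] + 0) > (total[name[2]] + 0)) total[name[2]] = s[3]
        }
        END {
            for (h in used) printf "%s %d/%d\n", h, used[h], total[h]
        }
    ' | sort
}

# Usage:
#   qstat -f -l hostname=mpcs001 | slots_per_host
```

Applied to the listing for host mpcs001 shown above, this reports 9 used out of 12 slots. Whether the remaining slots are actually usable still depends on the free h_vmem of the host.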

Long and short queues

Addressing particular hardware components/queues