Memory Overestimation

From HPC users
Revision as of 16:52, 1 August 2013 by Melchert (talk | contribs)
Jump to navigationJump to search

One of the most important consumable resources that needs to be allocated for a job is its memory. As pointed out in the [SGE_Job_Management_(Queueing)_System#Memory|overview], the memory, meaning physical plus virtual memory, is addressed by the h_vmem option. The h_vmem attribute refers to the memory per job slot, i.e. it gets multiplied by the number of slots when submitting a parallel job. Different types of nodes offer different amounts of total memory for jobs to run. E.g., a standard node on HERO (overall number: 130) offers 23GB and a big node on HERO (overall number: 20) offers 46GB.

The operation efficiency of the HPC system, in particuar the workload of the HERO component (following a fill-up allocation rule for the jobs), is directly linked to the allocation of memory for the jobs to be run. For the Cluster to operate as efficient as possible, a proper memory allocation by the users, i.e. as few memory overestimation per job as possible, is inevitable.

Consider the following extreme example: a user specifies the amount of h_vmem=1.7G for a 12-slot job. I.e., the job allocates an overall amount of 20.4G. However, upon execution the job only used an overall amount of approximately 2G. The parallel environment for this actual example was smp, so the job ran on a single execution host and the parallel environment memory issue is no issue here. Albeit this is a rather severe example of memory overestimation (about 18G overall, or, 1.5 per used slot) this job does not block other jobs from running on that particular execution host, since it uses the full amount of available slots (i.e. 12 slots).

However, note that there are also other examples that do have an impact on ohter users: a user specifies the amount of h_vmem=6G for a single-slot job, which turned out to have a peak memory usage of 36M, only. Due due memory restrictions, four such jobs cannot run on a single host. However, by means of three such jobs one can block lots of resources, leaving 5G for the remaining 9 slots. Given the fact that, here, a typical job uses 2-3G, this allows for only two further jobs. In this case, the memory dissipation on that host amounts to 17G (in the most optimistic case).

Also, note that the above examples have to be taken with a grain of salt. In general there is a difference betwenn the peak memory usage of a job and its current memory usage. It might very well be that a job proceeds in two parts, (i) a part where it needs only few memory, and, (ii) a part where it requires a lot of memory. However, such a two step scenario seems to occur only rarely.

In particular in late Mai and early June of 2013 there was a phase where the cluster was used exhaustively and where the memory overestimation by the individual users was a severe issue. Consequently, to avoid such situations in the future, it seems to be necessary to point out the benefit of proper memory allocation from time to time.


The two plots below summarize some data that quantify the memory usage for the 130 standard nodes on HERO. The data was collected since 18 June 2013, where, once per working day, the memory usage of the nodes was monitored.

Memory overestimation per host

The figure below illustrates a scatterplot of the memory overestimation (measured once per day since 18 June 2013) for the 130 standard node on HERO (i.e. the nodes mpcs001 through mpcs130). Ordinate and abscissa refer to the used memory and allocated memory , respectively. Regarding the main plot, the solid diagonal line indicates the optimal operating mode where . However, as can be seen from the figure the actual usage is far below that. The bulk of data point suggests that most of the time are allocated per execution host. As can be seen from the inset, the actual memory usage per host is strongly peaked in the region . For the region this indicates a memory overestimation of approximately . Further, the dashed line indicates . As can be seen, there seems to be no host (since 18 June 2013) on which significantly more than 80% of its allocated memory are actually used.

MemDissipation raw mem.png

Memory overestimation per used slot

The memory overestimation per used slot (for the same dataset) is shown below. The figure illustrates the amount of used memory per slot as function of the allocated memory per slot on the standard nodes of HERO. The inset shows the data for the full range of observed and the main plot shows a zoom in on the range (measured in ). As evident from the figure, most jobs allocate less than per slot and in the range the approximate upper bound seems fitting. Apparently, the bulk of data seems to be located in the region in between the lines and (i.e. the shaded region), indicating that, typically, the required memory per slot is overestimated by . However, be aware that we are illustrating averages per host and per slot, here. For a given host there might be individual jobs that do much better (i.e. where it holds that ) that what is shown here.

MemDissipation raw memPerSlot.png

Average memory overestimation

To summarize the statistics shown above, we might consider the differences between the allocated and used memory. In this regard, the figure below shows the differences (top) and (bottom), averaged for the 130 standard nodes of HERE, over time (since 18 June 2013). The error bars indicate the respective standard deviation. From the top figure it can be seen that this average memory difference approximately falls into the range . From the figure at the bottom it is evident that the average memory difference falls within the range . These approximate values (stemming from a simple per-day average) are in good agreement with the above results estimated from the scatter plots.

MemDissipation av.png

Long term goal

Regarding the operation efficiency of HERO and the optimization thereof, it would be desirable to reduce the memory overestimation per used slot, i.e. the quantity to a value (permanently) close to or slightly below and to reduce the related standard deviation. This would mean that on average, still an overestimation of per slot is accepted.