Welcome to the HPC User Wiki of the University of Oldenburg

Note: This is a first, preliminary version (v0.01) of the HPC User Wiki. Its primary purpose is to get you started with our new clusters (FLOW and HERO), enabling you to familiarize yourself with these systems and gather some experience. More elaborate, improved versions will follow, so you may want to check these pages regularly.

The Wiki pages will be restructured within the next few days! Contents of the main page will be moved to subpages to make the Wiki more readable.

Introduction

Presently, the central HPC facilities of the University of Oldenburg comprise three systems:

  • FLOW (Facility for Large-Scale COmputations in Wind Energy Research)
    IBM iDataPlex cluster solution, 2232 CPU cores, 6 TB of (distributed) main memory, QDR InfiniBand interconnect.
    Theoretical peak performance: 24 TFlop/s.
  • HERO (High-End Computing Resource Oldenburg)
    Hybrid system composed of two components:
    • IBM iDataPlex cluster solution, 1800 CPU cores, 4 TB of (distributed) main memory, Gigabit Ethernet interconnect.
      Theoretical peak performance: 19.2 TFlop/s.
    • SGI Altix UltraViolet shared-memory system ("SMP component"), 120 CPU cores, 640 GB of globally addressable memory, NumaLink5 interconnect
      Theoretical peak performance: 1.3 TFlop/s.
  • GOLEM: older, AMD Opteron-based cluster with 390 cores and 800 GB of (distributed) main memory.
    Theoretical peak performance: 1.6 TFlop/s.

FLOW and HERO use a common, shared storage system (high-performance NAS Cluster) with a net capacity of 130 TB.

FLOW is used for computationally demanding CFD calculations in wind energy research, conducted by the Research Group TWiST (Turbulence, Wind Energy, and Stochastics) and the ForWind Center for Wind Energy Research. It is, to the best of our knowledge, the largest system in Europe dedicated solely to that purpose.

The main application areas of the HERO cluster are Quantum Chemistry, Theoretical Physics, the Neurosciences, and Audiology. Besides that, the system is used by many other research groups of the Faculty of Mathematics and Science and the Department of Informatics of the School of Computing Science, Business Administration, Economics, and Law.

Hardware Overview

FLOW

  • 122 "low-memory" compute nodes: IBM dx360 M3, dual socket (Westmere-EP, 6C, 2.66 GHz), 12 cores per server, 24 GB DDR3 RAM, diskless (host names cfdl001..cfdl122).
  • 64 "high-memory" compute nodes: IBM dx360 M3, dual socket (Westmere-EP, 6C, 2.66 GHz), 12 cores per server, 48 GB DDR3 RAM, diskless (host names cfdh001..cfdh064).
  • QDR InfiniBand interconnect (fully non-blocking), 198-port Mellanox IS5200 IB switch (can be extended up to 216 ports).
  • Gigabit Ethernet for File-I/O etc.
  • 10/100 Mb/s Ethernet for management and administrative tasks (IPMI).

HERO

  • 130 "standard" compute nodes: IBM dx360 M3, dual socket (Westmere-EP, 6C, 2.66 GHz), 12 cores per server, 24 GB DDR3 RAM, 1 TB SATAII disk (host names mpcs001..mpcs130).
  • 20 "big" compute nodes: IBM dx360 M3, dual socket (Westmere-EP, 6C, 2.66 GHz), 12 cores per server, 48 GB DDR3 RAM, RAID 8 x 300 GB 15k SAS (host names mpcb001..mpcb020).
  • Gigabit Ethernet II for communication of parallel jobs (MPI, LINDA, ...).
  • Second, independent Gigabit Ethernet for File-I/O etc.
  • 10/100 Mb/s Ethernet for management and administrative tasks (IPMI).
  • SGI Altix UV 100 shared-memory system, 10 CPUs (Nehalem-EX, "Beckton", 6C, 2.66 GHz), 120 cores in total, 640 GB DDR3 RAM, NumaLink5 interconnect, RAID 20 x 600 GB SAS 15k rpm (host uv100).

The 1 Gb/s leaf switches have uplinks to a 10 Gb/s backbone (two switches, redundant). The central management interface of both clusters runs on two master nodes (IBM x3550 M3) in an HA setup. Each cluster has two login nodes (IBM x3550 M3).

Operating system: Scientific Linux 5.5

Cluster management software: Bright Cluster Manager 5.1 by ClusterVision B.V.

Basic Usage

Logging in to the system

From within the University (intranet)

Within the internal network of the University, access to the systems is granted via ssh. Use your favorite ssh client like OpenSSH, PuTTY, etc. For example, on a UNIX/Linux system, users of FLOW may type on the command line (replace "abcd1234" by your own account):

ssh abcd1234@flow.hpc.uni-oldenburg.de

Similarly, users of HERO login by typing:

ssh abcd1234@hero.hpc.uni-oldenburg.de

Use "ssh -X" for X11 forwarding (i.e., if you need to export the graphical display to your local system).
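
For example, to log in to HERO with X11 forwarding enabled (replace the account name with your own):

ssh -X abcd1234@hero.hpc.uni-oldenburg.de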

For security reasons, access to the HPC systems is denied from certain subnets. In particular, you cannot log in from the WLAN of the University (uniolwlan) or from "public" PCs (located, e.g., in libraries, PC rooms, or other public places).

From outside the University (internet)

First, you have to establish a VPN tunnel to the University intranet. After that, you can log in to HERO or FLOW via ssh as described above. The connection data for the tunnel are:

Gateway       : vpn2.uni-oldenburg.de
Group name    : hpc-vpn
Group password: hqc-vqn

Cf. the instructions of the IT Services on how to configure the Cisco VPN client. A separate VPN tunnel has been set up for the HPC systems, which is accessible only to users of FLOW and HERO. You therefore have to configure a new VPN connection and enter the data provided above. For security reasons, you cannot log in to FLOW or HERO if you are connected to the intranet via the "generic" VPN tunnel of the University.

User Environment

We use the modules environment, which has many advantages: it is flexible and user-friendly, and it even allows different versions of the same software to be used concurrently on the same system. You can see a list of all available modules by typing

module avail

To load a given module:

module load <name of the module>

The modules system uses a hierarchical file structure, i.e., sometimes (whenever there are ambiguities) you may have to specify a path, as in:

module load fftw2/gcc/64/double

To revert all changes made by a given module (environment variables, paths, etc.):

module unload <name of the module>

To see which modules are already loaded:

module list
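
Putting these commands together, a typical session might look as follows (the module name is the FFTW example from above; adapt it to the software you actually need):

 module avail                      # list all available modules
 module load fftw2/gcc/64/double   # load the desired module
 module list                       # verify which modules are active
 module unload fftw2/gcc/64/double # revert the changes when the module is no longer needed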

Compiling and linking

This section will be elaborated later and then provide much more detailed information. For the time being, we only give a very brief overview.

The following compilers and MPI libraries are currently available:

  • GCC, the GNU Compiler Collection: gcc Version 4.3.4
    module load gcc
    This module is loaded by default when you log in to the system. Supported MPI libraries: OpenMPI, MPICH, MPICH2, MVAPICH, and MVAPICH2
  • Intel Cluster Studio 2011, formerly known as Intel Cluster Toolkit Compiler Edition (contains the Math Kernel Library and other performance libraries, analyzer, and HPC tools):
    module load intel/ics
    The environment for the Intel MPI library must be loaded separately:
    module load intel/impi
    The Fortran compiler is invoked by ifort, and the C/C++ compiler by icc. However, if one wants to build MPI applications, one should generally use the wrapper scripts mpif77, mpif90, mpicc, ...
  • PGI Cluster Development Kit, Version 11.3: contains a suite of Fortran and C/C++ compilers as well as various other tools (MPI debugger etc.):
    module load pgi
    The compilers are invoked by pgf77, pgf95, ... for Fortran and pgcc, pgcpp, ... for C/C++, respectively. Again, wrapper scripts exist for building MPI applications (see the example below).
    Supported MPI libraries: MPICH, MPICH2, and MVAPICH.

(At the moment, MPICH and MPICH2 have problems running under the queueing system and thus their use is not recommended, but that problem will be fixed soon.)

It is planned to extend MPI support for the various compilers. In particular, OpenMPI will soon be available for the Intel compiler, too.
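
As a minimal illustration of the compile-and-link workflow described above (source and program names are purely illustrative, and which backend compiler a wrapper invokes may depend on the local configuration), an MPI program could be built with the Intel toolchain as follows:

 module load intel/ics                  # Intel compilers and performance libraries
 module load intel/impi                 # Intel MPI library and wrapper scripts

 mpicc -O2 -o myprog_intelmpi myprog.c  # hypothetical C source; use mpif90 for Fortran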

Documentation

PGI User's Guide (PDF)

Job Management (Queueing) System

The queueing system employed to manage user jobs for FLOW and HERO is Sun Grid Engine (SGE). For first-time users (especially those acquainted with PBS-based systems), some features of SGE may seem a little unusual and certainly need some getting-accustomed-to. In order to efficiently use the available hardware resources (so that all users may benefit the most from the system), a basic understanding of how SGE works is indispensable. Some of the points to keep in mind are the following:

  • Unlike other (e.g., PBS-based) queueing systems, SGE does not "know" the concept of "nodes" with a fixed number of CPUs (cores) and users specifying the number of nodes they need, along with the number of CPUs per node, in their job requirements. Instead, SGE logically divides the cluster into slots, where, roughly speaking, each "slot" may be thought of as a single CPU core (although there are notable exceptions to this rule, see the parallel environment linda below). The scheduler assigns free slots to pending jobs. Since in the multi-core era each host offers many slots, this will, in general, lead to jobs of different users running concurrently on the same host (provided that there are sufficient resources like memory, disk space etc. to meet all requirements of all jobs, as specified by the users who submitted them) and usually guarantees efficient resource utilization.
  • While the scheduling behavior described above may be very efficient in optimally using the available hardware resources, it will have undesirable effects on parallel (MPI, LINDA, ...) jobs. E.g., an MPI job requesting 24 slots could end up running 3 tasks on one host, 12 tasks on another host (fully occupying this host, if it is a server with 2 six-core CPUs, as happens with our clusters), and 9 tasks on a third host. Clearly, such an unbalanced configuration may lead to problems. For certain jobs (multithreaded applications) it is even mandatory that all slots reside on one host (typical examples: OpenMP programs, Gaussian single-node jobs).
    To deal with the specific demands of parallel jobs, SGE offers so-called parallel environments (PEs) which are largely configurable. Even if your job does not need several hosts, but runs on only one host using several or all cores of that host, you must specify a parallel environment. It is of crucial importance to choose the "correct" parallel environment (meeting the requirements of your application/program) when submitting a parallel job.
  • Another "peculiarity" of SGE (as compared to its cousins) is the concept of cluster queues and queue instances. Cluster queues are composed of several (typically, many) queue instances, with each instance associated with one particular host. A cluster queue may have a name like, e.g., standardqueue.q, where the .q suffix is a commonly followed convention. The queue instances of this queue then have names like, e.g., standardqueue.q@host001, standardqueue.q@host002, ... (note the "@" which acts as a delimiter between the queue name and the queue instance).
    In general, each host will hold several queue instances belonging to different cluster queues. E.g., there may be a special queue for long-running jobs and a queue for shorter jobs, both of which share the same "physical" machines but have different policies. To avoid oversubscription, resource limits can be configured for individual hosts. Since resource limits and other, more complex attributes can also be associated with cluster queues and even queue instances, the system is highly flexible and can be customized for specific needs. On the other hand, the configuration quickly tends to get rather complex, leading to unexpected side effects. E.g., PEs grab slots from all queue instances of all cluster queues they are associated with. Thus, a parallel job may occupy slots on one particular host belonging to different queue instances on that host. While this is usually no problem for the parallel job itself, it blocks resources in both cluster queues, which may be unintended. For that reason, it is common practice to associate each PE with one and only one cluster queue and to define several (possibly identically configured) PEs in order to avoid that a single PE spans several cluster queues.

Submitting jobs

Sample job submission scripts for both serial and parallel jobs are provided in the subdirectory Examples of your home directory. You may have to adapt these scripts as needed. Note that a job submission script consists of two principal parts:

  • SGE directives (lines starting with the "magic" characters #$), which fall into three categories:
    • general options (which shell to use, name of the job, name of output and error files if differing from default, etc.). The directives are passed to the qsub command when the job is submitted.
    • Resource requirements (introduced by the -l flag), like memory, disk space, runtime (wallclock) limit, etc.
    • Options for parallel jobs (parallel environment, number of job slots, etc.)
  • Commands to be executed by the job (your program, script, etc.), including the necessary set-up of the environment for the application/program to run correctly (loading of modules so that your programs find the required runtime libraries, etc.).

The job is submitted by the qsub command, e.g. (assuming your submission script is named myprog.sge):

qsub myprog.sge
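
To illustrate the two parts described above, here is a minimal sketch of a serial submission script (job name, resource values, and program name are purely illustrative):

 #!/bin/bash
 #$ -S /bin/bash              # shell to be used for the job
 #$ -N myjob                  # name of the job (illustrative)
 #$ -cwd                      # run the job in the current working directory
 #$ -l h_rt=2:0:0             # wallclock limit: 2 hours
 #$ -l h_vmem=1G              # memory per slot

 module load gcc              # set up the environment for the program
 ./myprog                     # hypothetical executable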

Specifying job requirements

The general philosophy behind SGE is that you should not submit your job to a specific queue or queue instance (although this is possible in principle), but rather define your requirements, and then let SGE decide which queue matches them best (taking into account the current load of the system and other factors). For this "automatic" queue selection to work efficiently and in order to avoid wasting valuable resources (e.g., requesting much more memory than your job needs, which may prevent the scheduling of jobs of other users), it is important that you give a complete and precise specification of your job requirements in your submission script. The following points are relevant to both serial and parallel jobs.

Runtime

Maximum (wallclock) runtime is specified by h_rt=<hh:mm:ss>. E.g., a maximum runtime of three days is requested by:

#$ -l h_rt=72:0:0

The default runtime of a job is 0:0:0. You should therefore always specify a runtime, unless your job is very short.

All cluster queues except the "long" queues have a maximum allowed runtime of 8 days. It is highly recommended to specify the runtime of your job as realistically as possible (leaving, of course, a margin of error). If the scheduler knows that, e.g., a pending job is a "fast run" which needs only a few hours of walltime, it is likely that it will start executing much earlier than other jobs with more extensive walltime requirements (so-called backfilling).

If your job needs more than 8 days of walltime, your submission script must contain the following line:

#$ -l longrun=true

It is then automatically directed to one of the "long" queues, which have no runtime limit. The number of long-running jobs per user is limited.
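
For example, a (hypothetical) job needing ten days of walltime would combine both directives:

 #$ -l h_rt=240:0:0
 #$ -l longrun=true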

Memory

Maximum memory (physical + virtual) usage of a job is defined by the h_vmem attribute, as in

#$ -l h_vmem=4G

for a job requesting 4 GB of total memory. If your job exceeds the specified memory limit, it gets killed automatically. The default value for h_vmem is 500 MB.

Important: The h_vmem attribute refers to the memory per job slot, i.e., it gets multiplied by the number of slots for a parallel job (e.g., a parallel job with 12 slots and h_vmem=2G may use up to 24 GB in total).

Total memory available for jobs on each compute node:

  • standard compute nodes of HERO (mpcs001..mpcs130): 23 GB
  • big nodes of HERO (mpcb001..mpcb020): 46 GB
  • low-memory nodes of FLOW (cfdl001..cfdl122): 22 GB
  • high-memory nodes of FLOW (cfdh001..cfdh064): 46 GB

If your job needs one (or several) of the "big nodes" of HERO (mpcb001..mpcb020), you must specify your memory requirement and set the Boolean attribute bignode to true. Example: A job in the parallel environment "smp" (see below) requests 12 slots and 3 GB per slot (i.e., h_vmem=3G). This job needs 36 GB of memory on a single node in total, and thus can only run on one of the big nodes. The corresponding section of your submission script will then read:

#$ -l h_vmem=3G
#$ -l bignode=true

Similarly, to request one of the high-memory nodes of FLOW (cfdh001..cfdh064), you need to set the attribute highmem to true. Example: for an MPI job with 12 tasks per node and a memory requirement of 3 GB for each task, you would specify:

#$ -l h_vmem=3G
#$ -l highmem=true

Local disk space (HERO only)

Local scratch space is only available on the HERO cluster, since the compute nodes of FLOW are diskless. For requesting, e.g., 200 GB of scratch space, the SGE directive reads:

#$ -l h_fsize=200G

The default value is h_fsize=100M.

The path to the local scratch directory can be accessed in your job script (or other scripts/programs invoked by your job) via the $TMPDIR environment variable. After termination of your job (or if you kill your job manually by qdel), the scratch directory is automatically purged.
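
A typical usage pattern (directory and file names are purely illustrative) is to stage input data to the local scratch directory, run the program there, and copy the results back before the job finishes:

 cp $HOME/mydata/input.dat $TMPDIR/    # stage input to node-local scratch
 cd $TMPDIR
 ./myprog input.dat > output.dat       # hypothetical program
 cp output.dat $HOME/mydata/           # copy results back before the job ends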

Total amount of scratch space available on each compute node:

  • standard nodes (mpcs001..mpcs130): 800 GB
  • big nodes (mpcb001..mpcb020): 2100 GB

If your job needs more than 800 GB of scratch space, you must request one of the big nodes. Example:

 #$ -l h_fsize=1400G
 #$ -l bignode=true

Parallel environments (PEs)

Example: If you have an MPI program compiled and linked with the Intel compiler and MPI library, your job submission script might look as follows:

 #$ -pe intelmpi 96  
 #$ -R y

 module load intel/impi
 mpiexec -machinefile $TMPDIR/machines -n $NSLOTS -env I_MPI_FABRICS shm:ofa ./myprog_intelmpi

The job requests 96 slots. The allocation rule of this PE is "fill up", i.e., SGE tries to place the MPI tasks on as few hosts as possible (in the "ideal" case, the program would run on exactly 8 hosts with 12 slots on each host, but there is no guarantee that this is going to happen). In this example, the MPI job uses the InfiniBand fabric for communication (selected via the I_MPI_FABRICS variable). Turning on resource reservation (-R y) is highly recommended in order to avoid starvation of parallel jobs by serial jobs which "block" required slots on specific hosts.

Please have a look at the directory named Examples in your home directory, which contains further examples of how to submit parallel (MPI) jobs.

List of all currently available PEs:

  • intelmpi for using the Intel MPI Library, see above.
  • openmpi for using the OpenMPI library (so far only supported with the gcc compiler)
  • mvapich for MVAPICH library (i.e., InfiniBand interconnects)
  • smp: this PE demands that all requested slots be on the same host (needed for multithreaded applications, like Gaussian single-node jobs, OpenMP, etc.)
  • linda: special PE for Gaussian Linda jobs, see below.
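
For the smp PE, a minimal sketch of a multithreaded (e.g., OpenMP) job requesting all 12 cores of a single node might look as follows (resource values and program name are illustrative):

 #$ -pe smp 12
 #$ -l h_rt=24:0:0
 #$ -l h_vmem=2G                  # per slot, i.e., 24 GB in total

 export OMP_NUM_THREADS=$NSLOTS   # use as many threads as slots granted
 ./my_openmp_prog                 # hypothetical multithreaded executable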

If your job is to run in one of the "long" queues (i.e., requesting more than 8 days of walltime), you must use the corresponding "long" version of the PE: intelmpi_long, openmpi_long, etc.

Note that the above list will grow over time. E.g., it is planned to support OpenMPI with the Intel Compiler (not only the gcc compiler, as is now the case).

... tbc ...


Interactive jobs

Interactive jobs are only allowed for members of certain groups from the Institute of Psychology who have special data pre-processing needs which require manual intervention and cannot be automated (automation being the prerequisite for writing a batch job script).

Users who are entitled to submit interactive jobs type

qlogin -l xtr=true

on the command line ("xtr" means "extra queue"). After that, a graphical Matlab session can be started by issuing the following two commands:

module load matlab
matlab &

(Sending the Matlab process to the background gives you control over the shell, which may be useful.) If you do not specify any memory requirements, your interactive job will be limited to using at most 500 MB. If you need more (e.g., 2 GB), you have to request the memory explicitly, as in:

qlogin -l xtr=true -l h_vmem=2G

Note that the syntax is the same as for requesting resources in a job submission script (a resource request starts with the "-l" flag).

Monitoring and managing your jobs

A selection of the most frequently used commands for job monitoring and management:

  • qstat: display all (pending, running, ...) jobs of the user (the output is empty if the user has no jobs in the system).
  • qstat -j <jobid>: get a more verbose output, which is particularly useful when analyzing why your job won't run.
  • qdel <jobid>: kill job with specified ID (users can, of course, only kill their own jobs).
  • qalter: Modify a pending or running job.
  • qhost: display state of all hosts.

Note that there is also a GUI to SGE, invoked by the command qmon.

... tbc ...

Array jobs

Array jobs are a very efficient way of managing your jobs under certain circumstances (e.g., if you have to run the same program many times on different data sets, with different initial conditions, etc.). Please see the corresponding section in the official documentation of Sun Grid Engine.
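
A minimal sketch of an array job (task range, resource values, and file names are purely illustrative); each task can access its own index via the $SGE_TASK_ID environment variable:

 #$ -t 1-100                        # run 100 tasks, numbered 1..100
 #$ -l h_rt=1:0:0
 #$ -l h_vmem=1G

 ./myprog input_$SGE_TASK_ID.dat    # hypothetical program and per-task input files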

...tbc...

Documentation

  • Grid Engine User Guide
    Note that the above on-line documentation refers to a slightly newer version than installed on our HPC systems (6.2u7 vs. 6.2u5). In practice, that should not make much of a difference, though. Unfortunately, all of the original documentation has disappeared from the Web since the acquisition of SUN by Oracle, and it has since become difficult to get useful on-line documentation for "older" versions of SGE.
  • The following PDFs contain the original documentation of SGE 6.2u5 (converted from the webpages):


Using the Altix UV 100 system

The SGI system is used for very specific applications (in need of a large and highly performant shared-memory system) and can presently only be accessed by the Theoretical Chemistry group. Entitled users may log in to the system via ssh (the same rules as for the login nodes of the main systems apply, i.e., access is only possible from within the intranet of the University; otherwise you have to establish a VPN tunnel):

abcd1234@uv100.hpc.uni-oldenburg.de

The Altix UV system runs the RHEL 6.0 operating system. As on the IBM clusters, the modules environment is used.

Compiling and linking applications

It is strongly recommended to use MPT (Message Passing Toolkit), SGI's own implementation of the MPI standard. Only then can the highly specialized hardware architecture of the system be fully exploited. The MPT module must be loaded both for compiling and (in general) at runtime in order for your application to find the dynamically linked libraries:

module load mpt

Note that MPT is not a compiler. SGI does not provide its own compilers for x86-64 based systems. One may, e.g., use the Intel compiler:

module load intel/ics

Basically, you can use the compiler the same way you are accustomed to, and link against the MPT library by adding the flag -lmpi. See the documentation provided below, which also explains how to run MPI programs.
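
A minimal sketch of compiling and running an MPI program against MPT (source and program names are illustrative; please check the MPT documentation for the exact launch syntax):

 module load mpt                           # MPT headers and libraries
 module load intel/ics                     # Intel compilers

 ifort -O2 -o myprog_mpt myprog.f90 -lmpi  # link against the MPT MPI library
 mpirun -np 32 ./myprog_mpt                # illustrative launch with 32 MPI processes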

Submitting SGE jobs

Only entitled users can submit jobs to the SGI Altix UV system (the jobs of other users would never start). For a job that is to execute on the UV100 system, your SGE job submission script must contain the line

#$ -l uv100=true

From SGE's point of view, the system is treated as a single, big SMP node with 120 cores, 630 GB of main memory available for running jobs, and 11 TB of scratch disk space. So far only the parallel environment smp is configured on the system.
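
A minimal sketch of a corresponding submission script (slot count, resource values, and program name are purely illustrative):

 #$ -l uv100=true                 # run on the Altix UV 100
 #$ -pe smp 24                    # all slots reside on the single UV node
 #$ -l h_vmem=4G                  # per slot, i.e., 96 GB in total here
 #$ -l h_rt=24:0:0

 module load mpt
 mpirun -np $NSLOTS ./myprog_mpt  # hypothetical MPT binary, see the example above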

Documentation

(Both of the above are also available as PDF download.)


Application Software and Libraries

  • Compiler and Development Tools
  • Quantum Chemistry
  • Computational Fluid Dynamics
  • Mathematics
  • Visualisation
  • Libraries

Compiler and Development Tools


Quantum Chemistry


CFD (Computational Fluid Dynamics)

  • PALM -- Large-eddy simulation (LES) model for atmospheric and oceanic flow
  • CFD-Suite STAR-CCM++
  • Weather Research and Forecasting Model WRF/WPS
  • OpenFOAM
  • Spectral Element Code NEK5


Mathematics


Visualisation


Libraries


Advanced Usage

Here you will find, among other things, hints on how to analyse and optimize your programs using HPC tools (profilers, debuggers, performance libraries), and other useful information.

... tbc ...