Difference between revisions of "Welcome to the HPC User Wiki of the University of Oldenburg"

Revision as of 04:35, 20 April 2011

Note: This is a first, preliminary version (v0.01) of the HPC User Wiki. Its primary purpose is to get you started with our new clusters (FLOW and HERO), enabling you to familiarize yourself with these systems and gather some experience. More elaborate, updated versions will follow, so you may want to check these pages regularly.


Introduction

Presently, the central HPC facilities of the University of Oldenburg comprise three systems:

  • FLOW (Facility for Large-Scale COmputations in Wind Energy Research)
    IBM iDataPlex cluster solution, 2232 CPU cores, 6 TB of (distributed) main memory, QDR InfiniBand interconnect (theoretical peak performance: 24 TFlop/s).
  • HERO (High-End Computing Resource Oldenburg)
    Hybrid system composed of two components:
    • IBM iDataPlex cluster solution, 1800 CPU cores, 4 TB of (distributed) main memory, Gigabit Ethernet interconnect (theoretical peak performance: 19.2 TFlop/s),
    • SGI Altix UltraViolet shared-memory system ("SMP" component), 120 CPU cores, 640 GB of globally addressable memory, NumaLink5 interconnect (theoretical peak performance: 1.3 TFlop/s).
  • GOLEM: older, AMD Opteron-based cluster with 390 cores and 800 GB of (distributed) main memory (theoretical peak performance: 1.6 TFlop/s).

FLOW and HERO use a common, shared storage system (high-performance NAS Cluster) with a net capacity of 130 TB.

FLOW is employed for computationally demanding CFD calculations in wind energy research, conducted by the Research Group TWiST (Turbulence, Wind Energy, and Stochastics) and the ForWind Center for Wind Energy Research. It is, to the best of our knowledge, the largest system in Europe dedicated solely to that purpose.

The main application areas of the HERO cluster are Quantum Chemistry, Theoretical Physics, and the Neurosciences and Audiology. Besides that, the system is used by many other research groups of the Faculty of Mathematics and Science and the Department of Informatics of the School of Computing Science, Business Administration, Economics, and Law.

Hardware Overview

(Westmere-EP, 2.66 GHz)

(Nehalem-EX, "Beckton")

Basic Usage

Logging in to the system

From within the University (intranet)

Within the internal net of the University, access to the systems is granted via ssh. Use your favorite ssh client like OpenSSH, PuTTY, etc. For example, on a UNIX/Linux system, users of FLOW may type on the command line (replace "abcd1234" by your own account):

ssh abcd1234@flow.hpc.uni-oldenburg.de

Similarly, users of HERO log in by typing:

ssh abcd1234@hero.hpc.uni-oldenburg.de

Use "ssh -X" for X11 forwarding (i.e., if you need to export the graphical display to your local system).
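The login commands above can be wrapped in a small shell helper. The following is only a sketch: the function name hpc_ssh_command is hypothetical (it is not part of the cluster software), and it merely prints the command instead of opening a connection, so it can be tried safely.

```shell
#!/bin/bash
# Hypothetical helper: builds the ssh command for FLOW or HERO.
# It only prints the command; it does not connect.
hpc_ssh_command() {
    local cluster="$1"      # "flow" or "hero"
    local account="$2"      # your own account, e.g. abcd1234
    local x11="${3:-}"      # pass "x11" to request X11 forwarding

    case "$cluster" in
        flow|hero) ;;
        *) echo "unknown cluster: $cluster" >&2; return 1 ;;
    esac

    if [ "$x11" = "x11" ]; then
        echo "ssh -X ${account}@${cluster}.hpc.uni-oldenburg.de"
    else
        echo "ssh ${account}@${cluster}.hpc.uni-oldenburg.de"
    fi
}

hpc_ssh_command flow abcd1234
hpc_ssh_command hero abcd1234 x11
```

To actually log in, run the printed command, or replace echo by a direct call to ssh.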

For security reasons, access to the HPC systems is denied from certain subnets. In particular, you cannot log in from the WLAN of the University (uniolwlan) or from "public" PCs (located, e.g., in libraries, PC rooms, or at other places).

From outside the University (internet)

First, you have to establish a VPN tunnel to the University intranet. After that, you can log in to HERO or FLOW via ssh as described above. The tunnel parameters are:

Gateway       : vpn2.uni-oldenburg.de
Group name    : hpc-vpn
Group password: hqc-vqn

Cf. the instructions of the IT Services on how to configure the Cisco VPN client. For the HPC systems, a separate VPN tunnel has been installed, which is only accessible to users of FLOW and HERO. Therefore, you have to configure a new VPN connection and enter the data provided above. For security reasons, you cannot log in to FLOW or HERO if you are connected to the intranet via the "generic" VPN tunnel of the University.


User Environment

Compiling and linking

This section will be expanded later to provide more detailed information. For the time being, we only give a very brief overview of how to invoke the compilers and linkers and generate executables.

Serial programs

Intel compiler

Documentation

Parallel (MPI) programs

Two methods:

  • wrapper script (usually preferred method, since it keeps track

Job Management (Queueing) System

The queueing system employed to manage user jobs for FLOW and HERO is Sun Grid Engine (SGE). For first-time users (especially those acquainted with PBS-based systems), some features of SGE may seem a little unusual and will take some getting used to. In order to use the available hardware resources efficiently (so that all users may benefit the most from the system), a basic understanding of how SGE works is indispensable. Some of the points to keep in mind are the following:

  • Unlike other (e.g., PBS-based) queueing systems, SGE does not "know" the concept of "nodes" with a fixed number of CPUs (cores) and users specifying the number of nodes they need, along with the number of CPUs per node, in their job requirements. Instead, SGE logically divides the cluster into slots, where each "slot" may be thought of as a single CPU core. The scheduler assigns free slots to pending jobs. Since in the multi-core era each host offers many slots, this will, in general, lead to jobs of different users running concurrently on the same host (provided that there are sufficient resources like memory, disk space etc. to meet all requirements of all jobs, as specified by the users who submitted them) and usually guarantees efficient resource utilization.
  • While the scheduling behavior described above may be very efficient in optimally using the available hardware resources, it will have undesirable effects on parallel (MPI, LINDA, ...) jobs. E.g., an MPI job requesting 24 slots could end up running 3 tasks on one host, 12 tasks on another host (fully occupying this host, if it is a server with 2 six-core CPUs, as happens with our clusters), and 9 tasks on a third host. Clearly, such an unbalanced configuration may lead to problems. For certain jobs (multithreaded applications) it is even mandatory that all slots reside on one host (typical examples: OpenMP programs, Gaussian single-node jobs).
    To deal with the specific demands of parallel jobs, SGE offers so-called parallel environments (PEs) which are largely configurable. Even if your job does not need several hosts, but runs on only one host using several or all cores of that host, you must specify a parallel environment. It is of crucial importance to choose the "correct" parallel environment (meeting the requirements of your application/program) when submitting a parallel job.
  • Another "peculiarity" of SGE (as compared to its cousins) is the distinction between cluster queues and queue instances. Cluster queues are composed of several (typically, many) queue instances, with each instance associated with one particular host. A cluster queue may have a name like, e.g., standardqueue.q, where the .q suffix is a commonly followed convention. The queue instances of this queue then have names like, e.g., standardqueue.q@host001, standardqueue.q@host002, ... (note the "@" which acts as a delimiter between the queue name and the host name).
    In general, each host will hold several queue instances belonging to different cluster queues. E.g., there may be a special queue for long-running jobs and a queue for shorter jobs, both of which share the same "physical" machines but have different policies. To avoid oversubscription, resource limits can be configured for individual hosts. Since resource limits and other, more complex attributes can also be associated with cluster queues and even queue instances, the system is highly flexible and can be customized for specific needs. On the other hand, the configuration quickly tends to get rather complex, leading to unexpected side effects. E.g., PEs grab slots from all queue instances of all cluster queues they are associated with. Thus, a parallel job may occupy slots on one particular host belonging to different queue instances on that host. While this is usually no problem for the parallel job itself, it blocks resources in both cluster queues, which may be unintended. For that reason, it is common practice to associate each PE with one and only one cluster queue and to define several (possibly identically configured) PEs, so that no single PE spans several cluster queues.
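To see how the slots of a parallel job are actually distributed, one can inspect the host file that SGE passes to parallel jobs via the PE_HOSTFILE environment variable. The sketch below assumes the usual four-column format (host name, number of slots, queue instance, processor range). The sample data is made up to reproduce the 3/12/9 distribution mentioned above, and the function name summarize_hostfile and the host names are hypothetical.

```shell
#!/bin/bash
# Sketch: summarize how the slots of a parallel job are spread over hosts.
# Assumed line format: <host> <slots> <queue instance> <processor range>
summarize_hostfile() {
    local hostfile="$1"
    local total=0
    local host slots qinstance rest queue
    while read -r host slots qinstance rest; do
        [ -z "$host" ] && continue
        # The "@" separates the cluster queue name from the host name.
        queue="${qinstance%%@*}"
        echo "$host: $slots slot(s) in cluster queue $queue"
        total=$((total + slots))
    done < "$hostfile"
    echo "total: $total slot(s)"
}

# Made-up sample data; inside a real parallel job, PE_HOSTFILE is used instead.
sample_hostfile="$(mktemp)"
cat > "$sample_hostfile" <<'EOF'
host001 3 standardqueue.q@host001 UNDEFINED
host002 12 standardqueue.q@host002 UNDEFINED
host003 9 standardqueue.q@host003 UNDEFINED
EOF

summarize_hostfile "${PE_HOSTFILE:-$sample_hostfile}"
rm -f "$sample_hostfile"
```

Calling summarize_hostfile "$PE_HOSTFILE" near the top of a parallel job script records the slot distribution in the job output, which can help when diagnosing unbalanced runs.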

Submitting jobs

Sample job submission scripts for both serial and parallel jobs are provided in the subdirectory Examples of your homedirectory. You may have to adapt these scripts as needed. Note that a job submission script consists of two principal parts:

  • SGE directives (lines starting with the "magic" characters #$), which fall into three categories:
    • general options,
    • resource requirements (introduced by the -l flag), and
    • options for parallel jobs.
  • Commands to be executed by the job (your program, script, etc.), including the necessary set-up of the environment for the application/program to run correctly (loading of modules etc.).
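As an illustration of this two-part structure, a minimal serial job script might look as follows. This is a sketch under assumptions, not one of the scripts from the Examples directory: the job name and the resource values are placeholders, and which resource attributes (here h_rt and h_vmem) must be requested depends on the local configuration.

```shell
#!/bin/bash
# --- Part 1: SGE directives (read by qsub, plain comments to the shell) ---
# General options: shell, job name (placeholder), run in the submission
# directory, merge stderr into stdout.
#$ -S /bin/bash
#$ -N example_job
#$ -cwd
#$ -j y
# Resource requirements (placeholder values): wall-clock time, virtual memory.
#$ -l h_rt=01:00:00
#$ -l h_vmem=1G
# Options for parallel jobs would follow here (flag -pe with the name of a
# parallel environment and the number of slots).

# --- Part 2: commands executed by the job ---
# Set up the environment first; uncomment and adapt to the modules you need.
# module load <module_name>

# JOB_ID and NSLOTS are set by SGE inside a job; the defaults below only
# apply when the script is run by hand for testing.
job_id="${JOB_ID:-none}"
slots="${NSLOTS:-1}"
echo "job $job_id running on $(uname -n) with $slots slot(s)"

# Replace the following line with your own program.
# ./my_program
```

Such a script is handed to the queueing system with qsub followed by the name of the script file.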

Choosing the right queue

General philosophy: specify requirements, let SGE decide which queue your job best runs in (taking into account the current load of the system and other factors). However, in order to avoid undesirable side effects,

Running serial programs

Don't forget to load the required modules so that your program finds the runtime libraries it needs.
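One way to check whether all shared libraries can be resolved in the current environment is to look for "not found" entries in the output of ldd. The sketch below assumes a Linux system with ldd available; the function name missing_libs is hypothetical, and /bin/sh merely stands in for your own executable.

```shell
#!/bin/bash
# Sketch: list shared libraries that the dynamic linker cannot resolve.
# Prints nothing if every library is found (or if ldd cannot inspect the file).
missing_libs() {
    local program="$1"
    # ldd marks unresolved libraries with "not found".
    ldd "$program" 2>/dev/null | grep "not found" || true
}

# /bin/sh stands in for your own executable here.
if [ -z "$(missing_libs /bin/sh)" ]; then
    echo "all shared libraries resolved"
else
    echo "some libraries are missing; load the required modules first"
fi
```

Running the same check inside the job script, after the modules have been loaded, shows whether the batch environment matches the one the program was built in.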

Running parallel programs

SMP

LINDA

... tbc ...


Interactive jobs

Monitoring jobs

Documentation

Application Software and Libraries

Computational Chemistry

Gaussian

MOLCAS

not yet installed ... tbc ...

MOLPRO

not yet installed

Matlab

Advanced Usage

Here you will find, among other things, hints on how to analyse and optimize your programs using HPC tools (profiler, debugger, performance libraries), and other useful information.

... tbc ...