Difference between revisions of "Quickstart Guide"

Latest revision as of 12:46, 2 July 2020

This quick start guide helps you get started on the HPC clusters CARL and EDDY.

If you have questions that aren't answered in this guide, please contact the Scientific Computing

HPC Cluster Overview

The HPC cluster, located at the Carl von Ossietzky Universität Oldenburg, consists of two clusters named CARL and EDDY. They are connected via FDR Infiniband for parallel computations and parallel I/O. CARL uses an 8:1 blocking network topology and EDDY uses a fully non-blocking network topology. In addition, they are connected via an Ethernet network for management and IPMI. They also share a GPFS parallel file system with about 900 TB net capacity and 17/12 GB/s parallel read/write performance. Additional storage is provided by the central NAS system of the IT services.

Both clusters are based on the Lenovo NeXtScale system.

CARL (271 TFlop/s theoretical peak performance):

327 compute nodes (9 of these with a GPU)
7,640 CPU cores
77 TB of RAM
360 TB local storage

EDDY (201 TFlop/s theoretical peak performance):

244 compute nodes (3 of these with a GPU)
5,856 CPU cores
21 TB of RAM

For more detailed information about the cluster, see our Overview.

Account creation

To use the HPC cluster, you need a university account (e.g. abcd1234), and the account must be a member of a cluster group. Cluster groups are essentially the work groups that use the cluster. (A list and further information can be found here: Unix groups) Previously, HPC accounts were requested through a form on the Scientific Computing home page. This process is now automated, and you can join a group on your own. To do so, follow these steps:

visit servicedesk.uni-oldenburg.de
click on "SelfServiceDesk verwenden" ("Use SelfServiceDesk")
after a successful login, you will see many buttons for different services. To create an HPC account, click on "HPC Account beantragen und bearbeiten" ("Request or edit an HPC account")
at the bottom, you can choose the cluster group you want to join
choose one and submit your choice by clicking on "Übernehmen" ("Submit")

An administrator will then review and, in most cases, confirm your request. After that, you can log in to the cluster with your university account.

Note: If you accidentally joined the wrong group, or if you simply want to switch to another group, you can follow the same steps again. If you do so, keep in mind that you will have to adjust the group assignment of your files afterwards. The command for this operation looks like this:

chgrp -R YOUR_NEW_GROUP $HOME

Replace "YOUR_NEW_GROUP" with the name of the Unix group you joined. This recursively changes the group ownership of all your files in $HOME. Run the same command on $DATA and $WORK to cover those directories as well.

If there are files or directories that shouldn't be affected by the group change, you can use find to change the group selectively:

find $HOME -group YOUR_OLD_GROUP -exec chgrp YOUR_NEW_GROUP {} \;

This only changes the group of files and directories that currently belong to the group "YOUR_OLD_GROUP".
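Since the group change may need to be repeated for several storage areas, it can help to review the planned commands first. The following is a minimal sketch, not an official procedure: the function name plan_group_change is hypothetical, the group names are the placeholders used above, and it assumes the environment variables $DATA and $WORK are set as described in this guide. It only prints the commands and changes nothing.

```shell
# Placeholder group names; replace them with your own.
OLD_GROUP="YOUR_OLD_GROUP"
NEW_GROUP="YOUR_NEW_GROUP"

# Print one find/chgrp command per storage area instead of running it,
# so the plan can be reviewed first. Unset or empty variables are skipped.
plan_group_change() {
    for dir in "$HOME" "${DATA:-}" "${WORK:-}"; do
        [ -n "$dir" ] || continue
        echo "find $dir -group $OLD_GROUP -exec chgrp $NEW_GROUP {} +"
    done
}

plan_group_change
```

Each printed line can be copied and executed once it looks right.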

Login

You can use an SSH client of your choice, or the command line on Linux computers, to connect to the cluster via ssh. To do so, use either

carl.hpc.uni-oldenburg.de

or

eddy.hpc.uni-oldenburg.de

as the address. A full command on the Linux command line would look like this:

ssh abcd1234@carl.hpc.uni-oldenburg.de
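If you log in often, host aliases in your OpenSSH client configuration can save typing. The sketch below is only a suggestion: the function name print_hpc_ssh_config and the alias names carl and eddy are assumptions, and abcd1234 is the placeholder username from this guide. The function prints the configuration entries; it does not modify any file.

```shell
# Print ssh_config entries (Host, HostName, User) for both login addresses.
# The first argument is the university account name.
print_hpc_ssh_config() {
    user="$1"
    for cluster in carl eddy; do
        printf 'Host %s\n    HostName %s.hpc.uni-oldenburg.de\n    User %s\n\n' \
            "$cluster" "$cluster" "$user"
    done
}

print_hpc_ssh_config abcd1234
# To use the aliases, append the output to ~/.ssh/config, for example:
#   print_hpc_ssh_config abcd1234 >> ~/.ssh/config
```

With these entries in place, "ssh carl" connects to the same host and account as the full command above.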

For further information about the login, please see the guide on the page Login to the HPC cluster.

File System

The cluster offers two file systems: the GPFS Storage Server (GSS) and the central storage system of the IT services.

GPFS Storage Server (GSS):

parallel file system
total (net) capacity is about 900 TB
R/W performance is up to 17/12 GB/s over FDR Infiniband
can be mounted using SMB/NFS
used as the primary storage for HPC (for data that is read/written by compute nodes)
no backup!

Central storage system of the IT services (Isilon Storage System):

NFS-mounted $HOME-directories
high availability
snapshots
backup
used as permanent storage!

HOME: (Information about the old home directories can be found below!)

Path: /user/abcd1234
Environment variable: $HOME

DATA:

Path: /nfs/data/abcd1234
Environment variable: $DATA

WORK: (Information about the old work directories can be found below!)

Path: /gss/work/abcd1234
Environment variable: $WORK

Scratch:

Path: /scratch
Environment variable: $TMPDIR

Remember: "/scratch" (or $TMPDIR) is only available if you requested it in your jobscript. Further information can be found here.
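To check which of these storage locations are available in your current shell, you can print the environment variables listed above. This is a small sketch; the function name show_storage_vars is hypothetical. According to the note above, $TMPDIR would normally only point to the scratch space inside a job that requested it.

```shell
# Report the value of each storage-related environment variable,
# or state that it is not set in the current shell.
show_storage_vars() {
    for var in HOME DATA WORK TMPDIR; do
        # Look up the variable whose name is stored in $var.
        eval "value=\${$var:-}"
        if [ -n "$value" ]; then
            echo "$var=$value"
        else
            echo "$var is not set"
        fi
    done
}

show_storage_vars
```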

If you look at your "new" home directory, you will find two links: old_home_abcd1234 -> /bright/user/../abcd1234 and old_work_abcd1234 -> /bright/data/work/../abcd1234. These lead to your old home and work directories on HERO and FLOW. Please copy everything you really need to your new home directory (or to any other location you prefer). You can't write to or otherwise modify the old directories from the new cluster (you can still access your data by logging into the old cluster). The old home directories will remain available for some time, but not forever. Please make sure you have everything copied and backed up.

Further information can be found on the related page in the wiki: File system and Data Management

Software and Environment

Many software packages are pre-installed, including compilers, libraries, pre- and postprocessing tools, and other applications. They are managed with the command module.

With this command you can:

list the available software
access/load software (even in different versions)
etc.

Example: Show the software on CARL and EDDY and load the Intel compiler

[abcd1234@hpcl001 ~]$ module avail
-----------/cm/shared/uniol/modules/compiler-----------
... icc/2016.3.210
[abcd1234@hpcl001 ~]$ module load icc/2016.3.210
[abcd1234@hpcl001 ~]$ module list
Currently loaded modules: ... icc/2016.3.210 ...
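In scripts, it can be useful to check that the module command exists before loading software, for instance when the same script is also run on a machine outside the cluster. The following is a minimal sketch under that assumption; the function name load_and_list is hypothetical, and the module name is the one from the example above.

```shell
# Load one module and show the loaded modules afterwards.
# Returns 1 with a message if the module command is unavailable.
load_and_list() {
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not found" >&2
        return 1
    fi
    module load "$1" && module list
}

# Usage on a login node:
#   load_and_list icc/2016.3.210
```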

Basic Job Submission

The new workload manager and job management queueing system on CARL and EDDY is called SLURM. SLURM is a free and open-source job scheduler for Linux and Unix-like kernels and is used on about 60% of the world's supercomputers and computer clusters.

To submit a job on the HPC cluster you need two things:

the command sbatch
a jobscript

If you have your jobscript (an example is linked at the end) you can simply queue it with the command:

sbatch -p carl.p my_first_job.job

The option "-p" selects the partition. Choosing a suitable partition allows your job to start sooner; with an unsuitable one, the job may wait in the queue for a long time. We therefore recommend reading the wiki article about partitions.
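A minimal jobscript could look like the sketch below. The job name and the resource values (one task, ten minutes, 1 GB of memory) are assumptions chosen for illustration, not recommendations from the cluster documentation; only the partition name carl.p is taken from this guide. The helper describe_job is hypothetical and simply reports where the job ran.

```shell
#!/bin/bash
#SBATCH --job-name=my_first_job
#SBATCH --partition=carl.p
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --mem=1G
#SBATCH --output=my_first_job.%j.out

# SLURM sets SLURM_JOB_ID inside a job; outside of SLURM a label is used.
describe_job() {
    echo "job ${SLURM_JOB_ID:-not-in-slurm} on $(uname -n)"
}

describe_job
```

Because the partition is already set in the script, "-p" can be omitted when submitting it; options given on the sbatch command line take precedence over the #SBATCH lines.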

Once your job has been submitted successfully, you can check its status with

squeue -u abcd1234

As always: "abcd1234" is just a placeholder for your own username! Information such as JOBID, PARTITION, JOBNAME, USER, TIME and the number of NODES will be displayed. (Running squeue without any arguments lists every job of all users that is currently running or waiting.)
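Submission and status check can also be combined, so that only the job you just submitted is shown. This is a sketch under the assumption that the partition carl.p fits your job; the function name submit_and_show is hypothetical. It relies on the sbatch option --parsable, which prints only the job ID (optionally followed by a semicolon and the cluster name).

```shell
# Submit a jobscript and show the queue entry of the new job.
submit_and_show() {
    jobscript="$1"
    # --parsable makes sbatch print "jobid" or "jobid;cluster" only.
    jobid=$(sbatch --parsable -p carl.p "$jobscript") || return 1
    # Drop the optional ";cluster" suffix.
    jobid=${jobid%%;*}
    squeue -j "$jobid"
}

# Usage:
#   submit_and_show my_first_job.job
```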

Further information about the job submission and an example jobscript can be found on the related page in the wiki: SLURM Job Management (Queueing) System

@@ Line 1: / Line 1: @@
 This is a quick start guide to help you start to work on the HPC-clusters CARL and EDDY.
-If you have questions that arent answered in this guide, please contact the servicedesk of the it-services ('''servicedesk@uni-oldenburg.de''') or '''hpc@uni-oldenburg.de'''.
+If you have questions that arent answered in this guide, please contact the {{sc}}
 ==HPC Cluster Overview==
@@ Line 10: / Line 9: @@
 '''CARL (271 TFlop/s theoretical peak performance)''':
-*327 compute nodes
+*327 compute nodes (9 of these with a GPU)
 *7.640 CPU cores
 *77 TB of RAM
@@ Line 16: / Line 15: @@
 '''EDDY (201 TFlop/s theoretical peak performance)''':
-*244 compute nodes
+*244 compute nodes (3 of these with a GPU)
 *5.856 CPU cores
 *21 TB of RAM
 For more detailed informations about the cluster, you can visit our [[HPC Facilities of the University of Oldenburg 2016 | Overview]].
+==Account creation==
+If you want to use the HPC cluster, you need to have a university account (e.g. abcd1234) and the account needs to be part of a cluster group. Cluster groups are basically the work groups which are using the cluster. (You can find a list and other informations about that here: [[Unix groups]]) Until now we were using a form on the home page of the scientific computing were you could request an HPC account. This process is now automated and you are able to join a group on your own. To do so, follow these simple steps:
+*visit [https://servicedesk.uni-oldenburg.de servicedesk.uni-oldenburg.de]
+*click on "SelfServiceDesk verwenden" ("Use SelfServiceDesk")
+*after a succesfull login, you will see many buttons for different services. To create an HPC account, click on "HPC Account beantragen und bearbeiten" ("Edit or request a HPC account")
+*on the bottom, you can choose the cluster group you want to join
+*choose one and submit your choice by clicking on "Übernehmen" ("Submit")
+An administrator will then check, and most likely confirm your request. After that you will be able to use your university account for logging into the cluster.
+'''Note:''' If you accidentally joined the wrong group or if you simply want to switch to an other group, you can follow the steps described above too. If you do so, keep in mind that you will have to adjust the group assingment of your files afterwards. The command for this operation looks like this:
+ chgrp -R YOUR_NEW_GROUP $HOME
+Replace "YOUR_NEW_GROUP" with the name of the unix-group you joined. This will recursively change the permission of all your files in $HOME (and $DATA and $WORK).
+If there are any files or directories that shouldnt be affected by the group change, you could use ''find'' to specifically change the rights:
+ find $HOME -group YOUR_OLD_GROUP -exec chgrp YOUR_NEW_GROUP {} \;
+This will will only adjust the rights of files/directories which currently belong to the group "YOUR_OLD_GROUP".
 ==Login==
+You can use a SSH client of your choice or the command line on linux computers to connect to the cluster via ssh. To do so, use either
+ carl.hpc.uni-oldenburg.de
+or
+ eddy.hpc.uni.oldenburg.de
+as the address. A full command on the linux command line would look like this:
+ ssh abcd1234@carl.hpc.uni-oldenburg.de
+For further informations about the login, please look at the guide located on the page [[Login | Login to the HPC cluster]].
 ==File System==
+The cluster offers two files systems: The GPFS Storage Server (GSS) and the central storage system of the IT services.
+'''GPFS Storage Server (GSS):'''
+*parallel file system
+*total (net) capacity is about 900TB
+*R/W performance is up to 17/12 GB/s over FDR Infiniband
+*can be mounted using SMB/NFS
+*used as the primary storage for HPC (for data that is read/written by compute nodes)
+*'''no backup!'''
+'''Central storage system of the IT services (Isilon Storage System):'''
+*NFS-mounted $HOME-directories
+*high availability
+*snapshots
+*back
+*'''used as permanent storage!'''
+'''HOME:''' (Informations about the old home directories can be found below!)
+*'''Path:''' /user/abcd1234
+*'''Environment variable:''' $HOME
+'''DATA:'''
+*'''Path:''' /nfs/data/abcd1234
+*'''Environment variable:''' $DATA
+'''WORK:''' (Informations about the old work directories can be found below!)
+*'''Path:''' /gss/work/abcd1234
+*'''Environment variable:''' $WORK
+'''Scratch:'''
+*'''Path:''' /scratch
+*'''Environment varibale:''' $TMPDIR
+:'''Remember''': "/scratch" (or $TMPDIR) is only available if you demanded it in your jobscript. Further informations can be found [[File system and Data Management#Scratch space / TempDir | here]].
+If you look at your "new" home directory, you will find two links: ''old_home_abcd1234 -> /bright/user/../abcd1234'' and ''old_work_abcd1234 -> /bright/data/work/../abcd1234''. These lead to your old home directory on HERO and FLOW. Please copy everything you '''really''' need to your new home directory (or basically everywhere you want). You cant write or do anything else on the old home directories through the new cluster (you can still access your data by logging into the old cluster). The old home directories will be available for some time, but not forever. Please make sure you have everything copied and backed up.
+Further information can be found on the related page in the wiki: [[File system and Data Management]]
 ==Software and Environment==
+There are many pre-installed software packages like compilers, libraries, pre- and postprocessing tools and further applications provided. We are using the command '''module''' to manage them.
+With this command you can:
+*list the available software
+*access/load software (even in different versions)
+*etc..
+Example: Show the software on CARL and EDDY and load the '''Intel compiler'''
+ [abcd1234@hpcl001 ~]$ module avail
+ -----------/cm/shared/uniol/modules/compiler-----------
+ ... icc/2016.3.210
+ [abcd1234@hpcl001 ~]$ module load icc/2016.3.210
+ [abcd1234@hpcl001 ~]$ module list
+ Currenty loaded modules: ... icc/2016.3.210 ...
 ==Basic Job Submission==
+The new workload manager and job management queueing system on CARL and EDDY is called [[SLURM Job Management (Queueing) System | SLURM]]. SLURM is a free and open-source job scheduler for Linux and Unix-like kernels and is used on about 60% of the world's supercomputers and computer clusters.
+To submit a job on the HPC cluster you need two things:
+* the command '''sbatch'''
+* a jobscript
+If you have your jobscript (an example is linked at the end) you can simply queue it with the command:
+ sbatch -p carl.p my_first_job.job
+The option "'''-p'''" defines the used partition. Please keep in mind that choosing the right partition will allow your job to start faster. If you choose the wrong one it might take a while for your job to start. Therefore we recommend you to look at the wiki article about [[Partitions | partitions]].
+If you did submit your job sucessfully, you can check its status with
+ squeue -u abcd1234
+As always: "'''abcd1234'''" is just a placeholder for your own username! Informations like JOBID, PARTITION, JOBNAME, USER, TIME and the amound of NODES will be displayed. (simply running squeue without any arguements will list every job that is running/waiting etc. at the moment for all users)
+Further information about the job submission and an example jobscript can be found on the related page in the wiki: [[SLURM Job Management (Queueing) System]]
