Non-public templates: Coordination of Scientific Computing

For documentation and easy reuse, this page collects templates for e-mails and other material I use on a regular basis.

Application for a new user account

To apply for a new user account, an eligible user needs to specify three things:

their anonymous user name in the form abcd1234,
the working group (or ideally the unix group) they will be associated with, and
an approximate date until which the user account will be needed.
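
The three items above can be checked before a request is forwarded. The following is only a minimal sketch: the function name check_account_request is hypothetical, and the pattern (four lowercase letters followed by four digits) is an assumption read off the example "abcd1234", not a documented rule.

```python
import re
from datetime import date

# Assumption: user names consist of four lowercase letters followed by
# four digits, as suggested by the example "abcd1234".
USERNAME_RE = re.compile(r"[a-z]{4}[0-9]{4}")


def check_account_request(username, group, needed_until, today=None):
    """Return a list of problems with an account request (empty if none)."""
    today = today or date.today()
    problems = []
    if not USERNAME_RE.fullmatch(username):
        problems.append("user name does not have the form abcd1234")
    if not group.strip():
        problems.append("working group (or unix group) is missing")
    if needed_until <= today:
        problems.append("end date is not in the future")
    return problems
```

An empty list means that all three pieces of information are present and plausible; anything else can be sent back to the user as a query.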

No university user account yet

If the user does not yet have a university-wide anonymous user account, they first need to apply for one. An example e-mail with advice on how to get such a (guest) user account is listed below:

 
Dear Mr. NAME,

to obtain a user account for the HPC system, you must already have a
university-wide, anonymous user account. As a guest of a working group, you
can apply for a corresponding guest account at the IT services. To do so,
please visit the page

http://www.uni-oldenburg.de/itdienste/services/nutzerkonto/gaeste-der-universitaet/

and choose the option "Gastkonto einrichten" (set up guest account). Start the
workflow for creating a guest account. As the responsible person, enter the
head of the university organizational unit that supports your project. Ask
that person to open the e-mail they receive, follow the link it contains, and
approve the application. The account will then be created automatically. Your
anonymous user account will have the form "abcd1234".

So that I can activate your user account for the HPC system, please send me
the following details:

1) the anonymous user name for which the HPC account should be created,
2) the name of the working group you should be assigned to,
3) an expected period of validity for the required HPC account.

As soon as your HPC account is activated, I will get back to you with further
information.

With kind regards
Oliver Melchert

User account HPC system: Mail to IT-Services

Once the user has supplied the above information, you can apply for an HPC user account at the IT-Services using an e-mail similar to:

 
Mail to: felix.thole@uni-oldenburg.de; juergen.weiss@uni-oldenburg.de
Subject: [HPC-HERO] Creation of a user account

Dear Mr. Thole,
dear Mr. Weiss,

I hereby request the creation of an HPC account for
Mr. NAME

abcd1234; UNIX-GROUP

The account will presumably be needed until DATE.

With kind regards
Oliver Melchert

If no suitable unix group exists yet, send an e-mail similar to the following instead:

 
Mail to: felix.thole@uni-oldenburg.de; juergen.weiss@uni-oldenburg.de
Subject: [HPC-HERO] Creation of a user account

Hello Felix,
hello Jürgen,

I hereby request the creation of an HPC account for Mr. NAME

abcd1234

The account will presumably be needed until DATE.

Mr. NAME is a member of the working group "AG-NAME" (AG-URL) headed by Prof. NAME AG-LEITER.
This working group does not have its own unix group yet. Could a new unix group therefore
be created for the working group and integrated into the existing group hierarchy?

I suggest the name

agUNIX-GROUP-NAME

for the unix group. The working group belongs to faculty FAKULTAET.

With kind regards
Oliver Melchert

User account HPC system: Mail back to user

As soon as you get feedback from the IT-Services that the account was created, send an email to the user similar to the following:

 
Betreff: [HPC-HERO] HPC user account

Sehr geehrter Herr NAME,

die IT-Dienste haben Ihren HPC Account bereits freigeschaltet. Ihr Loginname
ist

abcd1234

und Sie sind der Unix-Gruppe

UNIX-GROUP-NAME

zugeordnet. 

Sie verfügen über 100GB Plattenspeicher auf dem lokalen Filesystem (mit
vollem Backup). Wenn Sie über einen begrenzten Zeitraum mehr Speicherplatz
benötigen können Sie mich gerne diesbezüglich anschreiben. Ihren aktuellen
Speicherverbrauch auf dem HPC System können Sie mittels "iquota" einsehen. An
jedem Sonntag werden Sie eine Email mit dem Betreff "Your weekly HPC Quota
Report" erhalten, die Ihren aktuellen Speicherverbrauch zusammenfasst.

Anbei sende ich Ihnen einen Link zu unserem HPC user wiki, auf dem Sie weitere
Details über das lokale HPC System erhalten 
http://wiki.hpcuser.uni-oldenburg.de/index.php?title=Main_Page

Der Beitrag "Brief Introduction to HPC Computing" unter
http://wiki.hpcuser.uni-oldenburg.de/index.php?title=Brief_Introduction_to_HPC_Computing
illustriert einige einfache Beispiele zur Nutzung der verschiedenen
(hauptsächlich parallelen) Anwendungsumgebungen die auf HERO zur Verfügung
stehen und ist daher besonders zu empfehlen. Er diskutiert außerdem einige
andere Themen, wie z.B. geeignetes Alloziieren von Ressourcen und Debugging.

Wenn Sie planen die parallelen Ressourcen von MATLAB auf HERO zu nutzen kann
ich Ihnen die Beiträge "MATLAB Distributed Computing Server" (MDCS) unter 
http://wiki.hpcuser.uni-oldenburg.de/index.php?title=MATLAB_Distributing_Computing_Server 
und "MATLAB Examples using MDCS" unter
http://wiki.hpcuser.uni-oldenburg.de/index.php?title=Matlab_Examples_using_MDCS
empfehlen. Der erste Beitrag zeigt, wie man das lokale Nutzerprofil für die
Nutzung von MATLAB auf HERO konfigurieren kann, und der zweite beinhaltet einige
Beispiele und diskutiert gelegentlich auftretende Probleme im Umgang mit MDCS.

Viele Grüße
Oliver Melchert

English variant of the above e-mail:

 
Subject: [HPC-HERO] HPC user account

Dear NAME,

the IT-Services were now able to activate your HPC account. Your login name to
the HPC system is 

abcd1234

and you are integrated in the group

UNIX-GROUP-NAME

By default you have 100GB of storage on the local filesystem, which is fully
backed up. If you need more storage for a limited period of time, you can
contact me. Note that you can check your disk usage on the HPC system
via the command "iquota". In addition, each Sunday you will receive an
email titled "Your weekly HPC Quota Report", summarizing your current disk
usage.

Below is a link to the HPC user wiki, where you can find further
details on the HPC system:
http://wiki.hpcuser.uni-oldenburg.de/index.php?title=Main_Page

In particular I recommend the "Brief Introduction to HPC Computing" at
http://wiki.hpcuser.uni-oldenburg.de/index.php?title=Brief_Introduction_to_HPC_Computing
which illustrates several basic examples related to different (mostly parallel)
environments the HPC system HERO offers and discusses a variety of other
topics, such as proper resource allocation and debugging.

Further, if you plan to use the parallel capabilities of MATLAB on HERO, I
recommend the "MATLAB Distributed Computing Server" (MDCS) page at 
http://wiki.hpcuser.uni-oldenburg.de/index.php?title=MATLAB_Distributing_Computing_Server 
and the "MATLAB Examples using MDCS" wiki page at
http://wiki.hpcuser.uni-oldenburg.de/index.php?title=Matlab_Examples_using_MDCS
These pages summarize how to properly set up your profile for using MATLAB on HERO
and discuss some of the frequently appearing problems.

With kind regards
Oliver

User account HPC system: Mail back to user; Fak 2 (STATA users)

New users from Fak 2 most likely want to use the STATA software. An adapted version of the above e-mail reads:

 
Dear MY_NAME,

the IT-Services have already activated your HPC account. Your login name to
the HPC system is

LOGIN_NAME

and you are associated with the unix group

UNIX_GROUP

This is also reflected by the structure of the filesystem on the HPC system.

By default you have 100GB of storage on the local filesystem, which is fully
backed up. If you need more storage for a limited period of time, you can
contact me. Note that you can check your disk usage on the HPC system
via the command "iquota". In addition, each Sunday you will receive an
email titled "Your weekly HPC Quota Report", summarizing your current disk
usage.

Below is a link to the HPC user wiki, where you can find further details
on the HPC system:

http://wiki.hpcuser.uni-oldenburg.de/index.php?title=Main_Page

If you plan to use the parallel capabilities of STATA on HERO, I recommend the
"STATA" entry at

Main Page > Application Software and Libraries > Mathematics/Scripting > STATA

see: http://wiki.hpcuser.uni-oldenburg.de/index.php?title=STATA
The above page summarizes how to access the HPC System and how to successfully 
submit a STATA job. 

With kind regards
Dr. Oliver Melchert

Temporary extension of disk quota

Sometimes a user from the theoretical chemistry group needs a temporary extension of the available backed-up disk space. Ask them to provide

the total amount of disk space needed (they can check their current limit with the unix command iquota)
an estimated date until which the extension is required

Mail to IT-Services

Then send an e-mail similar to the one listed below to the IT-Services:

 
Mail to: felix.thole@uni-oldenburg.de; juergen.weiss@uni-oldenburg.de
Subject: [HPC-HERO] Increase of a user's available disk space

Hello Felix,
hello Jürgen,

the HPC user NAME

abcd1234; UNIX-GROUP

has asked for a temporary increase of his disk quota. He requests
an increase to a total volume of

500GB

which is needed until the end of December 2013. After that he can
archive the data accordingly and the disk quota can be reset
again.

Best regards,
Oliver

List of users with nonstandard quota

Users that currently enjoy an extended disk quota:

 
NAME                              ID        QUOTA  UNTIL
jan.mitschker@uni-oldenburg.de    dumu7717  1TB    no limit given
hendrik.spieker@uni-oldenburg.de  rexi0814  300GB  end of September 2013
wilke.dononelli@uni-oldenburg.de  juro9204  700GB  end of December 2013
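
To see how much space beyond the default the listed users have been granted, the sizes can be converted to a common unit. This is a rough sketch only: the helper quota_in_gb is hypothetical, the default of 100GB is taken from the account e-mails above, and decimal units (1TB = 1000GB) are an assumption that may not match how the filesystem counts.

```python
DEFAULT_QUOTA_GB = 100  # default quota stated in the account e-mails

# Assumption: decimal units (1 TB = 1000 GB); the system may use binary units.
UNIT_IN_GB = {"GB": 1, "TB": 1000}


def quota_in_gb(size):
    """Convert a size string such as '300GB' or '1TB' to gigabytes."""
    number, unit = size[:-2], size[-2:].upper()
    return int(number) * UNIT_IN_GB[unit]


# User IDs and sizes as given in the list above.
extended = {"dumu7717": "1TB", "rexi0814": "300GB", "juro9204": "700GB"}

# Space granted on top of the default quota, per user and in total.
extra = {uid: quota_in_gb(size) - DEFAULT_QUOTA_GB for uid, size in extended.items()}
total_extra_gb = sum(extra.values())
```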

Cluster downtime

If the cluster needs a maintenance downtime, send an e-mail similar to the following to the mailing list of the HPC users:

 
Mail to: hpc-hero@listserv.uni-oldenburg.de
Subject: [HPC-HERO] Maintenance downtime 11-13 June 2013 (announcement)

Dear Users of the HPC facilities,

this is to inform you about an overdue THREE-DAY MAINTENANCE DOWNTIME

FROM: Tuesday 11th June 2013, 7 am
TO: Thursday 13th June 2013, 4 pm

This downtime window is required for essential maintenance work on
particular hardware components of HERO. Ultimately, the scheduled downtime will
fix longstanding issues caused by malfunctioning network switches. Please note
that all running jobs will be killed if they have not finished by 11th June,
7 am. During the scheduled downtime, all queues and filesystems will be
unavailable. We expect the HPC facilities to resume on Thursday afternoon.

I will remind you about the upcoming three-day maintenance downtime at
irregular intervals.

Please accept my apologies for any inconvenience caused
Oliver Melchert

If the downtime needs to be extended, send an e-mail similar to:

 
Mail to: hpc-hero@listserv.uni-oldenburg.de
Subject: [HPC-HERO] Delay returning the HPC system HERO to production status

Dear Users of the HPC Facilities,

we are currently experiencing a DELAY RETURNING THE HPC SYSTEM TO PRODUCTION STATUS
since the necessary change of the hardware components took longer than
originally expected. The HPC facilities are expected to finally resume service
by

Friday 14th June 2013, 15:00 

We will notify you as soon as everything is back online. 

With kind regards
Oliver Melchert

You do not need to supply many details yet. However, if another extension is necessary, you should provide some details; otherwise, prepare for complaints from the users. Your e-mail could then look similar to:

 
Mail to: hpc-hero@listserv.uni-oldenburg.de
Subject: [HPC-HERO] Further delay returning the HPC system HERO to production status

Dear Users of the HPC Facilities,

as already communicated yesterday, we are currently experiencing a DELAY RETURNING
THE HPC SYSTEM TO PRODUCTION STATUS. The delay results from difficulties related to
the maintenance work on the hardware components of HERO.

The original schedule for the maintenance work could not be kept. Some details
of the maintenance process are listed below:

According to the IT-Services, the replacement of the old (malfunctioning)
network switches by IBM engineers worked out well (with no delay). However, the
configuration of the components by pro-com engineers took longer than the
previously estimated single day, causing the current delay. Once the
configuration process is completed, the IT-Services staff needs to perform
several tests, firmware updates and application tests, which will take
approximately one day. After the last step is completed, the HPC facilities
will finally return to production status.

In view of the above difficulties we ask for your understanding that the HPC
facilities will not be back up by today, 15:00. We hope that the HPC facilities
will resume service by

Monday 17th June 2013, 16:00 

We will notify you as soon as everything is back online and apologize for the 
inconvenience.
 
With kind regards
Oliver Melchert

Once the HPC system is up and running again, send an e-mail similar to:

 
Mail to: hpc-hero@listserv.uni-oldenburg.de
Subject: [HPC-HERO] HPC systems have returned to production

Dear Users of the HPC Facilities,

this is to inform you that the maintenance work on the HPC systems has been
completed and the HPC component HERO has returned to production: HERO accepts
logins and has already started to process jobs.

Thank you for your patience and please accept my apologies for the extension of
the maintenance downtime and any inconvenience this might have caused
Oliver Melchert

Large Matlab Jobs

Some Matlab users submit jobs with the maximum allowed number of workers (Matlab's term for slots), i.e. 36. Usually these jobs get distributed over many hosts, e.g.:

 
job-ID  prior   name       user         state submit/start at     queue                  master ja-task-ID 
----------------------------------------------------------------------------------------------------------
1040328 0.51109 Job16      nixi9106     r     10/07/2013 18:19:48 mpc_std_shrt.q@mpcs004 MASTER        
                                                                  mpc_std_shrt.q@mpcs004 SLAVE         
                                                                  mpc_std_shrt.q@mpcs004 SLAVE         
1040328 0.51109 Job16      nixi9106     r     10/07/2013 18:19:48 mpc_std_shrt.q@mpcs008 SLAVE         
                                                                  mpc_std_shrt.q@mpcs008 SLAVE         
1040328 0.51109 Job16      nixi9106     r     10/07/2013 18:19:48 mpc_std_shrt.q@mpcs032 SLAVE         
                                                                  mpc_std_shrt.q@mpcs032 SLAVE         
1040328 0.51109 Job16      nixi9106     r     10/07/2013 18:19:48 mpc_std_shrt.q@mpcs034 SLAVE         
1040328 0.51109 Job16      nixi9106     r     10/07/2013 18:19:48 mpc_std_shrt.q@mpcs036 SLAVE         
1040328 0.51109 Job16      nixi9106     r     10/07/2013 18:19:48 mpc_std_shrt.q@mpcs038 SLAVE         
1040328 0.51109 Job16      nixi9106     r     10/07/2013 18:19:48 mpc_std_shrt.q@mpcs043 SLAVE         
1040328 0.51109 Job16      nixi9106     r     10/07/2013 18:19:48 mpc_std_shrt.q@mpcs045 SLAVE         
1040328 0.51109 Job16      nixi9106     r     10/07/2013 18:19:48 mpc_std_shrt.q@mpcs052 SLAVE         
1040328 0.51109 Job16      nixi9106     r     10/07/2013 18:19:48 mpc_std_shrt.q@mpcs066 SLAVE         
1040328 0.51109 Job16      nixi9106     r     10/07/2013 18:19:48 mpc_std_shrt.q@mpcs070 SLAVE         
                                                                  mpc_std_shrt.q@mpcs070 SLAVE         
1040328 0.51109 Job16      nixi9106     r     10/07/2013 18:19:48 mpc_std_shrt.q@mpcs076 SLAVE         
1040328 0.51109 Job16      nixi9106     r     10/07/2013 18:19:48 mpc_std_shrt.q@mpcs080 SLAVE         
                                                                  mpc_std_shrt.q@mpcs080 SLAVE         
                                                                  mpc_std_shrt.q@mpcs080 SLAVE         
                                                                  mpc_std_shrt.q@mpcs080 SLAVE         
1040328 0.51109 Job16      nixi9106     r     10/07/2013 18:19:48 mpc_std_shrt.q@mpcs087 SLAVE         
1040328 0.51109 Job16      nixi9106     r     10/07/2013 18:19:48 mpc_std_shrt.q@mpcs089 SLAVE         
1040328 0.51109 Job16      nixi9106     r     10/07/2013 18:19:48 mpc_std_shrt.q@mpcs090 SLAVE         
1040328 0.51109 Job16      nixi9106     r     10/07/2013 18:19:48 mpc_std_shrt.q@mpcs091 SLAVE         
1040328 0.51109 Job16      nixi9106     r     10/07/2013 18:19:48 mpc_std_shrt.q@mpcs099 SLAVE         
1040328 0.51109 Job16      nixi9106     r     10/07/2013 18:19:48 mpc_std_shrt.q@mpcs107 SLAVE         
1040328 0.51109 Job16      nixi9106     r     10/07/2013 18:19:48 mpc_std_shrt.q@mpcs110 SLAVE         
1040328 0.51109 Job16      nixi9106     r     10/07/2013 18:19:48 mpc_std_shrt.q@mpcs111 SLAVE         
1040328 0.51109 Job16      nixi9106     r     10/07/2013 18:19:48 mpc_std_shrt.q@mpcs112 SLAVE         
                                                                  mpc_std_shrt.q@mpcs112 SLAVE         
1040328 0.51109 Job16      nixi9106     r     10/07/2013 18:19:48 mpc_std_shrt.q@mpcs117 SLAVE         
                                                                  mpc_std_shrt.q@mpcs117 SLAVE         
                                                                  mpc_std_shrt.q@mpcs117 SLAVE         
                                                                  mpc_std_shrt.q@mpcs117 SLAVE         
                                                                  mpc_std_shrt.q@mpcs117 SLAVE         
                                                                  mpc_std_shrt.q@mpcs117 SLAVE

If the jobs perform a lot of I/O, this puts a heavy strain on the filesystem. Such large jobs also suffer from the "parallel job memory issue": the master process has to account (in terms of memory) for all the connections to the other host machines, and if the master process runs out of memory, the job gets killed. Jobs with 8 slots are more common, and jobs with even fewer slots are more common still.
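
To judge how widely such a job is spread, the listing can be summarized per host. The sketch below assumes the listing is available as plain text in the format shown above and counts only SLAVE entries (in the example listing these add up to the 36 requested slots); the function name slots_per_host and the shortened sample are illustrative only.

```python
import re
from collections import Counter

# Matches "<queue>@<host> SLAVE" on full lines and on continuation lines.
SLAVE_RE = re.compile(r"(\S+)@(\S+)\s+SLAVE\b")


def slots_per_host(listing):
    """Count SLAVE entries per host in a job listing of the format above."""
    return Counter(match.group(2) for match in SLAVE_RE.finditer(listing))


# Shortened sample in the same format as the listing above.
SAMPLE = """\
1040328 0.51109 Job16 nixi9106 r 10/07/2013 18:19:48 mpc_std_shrt.q@mpcs004 MASTER
                                                     mpc_std_shrt.q@mpcs004 SLAVE
                                                     mpc_std_shrt.q@mpcs004 SLAVE
1040328 0.51109 Job16 nixi9106 r 10/07/2013 18:19:48 mpc_std_shrt.q@mpcs008 SLAVE
                                                     mpc_std_shrt.q@mpcs008 SLAVE
1040328 0.51109 Job16 nixi9106 r 10/07/2013 18:19:48 mpc_std_shrt.q@mpcs034 SLAVE
"""

counts = slots_per_host(SAMPLE)
hosts_used = len(counts)            # number of distinct hosts
slots_used = sum(counts.values())   # total number of SLAVE entries
```

A job whose slots_used is large while most hosts carry only one or two slots is a candidate for the memory issue described above, since the master process then holds connections to many different machines.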