Difference between revisions of "FAQ"

From HPC users

Revision as of 15:02, 4 May 2020

Introduction

Our wiki contains a lot of information about working in our HPC environment, much of it very detailed. For beginners, however, it can be hard to find the right place to start.

This FAQ is meant to help with that. It is aimed at absolute beginners and links to our wiki resources where needed.

Advanced users may find some of the answers helpful as well.


If you think that important questions or answers are missing, please let us know. This whole wiki is permanently a work in progress.

F.A.Q.

The very basics - about your account

Question: What exactly is an HPC Cluster?

Answer: An HPC cluster is a group of several high-performance computers. HPC clusters are used when the performance of a single ordinary computer is no longer sufficient for (scientific) computations.
For comparison: an average, well-equipped PC has a processing unit (CPU) with about 4 cores and 8 to 12 gigabytes of RAM. Our standard nodes have 24 cores with 256 GB RAM for each core. If a complete node is used to full capacity, 6144 gigabytes of RAM can be used! (A node can be seen as a single computer within the cluster.)

Question: Am I permitted to work on the cluster?

Answer: In principle, every student or member of the scientific staff is entitled to work on the HPC cluster, as long as the computations have a legitimate scientific purpose (which of course includes students' education).

Nonetheless, there are a few things to consider. There are three common use cases, which we will briefly describe:

  • You are writing your Bachelor / Master / PhD thesis. In this case, it is very likely that you are already part of a workgroup. Just tell your thesis advisor that you need to take your work to the cluster and create a request (see next question).
  • You take part in a seminar that happens to work with the cluster. In this case, you don't have to do anything. Your lecturer will take care of everything regarding the HPC login, and you will get provisional login data. Your personal user account will not be touched. But after the course, you will no longer be able or permitted to use that provisional account!
  • You are not writing your thesis, you are not taking part in a seminar, and you are not part of a workgroup, but you want to use the HPC cluster anyway. If this is the case, please contact us at hpcsupport@uol.de. Either you will be assigned to a suitable group (after consulting the corresponding professor) or you could get your own workgroup. Either way, we will very likely find a solution that fits your needs.

Question: I decided to work on the cluster. How do I get access?

Answer: To get access to our cluster, you need to be part of a workgroup, as mentioned above. If you are, you can request access via the Self Service Portal of our ServiceDesk. Since we already have a step-by-step description of how to submit a request, we refer you to the instruction page. If your workgroup situation is unclear, just write to us at hpcsupport@uol.de.

Question: I now have access rights for the cluster. How do I log in?

Answer: First of all, congratulations on your new HPC membership!
Now you can start working on our cluster. Depending on your operating system (Windows, Linux, or Mac), the procedure is slightly different. If you have a choice, we recommend Linux, since Linux-to-Linux connections are the least prone to problems (the HPC cluster environment is based on the Red Hat Enterprise Linux distribution).
In short: on Linux, open a terminal and type

ssh abcd1234@carl.hpc.uni-oldenburg.de

Replace abcd1234 with your own user name. On Windows, you use the same address, but you additionally need an SSH-capable program such as MobaXterm or PuTTY.
To avoid redundancy, we refer you to our wiki page about login, where you can find a more detailed description of how to access the cluster from Linux and/or Windows.
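If you log in often, an entry in the SSH client configuration on your own computer can save some typing. The following is only a minimal sketch: the alias carl and the helper name print_ssh_host_entry are assumptions made for this example, while the host name is the one given above.

```shell
# Hypothetical helper: print an SSH client configuration entry for the cluster.
# Argument 1: your user name (abcd1234 is only a placeholder).
print_ssh_host_entry() {
    user_name=$1
    cat <<EOF
Host carl
    HostName carl.hpc.uni-oldenburg.de
    User $user_name
EOF
}

# Show the entry for the placeholder account.
print_ssh_host_entry abcd1234
```

Appending the output to ~/.ssh/config on your own computer (print_ssh_host_entry abcd1234 >> ~/.ssh/config) should then allow you to connect with ssh carl.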

Working on the cluster

Question: I want to start computing. What are the first steps?

Answer: Basically, you need two things:
(1) a software module and (2) a job script.
(1) Let's assume that you have already logged in successfully. The first thing you need to know is which software you want to use. For an overview of the software currently available on our cluster, take a look at our software register or type ml av to list the software for your current environment.
Let's assume you have chosen the software you need for your calculations, say EGSnrc. Software always has to be loaded before you can use it. The software's page shows that it is installed in the environment hpc-env/6.4. (Fortunately, EGSnrc has a detailed software page on our wiki.)
This means we have to load the environment first and then the software module:

module load hpc-env/6.4
module load EGSnrc

(You can abbreviate module load with ml.)
Now that EGSnrc is loaded, you could start using it. But you are currently logged in to one of our five login nodes (hpcl001-hpcl005). What you need to do is move your calculations to another node. For this, you need (2) an sbatch script (or job script), with which you can bundle your commands into one job, run it on another node, and allocate specific system resources to the job. After writing the script, you submit it with SLURM. We describe this procedure here, but for EGSnrc, there is an additional script example.

Creating job scripts is mandatory on our cluster!
Job scripts don't just allocate system resources to your tasks; SLURM also queues every job so that the resources are shared fairly.
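To give an impression of what such a job script can look like, here is a minimal sketch for the EGSnrc example. The helper name print_example_job_script, the job name, and all resource values are assumptions chosen for illustration; the partition carl.p is the one that appears in the squeue example in the Jobs and Queue section and may not be the right one for your job.

```shell
# Hypothetical helper: print a minimal job script for the EGSnrc example.
# All resource values are placeholders; adjust them to your job.
print_example_job_script() {
    cat <<'EOF'
#!/bin/bash
# Name shown in the queue
#SBATCH --job-name=egs_test
# Partition to run in
#SBATCH --partition=carl.p
# Number of tasks
#SBATCH --ntasks=1
# Time limit in the format days-hours:minutes
#SBATCH --time=0-01:00
# Memory per node
#SBATCH --mem=2G
# Output file; %j is replaced by the job ID
#SBATCH --output=egs_test.%j.out

# Load the environment first, then the software module.
module load hpc-env/6.4
module load EGSnrc

# Replace this line with the actual EGSnrc command for your simulation.
echo "Job running on $(hostname)"
EOF
}

print_example_job_script
```

On the cluster, print_example_job_script > egs_test.job followed by sbatch egs_test.job would write the script to a file and submit it.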

Question: I need to work with specific software (versions). What can I do?

Answer: There are three different ways to get new software. But first, you should check whether your software is already installed:

module spider desired_software

You should also check our software register. If you are sure that we currently don't provide the software you need, you have the following options:

  • Ask us to install your software package as a module. Write to us at hpcsupport@uol.de and name the software and, if you have it at hand, its source address (e.g. GitHub, homepage, etc.).
  • Install it yourself, for your own use, with Conda.
  • Create a container with the desired software package using Singularity.

If you think that the software could be useful to other scientists as well, you should prefer the first option and write to us. That way, we can provide the software for everybody.

Question: I'm not at the university right now. Can I use the cluster from home?

Answer: Yes, you can.
But since our cluster may only be used from within the university's network, you will need to use a VPN client. This way, your computer builds a bridge to the university's network and behaves as if you were working on campus.
Once the connection is set up, you can start working as usual.

Question: I work with a significant amount of data. Does it matter where I store it?

Answer: YES, it matters!

There are four different file systems.

  • $HOME (1TB): Here you store the most important and frequently used data, like scripts, results from data analysis, etc.
    • Snapshots and backup system.
  • $DATA (20TB): Here you can store data from simulations for ongoing analysis, etc.
    • Fast read/write access, snapshots, backup system.
  • $WORK (50TB): Here you store the files used during simulations. If you will need the same files or results again in the coming weeks, keep them here. If you won't touch them for a long time, please transfer the important data to $DATA and delete the rest.
    • Fast read/write access, neither snapshots nor backup system.
  • $SCRATCH (1-2TB per node/per job): This file system differs significantly from the others: all data on $SCRATCH is deleted immediately after the job ends. In return, this storage is extremely fast. Use it only for I/O-heavy jobs (e.g. random access) and ALWAYS write your job scripts so that the results are moved to $DATA before the job ends; otherwise everything was in vain.
    • USE WITH CARE! Extremely fast but volatile storage. Neither backup nor snapshots (naturally).

For more information, take a look at the corresponding wiki page.

TLDR: Use $WORK to do simulations and store the results to $DATA.
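Because everything on $SCRATCH disappears when the job ends, the copy-back step is best built into the job script itself. The following is a minimal sketch of that pattern; the helper name run_with_staging and the paths in the usage note are assumptions, not part of our setup.

```shell
# Hypothetical helper: run a command inside a scratch directory and copy
# everything it produced to a permanent directory afterwards.
# Usage: run_with_staging <scratch_dir> <result_dir> <command> [args...]
run_with_staging() {
    scratch_dir=$1
    result_dir=$2
    shift 2

    mkdir -p "$result_dir" || return 1

    # Run the actual computation inside the scratch directory and
    # remember its exit status.
    status=0
    ( cd "$scratch_dir" && "$@" ) || status=$?

    # Copy the results back even if the command failed, so that partial
    # output survives the end of the job.
    cp -R "$scratch_dir"/. "$result_dir"/ || return 1

    return "$status"
}
```

Inside a job script, a call could look like run_with_staging "$SCRATCH" "$DATA/my_project/results" ./my_simulation, where my_project and my_simulation are placeholders.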

Question: I accidentally deleted / overwrote an important file of mine! Is there any way to undo this mistake?

Answer: Yes, there is! (But it depends on how quickly you noticed the deletion.)
On $HOME and $DATA, we have a snapshot system for this very issue. Just navigate to the missing file's directory and type

cd .snapshots

Here you will see the snapshots of the last 30 days. If your accident happened on $HOME and on the same day, you even have access to hourly snapshots. Choose the snapshot from the appropriate point in time, navigate to the lost file, and copy it back to its original location. For more on this, see the corresponding wiki page (File_system_and_Data_Management#Snapshots_on_the_ESS).
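The restore itself is an ordinary copy. Here is a minimal sketch, assuming that the file sits directly below .snapshots/<snapshot name>/ in the directory it was lost from; the helper name restore_from_snapshot and the suffix .before_restore are made up for this example, and the real snapshot names are whatever ls .snapshots shows.

```shell
# Hypothetical helper: restore one file from a snapshot of the current directory.
# Usage: restore_from_snapshot <snapshot_name> <file_name>
restore_from_snapshot() {
    snapshot=$1
    file=$2
    source_file=".snapshots/$snapshot/$file"

    if [ ! -e "$source_file" ]; then
        echo "not found: $source_file" >&2
        return 1
    fi

    # Keep the current version, if there is one, instead of overwriting it.
    if [ -e "$file" ]; then
        mv "$file" "$file.before_restore" || return 1
    fi

    # -p preserves the modification time and permissions of the snapshot copy.
    cp -p "$source_file" "$file"
}
```

Run it from the directory where the file used to be, for example restore_from_snapshot <snapshot name> thesis.tex.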

Question: I regularly get error messages concerning the language setting and locale.

Answer: This error is caused by a mismatch between the language settings of the host and the client system. It can occur, for example, if the cluster, which is configured for English, is accessed from an operating system configured for German. The solution is a single command on the cluster side:

export LC_ALL="en_US.utf8"

If you see the error message often, you can also add this command to your ~/.bashrc file.
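If you script the change to ~/.bashrc, it is worth making sure the line is added only once. A minimal sketch; the helper name add_locale_export is an assumption for this example.

```shell
# Hypothetical helper: add the locale export to a startup file exactly once.
# Argument 1 (optional): the file to modify, ~/.bashrc by default.
add_locale_export() {
    rc_file=${1:-"$HOME/.bashrc"}
    line='export LC_ALL="en_US.utf8"'

    # -x matches whole lines only, -F treats the pattern as a fixed string.
    if ! grep -qxF "$line" "$rc_file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$rc_file"
    fi
}
```

Calling add_locale_export without arguments on the cluster modifies ~/.bashrc; calling it a second time changes nothing.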

Groups and Accounts

Question: I'm in a new workgroup. How do I change the Unix group?

Answer: You can follow the same steps as described above under "How do I get access?". Just request access to your new workgroup and your groups will be changed automatically. It is best to mention in the web form that this is a change request.

Question: I started working at the university and have a second account now. How can I transfer files between my two accounts?

Answer: Log in with your old account and transfer the files you need to your new account with rsync:

rsync -avz $HOME/source_directory abcd1234@carl.hpc.uni-oldenburg.de:/user/abcd1234/target_directory

Where abcd1234 is your new account.
See also: File System and Data Management

Question: I recently changed my research group. How can I also change my unix group on the cluster?

Answer: You can log in to the self-service desk of the university and request a group change there (go to IT Services, then Wissenschaftliches Rechnen, and finally click on Zugang beantragen). Please note that you will have to change the group membership of your files and directories manually after you have been assigned to the new Unix group.
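Changing the group of existing files can be done with find and chgrp. The following is a minimal sketch; the helper name regroup_tree and the group name in the usage note are placeholders.

```shell
# Hypothetical helper: move all of your own files below a directory to a new group.
# Usage: regroup_tree <new_group> <directory>
regroup_tree() {
    new_group=$1
    target_dir=$2

    # Select only entries that belong to you and are not yet in the new group.
    # chgrp -h changes a symbolic link itself instead of the file it points to.
    find "$target_dir" -user "$(id -u)" ! -group "$new_group" \
        -exec chgrp -h "$new_group" {} +
}
```

A call such as regroup_tree agnewgroup "$HOME" would cover your home directory; repeat it for $DATA and $WORK as needed.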

Jobs and Queue

Question: How can I see the status of my jobs in the job queue?

Answer: To list your own jobs in the queue you can use the command

$ squeue -u $USER
     JOBID PARTITION     NAME        USER ST   TIME  NODES NODELIST(REASON)
  12345678    carl.p     JobName abcd1234 PD   0:00      1 (Resources)

The state of the job is shown in the column ST and is typically R for running or PD for pending. Other states are possible but should not last very long (if they do, please contact Scientific Computing). If you add the option -l to the squeue command, the state is written out in full.
squeue --help displays all available options of squeue with a brief description of each.
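With many jobs in the queue, a per-state summary is often easier to read than the full list. A minimal sketch that counts jobs per state from the default squeue output shown above; the helper name count_job_states is an assumption, and the column position only holds as long as job names contain no spaces.

```shell
# Hypothetical helper: count jobs per state from default squeue output on stdin.
count_job_states() {
    # Skip the header line; the fifth column (ST) holds the job state.
    awk 'NR > 1 { count[$5]++ } END { for (s in count) print s, count[s] }' | sort
}
```

On the cluster, the call would be squeue -u $USER | count_job_states.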


Question: My job is in the pending (PD) state, why?

Answer: The squeue command (see the previous question) prints the node list for running jobs and, for pending jobs, the reason in brackets why the job is pending. The most common reasons are Resources and Priority, which basically mean that your job is waiting for the requested resources. Other reasons may show up; here is a list with explanations:

  • Resources: The job is waiting for resources and will start when they become available.
  • Priority: Other jobs have a higher priority; your job will start after them when resources are available.
  • PartitionTimeLimit: Your job has a time limit longer than 21 days and will not start. Change the time limit to 21 days or less.
  • ReqNodeNotAvail: This typically shows up when a downtime for maintenance is scheduled and a reservation is in place. Your job's time limit is longer than the time remaining until the maintenance. Unless you reduce the time limit, your job will start after the maintenance. Note that squeue will also list nodes that are unavailable for other reasons, which can be misleading.

All possible reasons can be found in the SLURM documentation.
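The reasons listed above can also be turned into a small lookup for use in your own scripts. This is only a sketch: the helper name explain_pending_reason and the wording of the messages are made up for this example and cover just the four reasons from the list.

```shell
# Hypothetical helper: print a short explanation for a pending reason.
explain_pending_reason() {
    case $1 in
        Resources)
            echo "Waiting for resources; starts when they become available." ;;
        Priority)
            echo "Other jobs have higher priority; starts after them." ;;
        PartitionTimeLimit)
            echo "Time limit exceeds 21 days; reduce it to 21 days or less." ;;
        ReqNodeNotAvail*)
            echo "Requested nodes are unavailable, typically due to maintenance." ;;
        *)
            echo "Not covered here; see the SLURM documentation." ;;
    esac
}
```

The reason of each pending job can be printed with squeue -u $USER -h -t PD -o '%i %r' (job ID and reason, without header) and then passed to the helper, for example explain_pending_reason Priority.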

Error Messages

Some error messages are hard to understand. Here are some possible solutions:

Question: What does the message "srun: error: PMK_KVS_Barrier duplicate request from task 0" mean?

Answer: You have probably started an Intel MPI application with mpirun while I_MPI_PMI_LIBRARY was set in your environment. Use srun instead of mpirun, or unset I_MPI_PMI_LIBRARY.
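In a job script, the choice between the two launchers can be made from the environment. A minimal sketch; the helper name pick_mpi_launcher and the program name in the usage note are placeholders.

```shell
# Hypothetical helper: choose the MPI launcher based on the environment.
pick_mpi_launcher() {
    # ${VAR+x} expands to "x" only if VAR is set, even if it is empty.
    if [ -n "${I_MPI_PMI_LIBRARY+x}" ]; then
        echo srun
    else
        echo mpirun
    fi
}
```

A job script could then start the program with "$(pick_mpi_launcher)" ./my_mpi_program. Alternatively, run unset I_MPI_PMI_LIBRARY before calling mpirun.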