Difference between revisions of "Brief Introduction to HPC Computing"


Revision as of 14:45, 26 November 2013

A brief introduction to the usage of the HPC facilities at the University of Oldenburg, especially targeted at new and inexperienced HERO and FLOW users, is given below. The introduction is based on various minimal examples that illustrate how to compile non-parallel and parallel programs as well as how to submit jobs via SGE and monitor the status of submitted jobs.

A simple serial program

Example using the GNU compiler collection (gcc)

Consider the following "Hello World!" C program, called myExample.c:

 
#include <stdio.h>

int main(int argc, char *argv[]) {
  fprintf(stdout,"Hello World!\n");
  return 0;
}
  

In brief: once compiled and invoked, it only prints the string "Hello World!" to the standard output stream.

Compiling the program

Once you log in to the system, you are on one of the two nodes hero01 and hero02 for users of HERO, or flow01 and flow02 for users of FLOW. This is where you should compile your programs and from where you should submit your jobs. After login, the GNU compiler collection is loaded by default. Nevertheless, we will explicitly go through the steps needed to load a particular compiler, so let's pretend the compiler we need is not loaded already. In order to use a certain compiler, one needs to load the respective user environment, specified by a particular module.

To get a list of the modules which are loaded currently, just type

 module list

For me, this triggers the output

 Currently Loaded Modulefiles:
  1) shared        2) sge/6.2u5p2

Further, to get a list of all available gcc related modules, you might type

 module avail gcc

to obtain

 ------------------- /cm/shared/modulefiles --------------------
 gcc/4.3.4 gcc/4.6.3 gcc/4.7.1

In the subsequent example we will use gcc/4.7.1 to compile the program above. More information about that particular module can be obtained by typing

 module show gcc/4.7.1

Finally, to load the module just type

 module load gcc/4.7.1

which creates the desired user environment. You can check whether the proper compiler is loaded by typing which gcc. This now yields

 /cm/shared/apps/gcc/4.7.1/bin/gcc

Hence everything worked well and the stage is properly set in order to compile the example program by means of the statement

 gcc myExample.c -o myExample

to yield the executable myExample.
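The check via which gcc can also be folded into a small guard before compiling. The following is only a sketch: the helper name require_tool is hypothetical and not part of the cluster setup, and the compile step is skipped unless both the compiler and the source file are present.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not a cluster-provided command):
# report whether a command is available on the PATH and where it resolves to.
require_tool() {
  local tool="$1"
  local location
  if location=$(command -v "$tool"); then
    echo "$tool -> $location"
    return 0
  fi
  echo "$tool not found; load the matching module first" >&2
  return 1
}

# Compile only if the compiler is reachable and the source file exists.
if require_tool gcc && [ -f myExample.c ]; then
  gcc myExample.c -o myExample
fi
```

If the reported location is not the one expected for the loaded module (here /cm/shared/apps/gcc/4.7.1/bin/gcc), the module list is the first thing to inspect.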

Submitting a job

In order to submit the job via SGE, you might use the following job submission script, called myProg.sge:

 
#!/bin/bash

####### which shell to use
#$ -S /bin/bash

####### change to directory where job was submitted from
#$ -cwd

####### maximum walltime of the job (hh:mm:ss)
#$ -l h_rt=0:10:0

####### memory per job slot
#$ -l h_vmem=300M

####### disk space
#$ -l h_fsize=100M

####### name of the job
#$ -N basic_test

####### merge stdout and stderr
#$ -j y

./myExample
  

The resource allocation statements that are used in the job submission script above are explained here. Now, typing

 qsub myProg.sge

enqueues the job, assigning the jobId 704713 in this case.

Altering resource requirements

If you submitted a job and realize afterwards that you requested inadequate resources, you basically have two choices. You can either delete the job using the command qdel, amend your job submission script and submit the job again, or you can use the command qalter, which allows you to modify the resource list of your job using a statement similar to

 qalter -l h_vmem=2G -l h_fsize=10G -l h_rt=1:00:0 <JobId>

where JobId refers to the unique id of your job. Note that qalter overwrites the full resource list, hence you need to specify all the resource keywords that also appear in your original job submission script.

Checking the status of the job

As soon as the job is enqueued one can check its status by typing qstat. Immediately after submission one might get the output

 job-ID  prior   name       user         state submit/start at     queue                  slots ja-task-ID 
 ---------------------------------------------------------------------------------------------------------
  704713 0.00000 basic_test alxo9476     qw    05/15/2013 18:18:46                            1        

According to this, the job with ID 704713 has priority 0.00000 and resides in state qw, loosely translated to "enqueued and waiting". The output also indicates that the job requires one slot. The column ja-task-ID, which holds the ID of a particular task of a job array, is empty, since we submitted a single job rather than a job array. Soon after, the priority of the job will take a value between 0.5 and 1.0 (usually only slightly above 0.5), slightly increasing until the job starts. Here, after waiting a few seconds, qstat triggers the output

 job-ID  prior   name       user         state submit/start at     queue                  slots ja-task-ID 
 ---------------------------------------------------------------------------------------------------------
  704713 0.50500 basic_test alxo9476     r     05/15/2013 18:19:15 mpc_std_shrt.q@mpcs001     1        

From the name of the queue, here

 mpc_std_shrt.q@mpcs001

one can already infer a lot. Guided by the resources specified in the job submission script, the scheduler assigned the job to the queue instance mpc_std_shrt.q on the host mpcs001, where the job is executed. Note that, depending on the load of the cluster, a submitted job might stay in the waiting state 'qw' for a longer time before it switches to the execution state 'r'.
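Since the queue column has the form <queue>@<host>, the two parts can be separated with plain shell parameter expansion. This is a minimal sketch; the variable names are chosen here for illustration only.

```shell
# Value taken from the qstat output above
queue_instance="mpc_std_shrt.q@mpcs001"

# Strip the longest suffix starting at '@' to keep the queue name
queue_name="${queue_instance%%@*}"
# Strip the shortest prefix ending at '@' to keep the host name
host_name="${queue_instance#*@}"

echo "queue: $queue_name"
echo "host:  $host_name"
```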

Details for finished jobs

After a job has finished one can obtain further information about the resources actually required by the job by using the qacct utility. The only thing one has to provide is the Id of the job. In the current example, the jobId provided by SGE was 704713 and typing

 qacct -j 704713

yields a list of resources actually used by the application:

 
==============================================================
qname        mpc_std_shrt.q      
hostname     mpcs001.mpinet.cluster
group        ifp                 
owner        alxo9476            
project      NONE                
department   defaultdepartment   
jobname      basic_test          
jobnumber    704713              
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Wed May 15 18:18:46 2013
start_time   Wed May 15 18:19:17 2013
end_time     Wed May 15 18:19:20 2013
granted_pe   NONE                
slots        1                   
failed       0    
exit_status  0                   
ru_wallclock 3            
ru_utime     0.025        
ru_stime     0.030        
ru_maxrss    4136                
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    7497                
ru_majflt    14                  
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   0                   
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     404                 
ru_nivcsw    27                  
cpu          0.055        
mem          0.000             
io           0.000             
iow          0.000             
maxvmem      84.773M
arid         undefined
  

A detailed description of the keys can be found in the man page of the SGE accounting file format, available via man accounting.
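When only a few of these keys are of interest, the qacct output can be filtered. The sketch below assumes a hypothetical helper qacct_field and feeds it three lines copied from the listing above, so that it runs without access to SGE; on the cluster one would pipe the output of qacct -j 704713 into it instead. Note that it prints only the first word of a value, so multi-word entries such as qsub_time would be truncated.

```shell
# Hypothetical helper: print the value that follows a given key in
# qacct-style "key value" text read from standard input.
qacct_field() {
  awk -v key="$1" '$1 == key { print $2 }'
}

# Sample lines copied from the listing above (stand-in for qacct -j 704713)
sample='ru_wallclock 3
ru_utime     0.025
maxvmem      84.773M'

echo "wallclock [s]: $(printf '%s\n' "$sample" | qacct_field ru_wallclock)"
echo "peak vmem:     $(printf '%s\n' "$sample" | qacct_field maxvmem)"
```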

Using local (scratch) storage for I/O intense serial jobs (only for HERO users)

Consider a situation where your particular application is rather I/O intense, so that the speed of your program suffers from the amount of I/O operations that strain the global file system. Examples might be irregular I/O patterns at a fast pace or an application that has to create, open, close and delete many files. As a remedy, you might benefit from using a local scratch disk of the execution host on which your program actually runs. This reduces the amount of network traffic and hence the strain on the global file system. The subsequent example illustrates how to access and use the local storage on a given host for the purpose of storing data during the runtime of the program. In the example, after the program terminates, the output data is copied to the working directory from which the job was submitted, and the local file system on the host is cleaned out. To this end, consider the exemplary C program myExample_tempdir.c

 
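#include <stdio.h>

int main(int argc, char *argv[]) {
  FILE *myFile;

  /* the directory my_data must already exist in the working directory */
  myFile=fopen("my_data/myData.out","w");
  if (myFile == NULL) {
    perror("my_data/myData.out");
    return 1;
  }
  fprintf(myFile,"Test output to local scratch directory\n");

  fclose(myFile);
  return 0;
}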
  

which, for the sake of argument (and to fully explain the job submission script below), is contained in the working directory

 $HOME/wmwr/my_examples/tempdir_example/

The program assumes that there is a directory my_data in the current working directory to which the file myData.out with a certain content (here the sequence of characters Test output to local scratch directory) will be written.

In order to compile the program via the current gcc compiler, you could first set the stage by loading the proper modules, e.g.,

 module clear
 module load sge
 module load gcc/4.7.1

and then compile via

  gcc myExample_tempdir.c -o myExample_tempdir

to yield the binary myExample_tempdir.

At this point, bear in mind that we do not want to execute the binary by hand right away! Instead, we leave it to SGE to determine a proper queue instance (guided by the resources we subsequently specify for the job) on a host with at least one free slot, where the job will be executed. A proper job submission script, here called myProg_tempdir.sge, which takes care of creating the folder my_data in a temporary directory on the executing host so that the program myExample_tempdir can store its output there, reads

 
#!/bin/bash

####### which shell to use
#$ -S /bin/bash

####### change to directory where job was submitted from
#$ -cwd

####### maximum walltime of the job (hh:mm:ss)
#$ -l h_rt=0:10:0

####### memory per job slot
#$ -l h_vmem=100M

####### since working with local storage, no need to request disk space

####### name of the job
#$ -N tmpdir_test

####### merge stdout and stderr
#$ -j y

####### change current working directory to the local /scratch/<jobId>.<x>.<qInst>
####### directory, available as TMPDIR on the executing host with HOSTNAME
cd $TMPDIR

####### write details to <jobName>.o<jobId> output file
echo "HOSTNAME = " $HOSTNAME
echo "TMPDIR   = " $TMPDIR

####### create output directory on executing host (parent folder is TMPDIR)
mkdir my_data

####### run program
$HOME/wmwr/my_examples/tempdir_example/myExample_tempdir

####### copy the output to the directory the job was submitted from
cp -a ./my_data $HOME/wmwr/my_examples/tempdir_example/
  

Note that in the above job submission script there is no need to request disk space by setting the resource h_fsize since we are working with local storage provided by the execution host. Submitting the script via

 qsub myProg_tempdir.sge

enqueues the respective job, here having jobId 703914. After successful termination of the job, the folder my_data has been copied to the working directory from which the job was originally submitted. Also, the two job status files tmpdir_test.e703914 and tmpdir_test.o703914 were created, which might contain further details associated with the job. The latter file should contain the name of the host on which the job actually ran and the name of the temporary directory. And indeed, cat tmpdir_test.o703914 reveals the file content

 HOSTNAME =  mpcs001
 TMPDIR   =  /scratch/703914.1.mpc_std_shrt.q

Further, the file my_data/myData.out contains the line

 Test output to local scratch directory

as expected. Note that the temporary directory $TMPDIR (here: /scratch/703914.1.mpc_std_shrt.q) on the execution host (here: mpcs001) is automatically cleaned out. Finally, note that since $TMPDIR is created on a single host, the procedure outlined above works well only if your application runs on a single host. I.e., it is feasible for jobs that either request only a single slot (i.e. non-parallel jobs) or for parallel jobs for which all requested slots fit onto the same host. However, due to the "fill up" allocation rule obeyed by SGE, this cannot be guaranteed in general.
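The scratch workflow described above (work in a temporary directory, then copy the results back) can be tried out without SGE. The following is a sketch under stated assumptions: an echo statement stands in for the real program, a throwaway directory created by mktemp stands in for the scratch directory, and the submission directory is taken from the variable SGE_O_WORKDIR if it is set, falling back to a second throwaway directory otherwise.

```shell
#!/bin/bash
# Stand-ins so the sketch also runs outside SGE (assumptions, see above)
scratch_dir=$(mktemp -d)                      # created below $TMPDIR if that is set
submit_dir="${SGE_O_WORKDIR:-$(mktemp -d)}"   # fallback when not run under SGE

# Do the work in a subshell so the caller's working directory is unchanged.
(
  cd "$scratch_dir" || exit 1
  mkdir -p my_data
  # Stand-in for the real program, which writes my_data/myData.out
  echo "Test output to local scratch directory" > my_data/myData.out
  # Copy results back only if the output directory is not empty
  if [ -n "$(ls -A my_data)" ]; then
    cp -a my_data "$submit_dir"/
  else
    echo "no output produced in $scratch_dir/my_data" >&2
    exit 1
  fi
)
status=$?

rm -rf "$scratch_dir"      # tidy up the stand-in scratch directory
echo "results in $submit_dir/my_data (exit status $status)"
```

Checking that output actually exists before copying makes a failed run visible in the job's output file instead of silently copying an empty folder.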

Note for FLOW: FLOW has no local file system. So this example won't work on FLOW!

Setting up array jobs

Consider a situation where you need to re-run your program several times, possibly for different sets of input data. Then you might benefit from setting up an array job. Here, we will work through a simple example that illustrates how to set up such an array job. To this end, consider the simple C program called myExample_jobArray.c:

 
#include <stdio.h>

int main(int argc, char *argv[]) {
  FILE *myFile;

  myFile=fopen("myData.out","a");
  fprintf(myFile,"%s\n",argv[1]);
  fflush(myFile);
  fclose(myFile);

  return 0;
}
  

In brief: once compiled and invoked, it appends a string (corresponding to the first command-line argument) to the file myData.out. For the sake of argument, say you just logged in to the system and the default gcc compiler is loaded (you can check this by typing: which gcc). You might then compile the above program via

 gcc myExample_jobArray.c -o myExample_jobArray

to yield the executable myExample_jobArray. Further, say you would like to run the program ten times, considering the command line arguments myTask_01 through myTask_10. The first thing you might do in order to set up an array job that summarizes these ten individual jobs is to create an auxiliary file myArgList.txt with the content

 
myTask_01
myTask_02
myTask_03
myTask_04
myTask_05
myTask_06
myTask_07
myTask_08
myTask_09
myTask_10
  

Now, a proper job submission script that works through this file line-by-line is given by the file myProg_jobArray.sge:

 
#!/bin/bash

####### which shell to use
#$ -S /bin/bash

####### change to directory where job was submitted from
#$ -cwd

####### maximum walltime of the job (hh:mm:ss)
#$ -l h_rt=0:10:0

####### memory per job slot
#$ -l h_vmem=300M

####### disk space
#$ -l h_fsize=100M

####### name of the job
#$ -N jobArray_test

####### merge of stdout and stderr
#$ -j y

####### on FLOW you have to uncomment following line!!! Otherwise you block a complete node for a single job.
# #$ -l excl_flow=false

#$ -t 1-10:1 
#$ -tc 2
./myExample_jobArray $(sed -n ${SGE_TASK_ID}'p' myArgList.txt) 
  

Therein, the option -t iniVal-finVal:stepSize makes the variable SGE_TASK_ID successively take values from 1 to 10 in steps of 1, one value per task. Further, the option -tc nJobs ensures that at most nJobs tasks run at a time. Note that FLOW users have to set

 #$ -l excl_flow=false

by uncommenting the line in the example above. Finally, the array job can be submitted as

 qsub myProg_jobArray.sge

this time assigning the jobId 704910. The state of the job can be checked by typing qstat which in this case yields the output

 job-ID  prior   name       user         state submit/start at     queue                  slots ja-task-ID
 ---------------------------------------------------------------------------------------------------------
  704910 0.50500 jobArray_t alxo9476     r     05/16/2013 11:35:03 mpc_std_shrt.q@mpcs001     1 3
  704910 0.50500 jobArray_t alxo9476     r     05/16/2013 11:35:03 mpc_std_shrt.q@mpcs002     1 4
  704910 0.00000 jobArray_t alxo9476     qw    05/16/2013 11:33:52                            1 5-10:1

One can see that only two tasks are running, while the remaining tasks are still enqueued (due to the use of -tc 2 in the job submission script). Further, qstat lists the integer identifier associated with each job array task (shown in the rightmost column). Thus, from the above output it is evident that the tasks with task-ID 3 and 4 are currently running, while tasks 5 through 10 are still enqueued and waiting. Consequently, the tasks with task-ID 1 and 2 are already finished. Once all tasks summarized by the array job are finished, the data file myData.out contains

 
myTask_01
myTask_02
myTask_03
myTask_04
myTask_05
myTask_06
myTask_07
myTask_08
myTask_09
myTask_10
  

Note that if several tasks are processed at a time it might be that they don't finish "in order", i.e., it might happen that, say, task 4 finishes earlier than task 3. In such a (common) situation it might happen that the results listed in the myData.out file are not in the same order as the respective input arguments listed in the myArgList.txt file.
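Both the per-task line lookup and the effect of out-of-order completion can be reproduced without SGE by setting SGE_TASK_ID by hand. The sketch below is an illustration only: it uses a shortened argument list with four entries, a made-up completion order, and a throwaway directory.

```shell
#!/bin/bash
work_dir=$(mktemp -d)

# Argument list in the style of myArgList.txt (shortened to four entries)
printf 'myTask_%02d\n' 1 2 3 4 > "$work_dir/myArgList.txt"

# Pretend the tasks finish in the order 2, 1, 4, 3
for SGE_TASK_ID in 2 1 4 3; do
  # Same lookup as in the job script: print only line number SGE_TASK_ID
  arg=$(sed -n "${SGE_TASK_ID}p" "$work_dir/myArgList.txt")
  echo "$arg" >> "$work_dir/myData.out"
done

# The file reflects the order of completion, not the order of the input
cat "$work_dir/myData.out"

# After sorting, both files agree if every task wrote exactly one line
if [ "$(sort "$work_dir/myData.out")" = "$(sort "$work_dir/myArgList.txt")" ]; then
  echo "all tasks accounted for"
fi
```

If the order of the results matters, sorting the output afterwards, as done in the final check, is one simple option.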

As pointed out above, the utility qacct can be used to retrieve further information on finished jobs. This also holds for array jobs. The individual tasks summarized by an array job are enumerated by means of a task-ID (as discussed above). Consequently, for a given jobId (here: 704910), qacct outputs a list of details for each task processed during the execution of the respective array job. Here, to save space, we only filter for the hostname, task-ID and exit status of the different tasks by writing

 qacct -j 704910 | grep "hostname\|taskid\|exit_status\|="

This yields

 
==============================================================
hostname     mpcs001.mpinet.cluster
taskid       1                   
exit_status  0                   
==============================================================
hostname     mpcs002.mpinet.cluster
taskid       2                   
exit_status  0                   
==============================================================
hostname     mpcs001.mpinet.cluster
taskid       3                   
exit_status  0                   
==============================================================
hostname     mpcs002.mpinet.cluster
taskid       4                   
exit_status  0                   
==============================================================
hostname     mpcs001.mpinet.cluster
taskid       5                   
exit_status  0                   
==============================================================
hostname     mpcs002.mpinet.cluster
taskid       6                   
exit_status  0                   
==============================================================
hostname     mpcs001.mpinet.cluster
taskid       7                   
exit_status  0                   
==============================================================
hostname     mpcs002.mpinet.cluster
taskid       8                   
exit_status  0                   
==============================================================
hostname     mpcs001.mpinet.cluster
taskid       9                   
exit_status  0                   
==============================================================
hostname     mpcs002.mpinet.cluster
taskid       10                  
exit_status  0                   
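For array jobs with many tasks, reading such a listing by eye becomes tedious. As a sketch, the hypothetical helper failed_tasks below prints only the task-IDs whose exit status is non-zero. The sample input follows the format of the listing above, but task 2 has been altered to fail purely for illustration (in the actual run all tasks returned 0); on the cluster one would pipe the output of qacct -j 704910 into the helper.

```shell
# Hypothetical helper: list the task IDs whose exit_status is non-zero.
# Relies on taskid appearing before exit_status within each record.
failed_tasks() {
  awk '$1 == "taskid" { id = $2 }
       $1 == "exit_status" && $2 != 0 { print id }'
}

# Illustrative sample only: the exit status of task 2 is made up
sample='taskid       1
exit_status  0
taskid       2
exit_status  1
taskid       3
exit_status  0'

failed=$(printf '%s\n' "$sample" | failed_tasks)
echo "failed tasks: ${failed:-none}"
```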
  

A simple parallel program

In order to illustrate how to compile, submit and monitor a basic parallel C program that uses the Message Passing Interface (MPI), consider the following code contained in the file myHelloWorld_mpi.c:

 
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {

  int numprocs, rank, namelen;
  char processor_name[MPI_MAX_PROCESSOR_NAME];

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(processor_name, &namelen);

  printf("Process %d (out of %d) on host %s\n",rank, numprocs, processor_name);

  MPI_Finalize();
}  
  

To sum it up: once the program is compiled, submitted with a request for, say, M slots, and finally runs, each process writes a statement specifying its rank and the name of its execution host to the standard output stream.

Note that, in order to compile and submit the program, you might use one out of a variety of parallel environments (PEs). You can get an overview of all possible PEs by typing

 qconf -spl

which, however, is not listed here. Out of the many possible choices (some of them fit rather special needs for certain applications which a typical HPC user does not even have to worry about) we will subsequently consider the PEs openmpi and impi in more detail.

Example using the openmpi parallel environment

So as to use the openmpi parallel environment, you need to load the proper modules prior to compiling the above program (see here). To see which openmpi-modules are available you might type module avail openmpi, yielding

 
------------------- /cm/shared/modulefiles --------------------
openmpi/1.4.3/gcc/64/4.3.4
openmpi/1.4.3/intel/ics/2011.0.013/64
openmpi/1.4.3/intel/ics/64/2011.0.013
openmpi/1.6.2/gcc/64/4.7.1
openmpi/1.6.2-debug/gcc/64/4.7.1
openmpi/gcc/64/1.4.2
openmpi/gcc/64/1.4.3
openmpi/intel/64/10.1.015/1.4.5
openmpi/intel/compiler/64/10.1.015/1.4.5
openmpi/intel/ics/64/2011.0.013/1.4.3
openmpi/open64/64/1.4.2
  

Subsequently we will use openmpi version 1.6.2, built with gcc version 4.7.1.

Compiling the program

To load the proper modules you might type:

 module unload gcc
 module load gcc/4.7.1
 module load openmpi/1.6.2/gcc/64/4.7.1

Note that it might be necessary to unload other modules that result in a conflict, first! More information on module details and possible conflicts can be found by typing, e.g.,

 module show openmpi/1.6.2/gcc/64/4.7.1

Also, a list of all currently loaded modules is available by typing

 module list

Once the proper modules are loaded you might proceed to compile the program via

 mpicc myHelloWorld_mpi.c -o myHelloWorld_openMpi

You might first check whether mpicc indeed refers to the desired compiler by typing which mpicc, which in this case yields

 /cm/shared/apps/openmpi/1.6.2/gcc/64/4.7.1/bin/mpicc

so everything is fine and the stage is properly set!

Submitting a job

In order to submit the job via SGE, specifying a parallel environment (PE) that fits your choice (here: openMpi), you might use the following job submission script, called myProg_openMpi.sge:

 
#!/bin/bash

####### which shell to use
#$ -S /bin/bash

####### change to directory where job was submitted from
#$ -cwd

####### maximum walltime of the job (hh:mm:ss)
#$ -l h_rt=0:10:0

####### memory per job slot
#$ -l h_vmem=1000M

####### disk space
#$ -l h_fsize=1G

####### which parallel environment to use, and number of slots
#$ -pe openmpi 12
# for FLOW users: use following line and please comment the line above out
# #$ -pe openmpi_ib 12

####### enable resource reservation (to prevent starving of parallel jobs)
#$ -R y

####### name of the job
#$ -N openMpi_test

module unload gcc
module load gcc/4.7.1
module load openmpi/1.6.2/gcc/64/4.7.1

# for HERO users
mpirun --mca btl ^openib,ofud -machinefile $TMPDIR/machines -n $NSLOTS ./myHelloWorld_openMpi

# for FLOW users: use following line and please comment the line above out
# mpirun --mca btl openib,sm,self -machinefile $TMPDIR/machines -n $NSLOTS ./myHelloWorld_openMpi
  

Most of the resource allocation statements should look familiar to you (if not, see here). However, note that a few of them are required to ensure a proper submission of parallel jobs. E.g., you need to take care to use the proper PE: in the job submission script this is done by means of the statement

 #$ -pe <parallel_environment> <num_slots>

wherein <parallel_environment> refers to the type of PE that fits your application and where <num_slots> specifies the number of desired slots for the parallel job. Here we decided to use openMpi, hence the proper PE reads openmpi. Further, in the above example 12 slots are requested.

NOTE for FLOW: Please see the comments in the script for FLOW users. Additional hint: on FLOW, parallel jobs should always request a multiple of 12 slots in order to fill up the nodes, which have 12 cores each.
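Rounding a desired process count up to the next multiple of 12 is simple integer arithmetic. In the sketch below, the value of wanted_slots is a made-up example; only the figure of 12 cores per node comes from the note above.

```shell
cores_per_node=12     # value stated in the note above for FLOW nodes
wanted_slots=30       # hypothetical number of processes the application needs

# Integer arithmetic: round up to the next multiple of cores_per_node
slots=$(( (wanted_slots + cores_per_node - 1) / cores_per_node * cores_per_node ))
nodes=$(( slots / cores_per_node ))

echo "request $slots slots ($nodes full nodes)"
```

The resulting value is what would be passed as <num_slots> in the -pe statement.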

Now, typing

 qsub myProg_openMpi.sge

enqueues the job, assigning the jobId 704398 in my case.

Checking the status of the job

Once the job starts to run, it is possible to infer from which hosts the 12 requested slots are accumulated by typing qstat -g t, which in my case yields

 
job-ID  prior   name       user         state submit/start at     queue                  master ja-task-ID 
----------------------------------------------------------------------------------------------------------
 704398 0.50735 openMpi_te alxo9476     r     05/15/2013 09:54:23 mpc_std_shrt.q@mpcs002 MASTER        
                                                                  mpc_std_shrt.q@mpcs002 SLAVE         
                                                                  mpc_std_shrt.q@mpcs002 SLAVE         
 704398 0.50735 openMpi_te alxo9476     r     05/15/2013 09:54:23 mpc_std_shrt.q@mpcs004 SLAVE         
                                                                  mpc_std_shrt.q@mpcs004 SLAVE         
                                                                  mpc_std_shrt.q@mpcs004 SLAVE         
                                                                  mpc_std_shrt.q@mpcs004 SLAVE         
                                                                  mpc_std_shrt.q@mpcs004 SLAVE         
 704398 0.50735 openMpi_te alxo9476     r     05/15/2013 09:54:23 mpc_std_shrt.q@mpcs006 SLAVE         
                                                                  mpc_std_shrt.q@mpcs006 SLAVE         
 704398 0.50735 openMpi_te alxo9476     r     05/15/2013 09:54:23 mpc_std_shrt.q@mpcs008 SLAVE         
                                                                  mpc_std_shrt.q@mpcs008 SLAVE         
                                                                  mpc_std_shrt.q@mpcs008 SLAVE         
  

After the job has terminated successfully, four files were created: openMpi_test.e704398, openMpi_test.o704398, openMpi_test.pe704398 and openMpi_test.po704398. In detail, they contain:

  • openMpi_test.po704398: the hostfile for the job which can be found in the spool directory for the MASTER process (which in this case is mpcs002), reading
 
-catch_rsh /cm/shared/apps/sge/current/default/spool/mpcs002/active_jobs/704398.1/pe_hostfile
mpcs002.mpinet.cluster
mpcs002.mpinet.cluster
mpcs004.mpinet.cluster
mpcs004.mpinet.cluster
mpcs004.mpinet.cluster
mpcs004.mpinet.cluster
mpcs004.mpinet.cluster
mpcs006.mpinet.cluster
mpcs006.mpinet.cluster
mpcs008.mpinet.cluster
mpcs008.mpinet.cluster
mpcs008.mpinet.cluster
  
  • openMpi_test.pe704398: nothing (which is good!)
  • openMpi_test.o704398: the (expected) program output, reading
 
Process 7 (out of 12) on host mpcs006
Process 8 (out of 12) on host mpcs006
Process 6 (out of 12) on host mpcs004
Process 3 (out of 12) on host mpcs004
Process 5 (out of 12) on host mpcs004
Process 2 (out of 12) on host mpcs004
Process 4 (out of 12) on host mpcs004
Process 10 (out of 12) on host mpcs008
Process 9 (out of 12) on host mpcs008
Process 11 (out of 12) on host mpcs008
Process 1 (out of 12) on host mpcs002
Process 0 (out of 12) on host mpcs002
  
  • openMpi_test.e704398: if there are N hosts involved to run your application (here: N=4), there should be N-1 harmless error messages (each consisting of two lines) of the form
 
bash: module: line 1: syntax error: unexpected end of file
bash: error importing function definition for `module'
bash: module: line 1: syntax error: unexpected end of file
bash: error importing function definition for `module'
bash: module: line 1: syntax error: unexpected end of file
bash: error importing function definition for `module'
  

This is a harmless, well known and documented error for the SGE version (6.2u5) used on the local HPC facilities (see here) which you might safely ignore.
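The distribution of slots over hosts recorded in the hostfile listing above can be tallied with standard tools. The sketch below embeds the twelve host names from that listing in a variable so that it runs anywhere; the variable names are chosen for illustration.

```shell
# Host names copied from the hostfile listing above (one line per slot)
hosts='mpcs002.mpinet.cluster
mpcs002.mpinet.cluster
mpcs004.mpinet.cluster
mpcs004.mpinet.cluster
mpcs004.mpinet.cluster
mpcs004.mpinet.cluster
mpcs004.mpinet.cluster
mpcs006.mpinet.cluster
mpcs006.mpinet.cluster
mpcs008.mpinet.cluster
mpcs008.mpinet.cluster
mpcs008.mpinet.cluster'

# Number of slots per host: identical lines are grouped and counted
printf '%s\n' "$hosts" | sort | uniq -c

# Total number of slots and number of distinct hosts
total_slots=$(printf '%s\n' "$hosts" | grep -c .)
num_hosts=$(printf '%s\n' "$hosts" | sort -u | grep -c .)
echo "$total_slots slots on $num_hosts hosts"
```

With four hosts involved, the three pairs of error lines shown above match the N-1 rule stated in the list.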

Example using the impi parallel environment

So as to use the impi parallel environment, you need to load the proper modules prior to compiling the above program (see here). To see which Intel MPI modules are available you might type module avail intel/impi, yielding

 
------------------- /cm/shared/modulefiles --------------------
intel/impi/32/4.0.1.007 intel/impi/4.0.1.007/64
intel/impi/32/4.1.0.024 intel/impi/64/4.0.1.007
intel/impi/4.0.1.007/32 intel/impi/64/4.1.0.024  

Subsequently we will use the 64-bit Intel MPI library with the environment as specified by the module intel/impi/64/4.1.0.024.

Compiling the program

To load the proper modules you might type:

 module unload openmpi
 module load intel/impi/64/4.1.0.024

Note that it might be necessary to unload other modules that result in a conflict, first! More information on module details and possible conflicts can be found by typing, e.g.,

 module show intel/impi/64/4.1.0.024

Also, a list of all currently loaded modules is available by typing

 module list

Once the proper modules are loaded you might proceed to compile the program via

 mpicc myHelloWorld_mpi.c -o myHelloWorld_intelMpi

You might first check whether mpicc indeed refers to the desired compiler by typing which mpicc, which in this case yields

 /cm/shared/apps/intel/ics/2013.0.028/impi/4.1.0.024/intel64/bin/mpicc

so everything is fine and the stage is properly set!

Submitting the program

In order to submit the job via SGE, specifying a parallel environment (PE) that fits your choice (here: impi), you might use the following job submission script, called myProg_intelMpi.sge:

 
#!/bin/bash 

####### which shell to use
#$ -S /bin/bash

####### change to directory where job was submitted from
#$ -cwd

####### maximum walltime of the job (hh:mm:ss)
#$ -l h_rt=0:10:0

####### memory per job slot
#$ -l h_vmem=1000M

####### disk space
#$ -l h_fsize=1G

####### which parallel environment to use, and number of slots
#$ -pe impi41 12

####### enable resource reservation (to prevent starving of parallel jobs)
#$ -R y

####### name of the job
#$ -N intelMpi_test

####### merge stdout and stderr
#$ -j y

module load intel/impi/64/4.1.0.024

# for HERO users
mpirun -bootstrap sge -machinefile $TMPDIR/machines -np $NSLOTS ./myHelloWorld_intelMpi

# for FLOW users: use following line by uncommenting and please comment line above out
# mpirun -bootstrap sge -machinefile $TMPDIR/machines -np $NSLOTS -env I_MPI_FABRICS shm:ofa ./myHelloWorld_intelMpi
  

Most of the resource allocation statements should look familiar to you (if not, see here). However, note that a few of them are required to ensure a proper submission of parallel jobs. E.g., you need to take care to use the proper PE: in the job submission script this is done by means of the statement

 #$ -pe <parallel_environment> <num_slots>

wherein <parallel_environment> refers to the type of PE that fits your application and <num_slots> specifies the number of slots requested for the parallel job. Here we decided to use Intel MPI; hence, the PE in the script reads impi41. Further, in the above example 12 slots are requested.

Now, typing

 qsub myProg_intelMpi.sge

enqueues the job, assigning the jobId 704648 in my case.
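
If you want to use the jobId in a script instead of reading it off the terminal, a small helper might do. The following is only a sketch: it assumes that qsub confirms the submission with a line of the form Your job 704648 ("intelMpi_test") has been submitted (please check the message on your system), and the function name extract_job_id is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: extract the numeric jobId from the confirmation
# message printed by qsub. ASSUMPTION: the message has the form
#   Your job <jobId> ("<jobname>") has been submitted
extract_job_id() {
  # print the third whitespace-separated field if it is purely numeric
  awk '$1 == "Your" && $2 == "job" && $3 ~ /^[0-9]+$/ { print $3; exit }'
}

# Example with a canned message (no cluster needed):
echo 'Your job 704648 ("intelMpi_test") has been submitted' | extract_job_id
```

On the cluster one might then write jobId=$(qsub myProg_intelMpi.sge | extract_job_id) and later pass $jobId to qstat or qacct.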

The parallel environment memory issue

As pointed out above, after a job has finished one can obtain further information about the resources actually required by the job by using the qacct utility. The only thing one has to provide is the Id of the job. To point out one specific issue of using PEs (in particular on HERO), reconsider the openmpi example above. In that example, the jobId provided by SGE was 704398. Typing qacct -j 704398 yields a list of resources actually used by the application:

 
==============================================================
qname        mpc_std_shrt.q      
hostname     mpcs002.mpinet.cluster
group        ifp                 
owner        alxo9476            
project      NONE                
department   defaultdepartment   
jobname      openMpi_test        
jobnumber    704398              
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Wed May 15 09:53:55 2013
start_time   Wed May 15 09:54:24 2013
end_time     Wed May 15 09:54:36 2013
granted_pe   impi41             
slots        12                  
failed       0    
exit_status  0                   
ru_wallclock 12           
ru_utime     0.602        
ru_stime     0.459        
ru_maxrss    20976               
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    51650               
ru_majflt    523                 
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   0                   
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     14798               
ru_nivcsw    1347                
cpu          1.061        
mem          0.030             
io           0.002             
iow          0.000             
maxvmem      775.348M
arid         undefined
  

Consider the value of maxvmem used by the job. At first sight it seems a bit odd, given that the application was a simple "hello world" program! However, if the processes that belong to a job (here: 12 processes) are distributed over several hosts (here: 4 hosts), the MASTER process has to handle all the SLAVE processes. To do so, it has to set up and maintain a connection to each of the remote hosts, which costs memory (easily 150 to 200M per host). Note that these memory requirements accumulate for the MASTER process only; the SLAVE processes need less memory. Hence, if you submit a large parallel job that might be executed on several hosts, you have to allocate sufficient memory so that the MASTER process does not run out of resources. Otherwise the job will be killed.
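
The rule of thumb above can be turned into a rough estimate. The following sketch is not an exact model: the per-host cost (150 to 200M) is only the ballpark figure quoted above, and the function name estimate_master_overhead_mb is hypothetical.

```bash
#!/bin/bash
# Rough estimate (in MB) of the extra memory the MASTER process needs for
# maintaining the connections to the remote hosts.
# ASSUMPTION: a fixed cost per host, taken from the rule of thumb above.
estimate_master_overhead_mb() {
  local n_hosts=$1
  local per_host_mb=${2:-200}   # pessimistic default of 200M per host
  echo $(( n_hosts * per_host_mb ))
}

estimate_master_overhead_mb 4 150   # lower estimate for 4 hosts
estimate_master_overhead_mb 4       # upper estimate for 4 hosts
```

For the job above (4 hosts) this yields 600 and 800, which brackets the observed maxvmem of roughly 775M.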


Example using the smp parallel environment

The smp parallel environment is somewhat special. It requires all the requested slots to be available on a single execution host, see here. Hence, as will be discussed below, special care has to be taken to properly specify the resources needed for a job (NOTE: this section was motivated by the user ruxi6902).

A simple open MP example program

In order to illustrate how to compile, submit and monitor a basic open MP program for use with the smp parallel environment, consider the following code contained in the file myHelloWorld_omp.c:

 
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
  int nThreads, tId;

/* START OF PARALLEL REGION */
/* Fork a team of threads  */
#pragma omp parallel private(nThreads,tId)
  {       
    /* Each thread has own, private tId variable */
    tId = omp_get_thread_num();
    printf("Hello World thread Id %d\n",tId);

    /* the following block of statements is only *
     * executed by the master thread (which has  *
     * tId==0 by default)                        */
    if(tId==0){
      nThreads = omp_get_num_threads();
      printf("Number of threads = %d\n",nThreads);
    }
  } 
/* END OF PARALLEL REGION */

  return 0;
}
  

In summary: Once compiled and run, the program starts off as serial code until the block marked as parallel region is reached. If the program is invoked for use with, say, M slots, the master thread creates a team of M parallel threads, and every thread executes all the code listed in the parallel region. In the above code the open MP library, which implements several useful subroutines for use with open MP, is used to obtain the individual thread Ids (which are private to each thread) and the total number of threads (queried by the master thread). Note that a fairly complete open MP tutorial, from which this example was adapted, is available here.

Compiling and submitting the program

Say you want to use the gcc compiler. Then you might first load the proper module, if it is not loaded by default, via

 module load gcc/4.7.1

creating the desired user environment. The above open MP example program can then be compiled by means of

 gcc -fopenmp -o myHelloWorld_smp myHelloWorld_omp.c

In order to submit the job via SGE, you might use the following job submission script, here called myProg_openMp.sge:

 
#!/bin/bash

####### which shell to use
#$ -S /bin/bash

####### change to directory where job was submitted from
#$ -cwd

####### maximum walltime of the job (hh:mm:ss)
#$ -l h_rt=0:10:0

####### memory per job slot
#$ -l h_vmem=1000M

####### disk space
#$ -l h_fsize=1G

####### which parallel environment to use, and number of slots
#$ -pe smp 5

####### enable resource reservation (to prevent starving of parallel jobs)
#$ -R y

####### name of the job
#$ -N openMp_test

module unload gcc
module load gcc/4.7.1

export OMP_NUM_THREADS=$NSLOTS
./myHelloWorld_smp
  

The resource allocation statements used in the job submission script are explained here. At this point, note that open MP uses environment variables to control the execution of parallel jobs at runtime. Setting the open MP environment variables is as easy as setting any other environment variable; the syntax depends on the shell you are actually using. Above we specified a bash shell, thus, setting a value for the environment variable OMP_NUM_THREADS is done via

 export OMP_NUM_THREADS=$NSLOTS
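
Since NSLOTS is only set inside a running SGE job, the line above leaves OMP_NUM_THREADS empty if the script is executed by hand, e.g. for a quick test on a desktop computer. A minimal sketch of a more defensive variant, assuming that a single thread is an acceptable fallback (the function name set_omp_threads is hypothetical):

```bash
#!/bin/bash
# Use the number of slots granted by SGE if available; otherwise
# (NSLOTS unset or empty) fall back to a single thread.
set_omp_threads() {
  export OMP_NUM_THREADS=${NSLOTS:-1}
}

set_omp_threads
echo "running with OMP_NUM_THREADS=$OMP_NUM_THREADS"
```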

Now, typing

 qsub myProg_openMp.sge

enqueues the job, assigning the jobId 749772 in this case.

As soon as the job is in state running, one can get an idea of where the parallel threads are running. In this regard, the query qstat -g t yields

 
job-ID  prior   name       user         state submit/start at     queue master ja-task-ID 
----------------------------------------------------------------------------------------------------------
 749772 0.50598 openMp_tes alxo9476     r     06/26/2013 16:14:17 mpc_std_shrt.q@mpcs105 MASTER        
                                                                  mpc_std_shrt.q@mpcs105 SLAVE         
                                                                  mpc_std_shrt.q@mpcs105 SLAVE         
                                                                  mpc_std_shrt.q@mpcs105 SLAVE        
                                                                  mpc_std_shrt.q@mpcs105 SLAVE         
                                                                  mpc_std_shrt.q@mpcs105 SLAVE         
  

As it appears, the job is executed on host mpcs105 with all parallel threads running on that single host (as it should be if one uses the smp parallel environment).

After the job has terminated successfully, four files have been created: openMp_test.pe749772, openMp_test.po749772, openMp_test.e749772, and openMp_test.o749772. The first two list the output related to the setup of the parallel environment and the latter two contain the direct output of the submitted program to the standard error and standard output stream, respectively. Only the very last file is non-empty, containing the lines

 Hello World thread Id 2
 Hello World thread Id 3
 Hello World thread Id 4
 Hello World thread Id 0
 Hello World thread Id 1
 Number of threads = 5
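
If a job produces many output files, you might not want to open each of them just to find out that it is empty. The following sketch lists only the non-empty ones; the function name list_nonempty is hypothetical.

```bash
#!/bin/bash
# Print the names of those files (given as arguments) that exist and are
# non-empty; "-s" is the standard shell test for "exists and has size > 0".
list_nonempty() {
  local f
  for f in "$@"; do
    if [ -s "$f" ]; then
      echo "$f"
    fi
  done
}

# Usage inside the submission directory, with the jobId from above:
# list_nonempty openMp_test.pe749772 openMp_test.po749772 \
#               openMp_test.e749772  openMp_test.o749772
```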

Further information about this job can, after its termination, of course be obtained using qacct. In this regard, the query

 qacct -j 749772

yields the output

 
==============================================================
qname        mpc_std_shrt.q      
hostname     mpcs105.mpinet.cluster
group        ifp                 
owner        alxo9476            
project      NONE                
department   defaultdepartment   
jobname      openMp_test         
jobnumber    749772              
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Wed Jun 26 16:13:21 2013
start_time   Wed Jun 26 16:14:18 2013
end_time     Wed Jun 26 16:14:19 2013
granted_pe   smp                 
slots        5                   
failed       0    
exit_status  0                   
ru_wallclock 1            
ru_utime     0.058        
ru_stime     0.043        
ru_maxrss    4124                
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    9494                
ru_majflt    2                   
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   0                   
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     423                 
ru_nivcsw    30                  
cpu          0.101        
mem          0.001             
io           0.000             
iow          0.000             
maxvmem      89.887M
arid         undefined
  

Again, note that by using the smp parallel environment, all slots are allocated on a single execution host (in this case the scheduler sent the job to the host mpcs105).
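
Often only one or two entries of the qacct listing are of interest, e.g. maxvmem. The following sketch filters a single field from such a listing; the function name qacct_field is hypothetical, and only the first word of the value is printed (so it is not suited for multi-word entries such as qsub_time).

```bash
#!/bin/bash
# Hypothetical helper: print the value of one field from qacct-style output
# (one "key value" pair per line), read from standard input.
qacct_field() {
  awk -v key="$1" '$1 == key { print $2; exit }'
}

# Example on an excerpt of the listing above (no cluster needed):
qacct_field maxvmem <<'EOF'
granted_pe   smp
slots        5
maxvmem      89.887M
EOF
```

On the cluster this might be used as qacct -j 749772 | qacct_field maxvmem.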

Further things to note

There are some things to note about the smp parallel environment:

  • Memory requirement for the master thread: Since the job does not run across different execution hosts, the parallel environment memory issue illustrated above is not an issue here. In the previous paragraph you can see that by using 5 slots, the overall memory requirements amount to roughly 90MB (for this basic program).
  • Usage of local scratch storage: Since, by using the smp parallel environment, all slots of a submitted job are allocated on a single host you might consider to use the local scratch storage if you intend to submit I/O intense jobs, as discussed here.
  • Take care to properly request resources: A standard host on HERO (comprising the nodes mpcs001 through mpcs130) offers 12 slots and an overall memory of 24GB. So, if you require more than that you need to request one of the big nodes, see here.

An example regarding the latter point: For the sake of argument, say you submit a job using the resource requirements

 #$ -l h_vmem=5G
 #$ -pe smp 8

I.e., by default you request a standard HERO node with 24GB of memory. However, since h_vmem is specified per slot, your memory requirements amount to 8 x 5GB = 40GB overall, exceeding by far what a standard node offers. In such a case, the last line of the output of the query qstat -j JOB_ID | tail -1, requesting the status of your job with integer specifier JOB_ID, will read

 cannot run in PE "smp" because it only offers 0 slots

which describes what happens but does not hint at a solution to your problem. Now, as a remedy you could

  1. Check your memory requirements and request less than, say, 23GB memory so that the job can be executed on a standard node (bear in mind that not the full 24GB of a node can be requested for your job).
  2. Leave the memory requirements as they are and request one of the big nodes, offering 12 slots and 46GB of memory, see here. You might request such a big node using the additional option #$ -l bignode=true in your job submission script.
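
Before submitting, the overall memory request can be checked with a few lines of shell arithmetic. The following is only a sketch: it works with whole GB, the function name fits_on_node is hypothetical, and the default of 23GB usable memory per standard node is an assumption based on the remark above.

```bash
#!/bin/bash
# h_vmem is requested per slot, so the overall request is h_vmem times the
# number of slots. Returns success (0) if the request fits on the node.
# ASSUMPTION: 23 GB usable memory on a standard node unless stated otherwise.
fits_on_node() {
  local vmem_per_slot_gb=$1 slots=$2 usable_gb=${3:-23}
  local total_gb=$(( vmem_per_slot_gb * slots ))
  echo "total request: ${total_gb}GB"
  [ "$total_gb" -le "$usable_gb" ]
}

fits_on_node 5 8 || echo "does not fit on a standard node"
fits_on_node 2 8 && echo "fits on a standard node"
```

The first call reports a total request of 40GB and fails the check, the second one reports 16GB and passes.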

Tracking memory issues and memory consumption using valgrind

Before you actually compile your program and submit it via SGE, you might perform some checks on your local computer. For the purpose of memory checking there are many programs available. E.g., if your program runs long enough so that you can determine its process ID, you might use the top command to get an idea about its actual (momentary) memory consumption. However, there are more powerful tools that also allow you to determine whether your program exhibits memory leaks, invalid pointer use or things like that. Here, I will illustrate some basic usage of valgrind, a program that offers tools that, e.g., check the memory management of your code.

To this end, consider the following example program (with annotated line numbers), called myExample_malloc.c:

 
  1 #include <stdio.h>
  2 #include <stdlib.h>
  3 
  4 int main(int args, char *argv[]){
  5   int *myArray_a, *myArray_b, *myArray_c;
  6 
  7   myArray_a=(int *) malloc(300*sizeof(int));
  8   myArray_b=(int *) malloc(1000*sizeof(int));
  9   free(myArray_b);
 10   myArray_c=(int *) malloc(200*sizeof(int));
 11   free(myArray_a);
 12   free(myArray_c);
 13 
 14   return 0;
 15 }
  

In brief: once compiled and invoked, it allocates three portions of heap memory and also frees the associated memory before exiting. As it stands, the program has no memory related issues. However, subsequently we will introduce a couple of common errors into the above program so as to build some intuition on how the respective valgrind output might be interpreted.

Compile for elaborate debugging information

In order to be able to better interpret the information provided by valgrind it is good practice to compile your program so that it provides further debugging information. Using gcc this is done by adding the compiler option -g. For the above example you might type

 gcc myExample_malloc.c -o myExample_malloc -g

to obtain the executable myExample_malloc, compiled using further debugging symbols.

Example 1: no memory issues

To cut a (rather) long story short, you might type

 valgrind --tool=memcheck --leak-check=full ./myExample_malloc 

to result in valgrind listing a summary of calls to malloc and free. Here, the output reads:

 
==3314== Memcheck, a memory error detector
==3314== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==3314== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==3314== Command: ./myExample_malloc
==3314== 
==3314== 
==3314== HEAP SUMMARY:
==3314==     in use at exit: 0 bytes in 0 blocks
==3314==   total heap usage: 3 allocs, 3 frees, 6,000 bytes allocated
==3314== 
==3314== All heap blocks were freed -- no leaks are possible
==3314== 
==3314== For counts of detected and suppressed errors, rerun with: -v
==3314== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 4 from 4)
  

The above listing reports no memory issue for the program. The error summary lists 0 errors from 0 contexts, which is good. In general, note that using valgrind will slow down the execution of the program considerably.


Example 2: non-freed memory

Let's intentionally introduce a particular memory leak into the program myExample_malloc.c. To this end, let's comment out line number 11, which is responsible for freeing the memory associated with the array myArray_a:

 11   // free(myArray_a);

Recompiling via

 gcc myExample_malloc.c -o myExample_malloc -g

and, again, using the memcheck tool provided by valgrind via

 valgrind --tool=memcheck --leak-check=full --show-reachable=yes ./myExample_malloc

yields the output

 
==1983== Memcheck, a memory error detector
==1983== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==1983== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==1983== Command: ./myExample_malloc
==1983== 
==1983== 
==1983== HEAP SUMMARY:
==1983==     in use at exit: 1,200 bytes in 1 blocks
==1983==   total heap usage: 3 allocs, 2 frees, 6,000 bytes allocated
==1983== 
==1983== 1,200 bytes in 1 blocks are definitely lost in loss record 1 of 1
==1983==    at 0x4A0610C: malloc (vg_replace_malloc.c:195)
==1983==    by 0x4005D4: main (myExample_malloc.c:7)
==1983== 
==1983== LEAK SUMMARY:
==1983==    definitely lost: 1,200 bytes in 1 blocks
==1983==    indirectly lost: 0 bytes in 0 blocks
==1983==      possibly lost: 0 bytes in 0 blocks
==1983==    still reachable: 0 bytes in 0 blocks
==1983==         suppressed: 0 bytes in 0 blocks
==1983== 
==1983== For counts of detected and suppressed errors, rerun with: -v
==1983== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 4 from 4)
  

Note that, in the heap summary, valgrind reports 3 allocations and only 2 frees, leaving 1,200 bytes in use at exit. In the leak summary below, they are marked as "definitely lost". Right before the leak summary, some details are given regarding which statement led to the lost memory. The respective snippet reads

 
==1983== 1,200 bytes in 1 blocks are definitely lost in loss record 1 of 1
==1983==    at 0x4A0610C: malloc (vg_replace_malloc.c:195)
==1983==    by 0x4005D4: main (myExample_malloc.c:7)
  

indicating that the lost memory was allocated by a call to malloc in line 7. And indeed, this is the line in the program myExample_malloc.c where the memory for the array myArray_a was allocated.
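
If you run such checks regularly, e.g. as part of a test script, you might want to extract the number of lost bytes automatically. The following sketch parses the leak summary of a memcheck log; the function name definitely_lost_bytes is hypothetical, and the log format is assumed to be the one shown above.

```bash
#!/bin/bash
# Hypothetical helper: read a memcheck log from standard input and print the
# number of "definitely lost" bytes reported in the LEAK SUMMARY. Prints 0 if
# there is no such line (e.g. when all heap blocks were freed).
definitely_lost_bytes() {
  awk '/definitely lost:/ { gsub(",", "", $4); n = $4 }
       END { print n + 0 }'
}

# Example on an excerpt of the output above:
definitely_lost_bytes <<'EOF'
==1983== LEAK SUMMARY:
==1983==    definitely lost: 1,200 bytes in 1 blocks
==1983==    indirectly lost: 0 bytes in 0 blocks
EOF
```

Since valgrind writes its report to the standard error stream, a direct use might read valgrind --tool=memcheck --leak-check=full ./myExample_malloc 2>&1 | definitely_lost_bytes.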

Example 3: invalid pointer use

In the program myExample_malloc.c, the array myArray_a is designed to hold 300 values of type int, which can be indexed using myArray_a[i] with i=0...299. Now, to illustrate how valgrind responds to the use of invalid heap memory, let's introduce a further line after line 7 that tries to initialize a value at a location past the end of the respective array:

 8 myArray_a[300]=7;

Recompiling via

 gcc myExample_malloc.c -o myExample_malloc -g

and, again, using the memcheck tool provided by valgrind via

 valgrind --tool=memcheck --leak-check=full --show-reachable=yes ./myExample_malloc

yields the output

 
==4184== Memcheck, a memory error detector
==4184== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==4184== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==4184== Command: ./myExample_malloc
==4184== 
==4184== Invalid write of size 4
==4184==    at 0x4005E3: main (myExample_malloc.c:8)
==4184==  Address 0x4c1e4f0 is 0 bytes after a block of size 1,200 alloc'd
==4184==    at 0x4A0610C: malloc (vg_replace_malloc.c:195)
==4184==    by 0x4005D4: main (myExample_malloc.c:7)
==4184== 
==4184== 
==4184== HEAP SUMMARY:
==4184==     in use at exit: 0 bytes in 0 blocks
==4184==   total heap usage: 3 allocs, 3 frees, 6,000 bytes allocated
==4184== 
==4184== All heap blocks were freed -- no leaks are possible
==4184== 
==4184== For counts of detected and suppressed errors, rerun with: -v
==4184== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 4 from 4)
  

Note that, in the error summary, valgrind reports 1 error. Right before the error summary, some details are given regarding which statement led to the detected error:

 
==4184== Invalid write of size 4
==4184==    at 0x4005E3: main (myExample_malloc.c:8)
==4184==  Address 0x4c1e4f0 is 0 bytes after a block of size 1,200 alloc'd
==4184==    at 0x4A0610C: malloc (vg_replace_malloc.c:195)
==4184==    by 0x4005D4: main (myExample_malloc.c:7)
  

Valgrind reports that an invalid write operation of size 4 (corresponding to an integer on my desktop computer) occurred at line 8 (which is the erroneous line we intentionally added to our code). In this manner, valgrind reports on the use of invalid heap memory.

Monitoring heap memory via massif

Subsequently the valgrind tool massif, a heap profiler that shows how much heap memory a given program actually uses, is illustrated. To this end, consider the above program myExample_malloc.c in its unmodified form and compile it with debugging information. Then, you might use valgrind by specifying the massif heap profiler in the form (the possible command line options can, e.g., be found here)

 valgrind --tool=massif --time-unit=B ./myExample_malloc

This generates a human readable output file, here named massif.out.4941, which might be postprocessed according to

 ms_print massif.out.4941

to yield more details of the heap memory consumption of the input program. The full output reads

 
--------------------------------------------------------------------------------
Command:            ./myExample_malloc
Massif arguments:   --time-unit=B
ms_print arguments: massif.out.4941
--------------------------------------------------------------------------------


    KB
5.094^                               ########################                 
     |                               #                                        
     |                               #                                        
     |                               #                                        
     |                               #                                        
     |                               #                                        
     |                               #                                        
     |                               #                                        
     |                               #                                        
     |                               #                                        
     |                               #                                        
     |                               #                                        
     |                               #                                        
     |                               #                           ::::::::     
     |                               #                           :            
     |                               #                           :            
     |       ::::::::::::::::::::::::#                       :::::            
     |       :                       #                       :   :       :::: 
     |       :                       #                       :   :       :    
     |       :                       #                       :   :       :    
   0
+----------------------------------------------------------------------->KB
     0                                                                   11.77

Number of snapshots: 8
 Detailed snapshots: [3 (peak)]

--------------------------------------------------------------------------------
  n        time(B)         total(B)   useful-heap(B) extra-heap(B)	stacks(B)
--------------------------------------------------------------------------------
  0              0                0                0             0		0
  1          1,208            1,208            1,200             8		0
  2          5,216            5,216            5,200            16		0
  3          5,216            5,216            5,200            16		0
99.69% (5,200B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->76.69% (4,000B) 0x4005E1: main (myExample_malloc.c:8)
| 
->23.01% (1,200B) 0x4005D3: main (myExample_malloc.c:7)
  
--------------------------------------------------------------------------------
  n        time(B)         total(B)   useful-heap(B) extra-heap(B)	stacks(B)
--------------------------------------------------------------------------------
  4          9,224            1,208            1,200             8		0
  5         10,032            2,016            2,000            16		0
  6         11,240              808              800             8		0
  7         12,048                0                0             0		0
  

In the first part, an ASCII histogram gives a visual account of the heap memory consumption. Therein, a vertical bar along the horizontal axis represents a measurement of the memory usage at a given point in time. The chosen time unit is the number of bytes allocated/freed (indicated by the command line option --time-unit=B, which especially fits short running programs; note that other choices are possible as well). The memory peak is indicated by a bar composed of "#" symbols and, below the plot, in the list of detailed snapshots, snapshot number 3 is marked as the peak. In the subsequent list of snapshot details one can see that for snapshot 3 the "useful-heap" memory amounts to 5,200B. This makes sense, since an integer needs 4B, and in our program the memory peak is reached when the arrays myArray_a (300*4B) and myArray_b (1000*4B) are allocated at the same time. Also, note that the whole sequence of "useful-heap" values can be reconstructed by reading through the program myExample_malloc.c.
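
To make the last remark concrete, the sequence of malloc and free calls of myExample_malloc.c can be replayed by hand. The following sketch does just that, assuming sizeof(int) = 4 bytes as on my desktop computer; the function name heap_trace and its input format are hypothetical.

```bash
#!/bin/bash
# Replay a sequence of "malloc <n>" / "free <n>" lines (n = number of ints)
# read from standard input and print the useful-heap size in bytes after
# each call. ASSUMPTION: sizeof(int) = 4.
heap_trace() {
  local heap=0 int_size=4 op n
  while read -r op n; do
    case "$op" in
      malloc) heap=$(( heap + n * int_size )) ;;
      free)   heap=$(( heap - n * int_size )) ;;
    esac
    echo "$heap"
  done
}

# The calls in lines 7 through 12 of myExample_malloc.c:
heap_trace <<'EOF'
malloc 300
malloc 1000
free 1000
malloc 200
free 300
free 200
EOF
```

This prints the values 1200, 5200, 1200, 2000, 800 and 0, in agreement with the "useful-heap" column of snapshots 1, 2, 4, 5, 6 and 7 above.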

Monitoring heap and stack memory via massif

Note that valgrind can be used to monitor not only the heap memory, i.e. the memory addressed by dynamic allocation, but also the stack memory, from which, e.g., the memory for local data is taken. To enable this, the command line option --stacks=yes needs to be set. For the above example this then reads

 valgrind --tool=massif --stacks=yes --time-unit=B ./myExample_malloc

yielding a massif data file, here named massif.out.7450, which might be postprocessed according to

 ms_print massif.out.7450

to obtain more details of the heap and stack memory consumption of the input program. The full output is of course more detailed than the one shown in the preceding paragraph. Here, only the memory consumption graph is shown for illustrative purposes:

 
--------------------------------------------------------------------------------
Command:            ./myExample_malloc
Massif arguments:   --stacks=yes --time-unit=B
ms_print arguments: massif.out.7450
--------------------------------------------------------------------------------


    KB
6.078^                                                              ##        
     |                                                              #         
     |                                                              #         
     |                                                              #         
     |                                                              #         
     |                                                              #         
     |                                                              #         
     |                                                              #         
     |         @                                                    #         
     |      @::@                                                    #         
     |      @ :@                                                    #         
     |  : ::@ :@                                                    #         
     |  : : @ :@:                                                   #         
     |  : : @ :@::                                                  #         
     |  : : @ :@::    :  :: :::  :::: ::   :::::::: @@:::@:::       #         
     |  ::: @ :@::  ::::::::: :::::: :: ::::::: :: :@ :::@::       :# :       
     | :::: @ :@:::@: ::::::: :: ::: :: : ::::: :: :@ :::@::       :# :       
     | :::: @ :@:::@: ::::::: :: ::: :: : ::::: :: :@ :::@::       :# ::      
     | :::: @ :@:::@: ::::::: :: ::: :: : ::::: :: :@ :::@::       :# ::      
     | :::: @ :@:::@: ::::::: :: ::: :: : ::::: :: :@ :::@:: :::::::# ::::::@:
   0
+----------------------------------------------------------------------->KB
     0                                                                   216.2

Number of snapshots: 66
 Detailed snapshots: [5, 8, 12, 37, 41, 51, 52, 53 (peak), 63]
  

Note that now the memory peak, again reached when the arrays myArray_a (300*4B) and myArray_b (1000*4B) are allocated at the same time, corresponds to snapshot number 53. Also note that most stack memory operations occur before the actual allocation/deallocation calls in the main routine of the program are performed.

In this way, valgrind might be used to guide you in the process of requesting memory resources for the jobs you aim to submit.

Debugging

Details on how to debug malfunctioning programs on the HPC system can be found here.

Profiling

Details on how to profile your program using the GNU profiling tool gprof (and sprof in case you want to profile code using shared libraries) can be found here.


Monitoring current resources usage of running job

This section will illustrate how to use the SGE command qrsh in order to determine the actual, current resource usage of a running job. Most likely the resources actually consumed by your application will differ from those that were allocated upon submission. The procedure outlined below might e.g. give you an idea how much memory your application really consumes. This in turn might help you to estimate more fitting resource requirements for future job submissions.

NOTE: The job I used to illustrate this procedure was not owned by me. Before writing this section I asked user diab3109 (the owner of the job) for permission.

Motivation

Consider a situation where you have a running job and you are interested in the precise status of the job. First you might use the SGE command qstat in order to obtain some details about the principal state of the job and the host it was scheduled to run on:

 
> qstat
job-ID  prior   name       user         state submit/start at     queue                  slots ja-task-ID 
---------------------------------------------------------------------------------------------------------
1097375 0.50500 n5E4_0_4   diab3109     r     11/15/2013 11:27:33 mpc_std_long.q@mpcs008     1   
  

Now, for the sake of argument, say you are interested in the amount of memory your application currently consumes. For the subsequent steps it is important to first determine the execution host that handles your computing request. To this end, have a look at the queue instance that was selected to run the job. Here, the queue instance reads mpc_std_long.q@mpcs008. In particular, this tells you that the application is running on host mpcs008. The remaining steps can be summarized as follows:

  • log on to the computing node that hosts your job
  • use the top command to filter for the process ID that corresponds to your job
  • list e.g. the status file indexed by this process ID
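
Two of these steps lend themselves to small shell helpers. The following is a sketch under the assumption that the status file of a process is available as /proc/PID/status on the execution host and contains the keywords VmPeak and VmSize; both function names are hypothetical.

```bash
#!/bin/bash
# Strip everything up to and including the "@" from a queue instance name
# such as mpc_std_long.q@mpcs008 to obtain the execution host.
host_of_queue_instance() {
  echo "${1#*@}"
}

# Print the VmPeak and VmSize lines of a process status file,
# e.g. /proc/<PID>/status on the execution host.
vm_usage() {
  grep -E '^Vm(Peak|Size):' "$1"
}

host_of_queue_instance mpc_std_long.q@mpcs008
```

The call in the last line prints mpcs008. Once logged on to that host and with the process ID at hand, vm_usage /proc/PID/status (with PID replaced by the actual process ID) lists the peak and the current memory usage.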

Start interactive session to log on to execution host

In order to proceed you need to submit an interactive session to SGE via the command qrsh. This will direct you to one of the hosts that are able to run interactive sessions through the particular queue mpc_xtr_ctrl.q. To be more precise, the details of this queue read

 
> qconf -sq mpc_xtr_ctrl.q
qname                 mpc_xtr_ctrl.q
hostlist              @mpcx
seq_no                12700,[mpcs125.mpinet.cluster=12725], \
                      [mpcs126.mpinet.cluster=12726], \
                      [mpcs127.mpinet.cluster=12727], \
                      [mpcs128.mpinet.cluster=12728], \
                      [mpcs129.mpinet.cluster=12729], \
                      [mpcs130.mpinet.cluster=12730]
load_thresholds       NONE
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 INTERACTIVE
ckpt_list             NONE
pe_list               NONE
rerun                 FALSE
slots                 12
tmpdir                /scratch
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            herousers
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  00:10:00
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                50M
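
Most entries of this listing are irrelevant for everyday use. For an interactive session the hard limits h_rt (run time) and h_vmem (memory) matter most. The following sketch filters them out of a queue configuration. The helper name show_hard_limits is an assumption made for this example, and a few lines copied from the listing stand in for the real qconf output:

```shell
#!/bin/sh
# Hypothetical helper: keep only the hard run-time and memory limits
# from a queue configuration read on standard input.
show_hard_limits() {
    awk '$1 == "h_rt" || $1 == "h_vmem" { print $1 "=" $2 }'
}

# On the cluster one would pipe the real listing into the helper:
#   qconf -sq mpc_xtr_ctrl.q | show_hard_limits
# Here a few lines copied from the listing stand in for it.
show_hard_limits <<'EOF'
s_rt                  INFINITY
h_rt                  00:10:00
s_vmem                INFINITY
h_vmem                50M
EOF
```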
  

Hence, you will be directed to one of the hosts mpcs125 through mpcs130. From the keywords h_vmem and h_rt it is evident that such an interactive session is limited to 50 MB of memory and may run for at most 10 minutes. To start the interactive session you simply need to type qrsh. Note that since the requested interactive session is handled by SGE, it might take a short while until the session is granted:

 
alxo9476@hero01:~$ qrsh
Last login: Tue Nov 26 12:39:01 2013 from hero01.mpinet.cluster
alxo9476@mpcs125:~$
  

Apparently I was directed to the particular node mpcs125. From any terminal opened on one of the submission hosts you can verify that the interactive session is recognized by SGE:

 
alxo9476@hero01:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                  slots ja-task-ID 
---------------------------------------------------------------------------------------------------------
1105004 0.50500 QRLOGIN    alxo9476     r     11/26/2013 12:50:12 mpc_xtr_ctrl.q@mpcs125     1        
  

Now, let's log on to the computing node that hosts the job we are interested in monitoring, i.e. node mpcs008. To do so, you might simply type

 
alxo9476@mpcs125:~$ ssh mpcs008
Last login: Tue Nov 26 12:56:10 2013 from mpcs125.cm.cluster
alxo9476@mpcs008:~$
  

If you log on to the node for the very first time you will get a message similar to Warning: Permanently added 'mpcs008,10.141.3.8' (RSA) to the list of known hosts.

Use top to filter for a job's process ID

Once you are logged in to the execution host that processes the job you are interested in, you can use the common Unix tool top to filter for that job. You can restrict the display to the processes of a particular owner by specifying the user name via the -u option, similar to:

 
alxo9476@mpcs008:~$ top -u "diab3109"

top - 13:06:52 up 84 days, 22:06,  1 user,  load average: 10.30, 9.90, 9.86
Tasks: 245 total,  11 running, 234 sleeping,   0 stopped,   0 zombie
Cpu(s): 78.8%us,  0.8%sy,  0.0%ni, 15.7%id,  4.4%wa,  0.1%hi,  0.2%si,  0.0%st
Mem:  24659208k total, 23530528k used,  1128680k free,    19364k buffers
Swap:  1998840k total,    22244k used,  1976596k free, 13396936k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                 
10210 diab3109  25   0 54112  31m  828 R 100.1  0.1  15936:50 n5E4_0_4.out                                                                            
10191 diab3109  22   0 84092 2492 1892 S  0.0  0.0   0:00.00 bash                                                                                     
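
If only the process ID is needed, the process table can also be evaluated non-interactively. The sketch below assumes the column layout of the table shown here (state in column 8, command in the last column). The helper name running_pids is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: from top's process table (read on standard input),
# print PID and COMMAND of every process in state R (running).
running_pids() {
    awk '
        $1 == "PID" { in_table = 1; next }   # header row of the process table
        in_table && $8 == "R" { print $1, $NF }
    '
}

# On the execution host the table could come from top in batch mode:
#   top -b -n 1 -u diab3109 | running_pids
# Below, the table from the session above is used as sample input.
running_pids <<'EOF'
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
10210 diab3109  25   0 54112  31m  828 R 100.1  0.1  15936:50 n5E4_0_4.out
10191 diab3109  22   0 84092 2492 1892 S  0.0  0.0   0:00.00 bash
EOF
```

For the sample input this prints the line 10210 n5E4_0_4.out, i.e. the PID of the running binary.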
  

You can get a more elaborate display of the COMMAND that corresponds to a given process by simply typing c within the running top session. Here the list of processes then expands to

 
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                 
10210 diab3109  25   0 54112  31m  828 R 100.1  0.1  15938:57 ./n5E4_0_4.out                                                                          
10191 diab3109  22   0 84092 2492 1892 S  0.0  0.0   0:00.00 -bash /cm/shared/apps/sge/current/default/spool/mpcs008/job_scripts/1097375    
  

The first process (PID: 10210) refers to the binary that is executed and the second process (PID: 10191) details the job submission script (residing in the spool directory of the execution host) that was used to start the job. You might filter that submission script to see which resources were requested during the submission procedure. E.g. if you are interested in the precise values for h_rt and h_vmem that were allocated you might type

 
alxo9476@mpcs008:~$ cat /cm/shared/apps/sge/current/default/spool/mpcs008/job_scripts/1097375  | grep "#$ -l"
#$ -l longrun=true
#$ -l h_vmem=500M
#$ -l h_fsize=200M
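
The same filtering can be wrapped into a small function that strips the "#$ -l" prefix and leaves only the name=value pairs. This is a sketch only: the helper name resource_requests is an assumption, and the sample job script below is invented for illustration except for its three -l lines, which are taken from the output above.

```shell
#!/bin/sh
# Hypothetical helper: print the resource requests ("#$ -l name=value"
# lines) of an SGE job script, one name=value pair per line.
resource_requests() {
    # -n together with the p flag prints only lines where the
    # substitution succeeded
    sed -n 's/^#\$ *-l *//p' "$1"
}

# Invented sample script; only the three -l lines are from the real job.
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/bash
#$ -S /bin/bash
#$ -l longrun=true
#$ -l h_vmem=500M
#$ -l h_fsize=200M
./n5E4_0_4.out
EOF
resource_requests "$script"
rm -f "$script"
```

On the execution host the function would be called with the path of the spooled job script instead of the temporary file.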
  

So, apparently 500 MB were allocated for the job and the resource option longrun was set to true (explaining why the job was scheduled for an instance of the queue mpc_std_long.q).

At times it is also interesting to see on which CPU a particular process runs. To this end, within an active top session you might simply type in the sequence f j <space> to get an additional column (headed by the letter P) that specifies the CPU for each process. Additionally, by simply typing 1 you get an account of the current usage of all available CPUs. The result might look as listed below:

 
top - 13:19:07 up 84 days, 22:18,  1 user,  load average: 10.11, 10.17, 10.04
Tasks: 245 total,  11 running, 234 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.3%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 97.3%us,  2.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  : 97.7%us,  2.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  : 97.0%us,  3.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  0.0%us,  0.7%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  24659208k total, 23519928k used,  1139280k free,    16752k buffers
Swap:  1998840k total,    22244k used,  1976596k free, 12603656k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+   P COMMAND                                                                              
10210 diab3109  25   0 54112  31m  828 R 100.1  0.1  15949:05 10 n5E4_0_4.out                                                                         
10191 diab3109  22   0 84092 2492 1892 S  0.0  0.0   0:00.00  7 bash              
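
The CPU a process last ran on can also be read without top. On Linux it is stored as field 39 ("processor") of the file /proc/<PID>/stat. The following sketch extracts it. The helper name cpu_from_stat is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the number of the CPU a process last ran on,
# given the contents of its /proc/<PID>/stat file on standard input.
cpu_from_stat() {
    # Remove everything up to the last ") " so that blanks in the command
    # name cannot shift the fields; "processor" is field 39 of the full
    # line and therefore field 37 of the remainder.
    sed 's/.*) //' | awk '{ print $37 }'
}

# On the execution host: cpu_from_stat < /proc/10210/stat
# As a stand-in, query the current shell (Linux only).
if [ -r "/proc/$$/stat" ]; then
    cpu_from_stat < "/proc/$$/stat"
fi
```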
  

Let's briefly clarify the different fields of the CPU usage lines. E.g., CPU 10 (the one hosting process 10210) currently shows

 Cpu10 :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

From left to right the individual fields read:

  • %us = CPU usage by user processes
  • %sy = CPU usage by system and kernel processes
  • %ni = CPU usage by processes without standard priority (ni stands for nice)
  • %id = CPU is idle
  • %wa = time during which CPU had to wait due to I/O access
  • %hi = CPU usage due to hardware interrupts
  • %si = CPU usage due to software interrupts
  • %st = time stolen from this CPU by a hypervisor (relevant on virtual machines only)

As a rule of thumb: a high value of %us and a small value of %id are good. For the job we are interested in, these values read 100%us and 0%id. In contrast, a large value of %wa would e.g. point to high disk activity, which could be due to extensive swapping.
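
To check a single field, e.g. %wa, without reading the whole line, the value can be cut out of a CPU usage line. This is a sketch under the assumption that the line has the format shown above. The helper name cpu_field is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the value of one field (us, sy, id, wa, ...)
# from a single "CpuN : ..." line of top, read on standard input.
cpu_field() {
    awk -v key="$1" '{
        n = split($0, parts, /[:,]/)
        for (i = 1; i <= n; i++)
            if (parts[i] ~ ("%" key "$")) {
                sub(/%.*/, "", parts[i])   # drop the "%xx" suffix
                gsub(/ /, "", parts[i])    # drop padding blanks
                print parts[i]
            }
    }'
}

line='Cpu10 :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st'
echo "us=$(printf '%s\n' "$line" | cpu_field us) wa=$(printf '%s\n' "$line" | cpu_field wa)"
```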

List a job's status file

To get an idea about the current memory usage of your running job and the maximal amount of memory used so far, you can use the process ID (PID) of a job and list its status file located in the standard /proc folder (see here) on the execution host. Within that folder, the status file can be found in a subfolder named after the PID, see:

 
alxo9476@mpcs008:~$ cat /proc/10210/status 
Name:	n5E4_0_4.out
State:	R (running)
SleepAVG:	0%
Tgid:	10210
Pid:	10210
PPid:	10191
TracerPid:	0
Uid:	20438	20438	20438	20438
Gid:	12400	12400	12400	12400
FDSize:	256
Groups:	12000 12400 12402 21000 50126 
VmPeak:	   54112 kB
VmSize:	   54112 kB
VmLck:	       0 kB
VmHWM:	   31860 kB
VmRSS:	   31832 kB
VmData:	   43592 kB
VmStk:	      88 kB
VmExe:	      36 kB
VmLib:	    2168 kB
VmPTE:	     108 kB
StaBrk:	03066000 kB
Brk:	03087000 kB
StaStk:	7fffffffe140 kB
Threads:	1
SigQ:	0/212992
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000000000000
SigCgt:	00000000000040b6
CapInh:	0000000000000000
CapPrm:	0000000000000000
CapEff:	0000000000000000
Cpus_allowed:	00000000,00000000,00000000,00000000,00000000,00000000,00000000,00ffffff
Mems_allowed:	00000000,00000003
  

This status file is simply a summary of stat (yielding the process status) and statm (yielding the memory usage of the process) in human-readable form. An explanation of all the keywords listed therein is provided here. Most importantly, the peak and the current virtual memory size are given by the keywords VmPeak and VmSize, respectively. Here, the job uses approximately 54 MB, which is far less than the 500 MB allocated during the submission procedure. However, the allocated amount of 500 MB is still smaller than the default value of 1.2 GB. A job as slim as this one barely blocks any resources that might be needed by other users. Hence, the observed discrepancy between allocated and actually used memory can be tolerated without complaints.
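
The comparison between used and allocated memory can be done in one step. The sketch below reads VmPeak and VmSize from a status file and relates the peak value to a limit given in MB. The helper name vm_report is hypothetical, and 1 MB is taken as 1024 kB here, so the figures come out slightly below the rounded 54 MB quoted above:

```shell
#!/bin/sh
# Hypothetical helper: report current and peak virtual memory from a
# status file (read on standard input) relative to a limit given in MB.
vm_report() {
    awk -v limit_mb="$1" '
        $1 == "VmPeak:" { peak_kb = $2 }
        $1 == "VmSize:" { size_kb = $2 }
        END {
            printf "VmSize %.1f MB, VmPeak %.1f MB, %.1f%% of %d MB\n",
                   size_kb / 1024, peak_kb / 1024,
                   100 * peak_kb / 1024 / limit_mb, limit_mb
        }
    '
}

# On the execution host: vm_report 500 < /proc/10210/status
# Here the two relevant lines of the status file above serve as input.
vm_report 500 <<'EOF'
VmPeak:    54112 kB
VmSize:    54112 kB
EOF
```

For the job at hand the peak amounts to roughly a tenth of the 500 MB requested via h_vmem.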