Difference between revisions of "Brief Introduction to HPC Computing"
Albensoeder (talk | contribs) |
(12 intermediate revisions by one other user not shown) | |||
Line 489: | Line 489: | ||
####### which parallel environment to use, and number of slots | ####### which parallel environment to use, and number of slots | ||
#$ -pe openmpi 12 | #$ -pe openmpi 12 | ||
####### enable resource reservation (to prevent starving of parallel jobs) | ####### enable resource reservation (to prevent starving of parallel jobs) | ||
#$ -R y | #$ -R y | ||
Line 1,265: | Line 1,262: | ||
= Monitoring current | = Monitoring the current resource usage for a running job = | ||
This section will illustrate how to use the SGE command <tt>qrsh</tt> | This section will illustrate how to use the SGE command <tt>qrsh</tt> to start an interactive session to determine the actual, current resource usage of a running job. | ||
Most likely the resources actually consumed by your application will differ from those that were allocated upon submission. | Most likely the resources actually consumed by your application will differ from those that were allocated upon submission. | ||
The procedure outlined below might e.g. give you an idea how much ''memory'' your application really consumes. This in turn might help | The procedure outlined below might e.g. give you an idea how much ''memory'' your application really consumes. This in turn might help | ||
Line 1,273: | Line 1,270: | ||
'''NOTE:''' The job I used to illustrate this procedure was not owned by me. In advance to writing this | '''NOTE:''' The job I used to illustrate this procedure was not owned by me. In advance to writing this | ||
section I asked user diab3109 (the owner of the job) for permission. | section I asked user <tt>diab3109</tt> (i.e. the owner of the job) for permission. | ||
== Motivation == | == Motivation == | ||
Line 1,383: | Line 1,380: | ||
If you logon to the node for the very first time you will get a message similar to <tt>Warning: Permanently added 'mpcs008,10.141.3.8' (RSA) to the list of known hosts.</tt> | If you logon to the node for the very first time you will get a message similar to <tt>Warning: Permanently added 'mpcs008,10.141.3.8' (RSA) to the list of known hosts.</tt> | ||
== Use <tt>top</tt> to filter for | == Use <tt>top</tt> to filter for a jobs process ID == | ||
Once you are logged in to the execution host that processes the job you are interested in, you can use the common unix tool <tt>top</tt> to filter for that job. | Once you are logged in to the execution host that processes the job you are interested in, you can use the common unix tool <tt>top</tt> to filter for that job. | ||
You can precisely filter for those processes that belong to a particular owner by specifying his/her user name using the <tt>-u </tt> option similar to: | You can precisely filter for those processes that belong to a particular owner by specifying his/her user name using the <tt>-u </tt> option similar to: | ||
Line 1,444: | Line 1,441: | ||
10191 diab3109 22 0 84092 2492 1892 S 0.0 0.0 0:00.00 7 bash | 10191 diab3109 22 0 84092 2492 1892 S 0.0 0.0 0:00.00 7 bash | ||
</nowiki> | </nowiki> | ||
Lets briefly clarify the different fields of the CPU usage lines. E.g., CPU 10 (the one hosting process 10210) currently shows | |||
Cpu10 :100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st | |||
From left to right the individual fields read: | |||
* %us = CPU usage by processes from user | |||
* %sy = CPU usage by system and kernel processes | |||
* %ni = CPU usage by processes without standard priority (ni stands for nice) | |||
* %id = CPU is idle | |||
* %wa = time during which CPU had to wait due to I/O access | |||
* %hi = CPU usage due to hardware interrupts | |||
* %si = CPU usage due to software interrupts | |||
To support intuition: a high value of %us and a small value of %id are good. For the job we are interested in the | |||
respective values read 100%us and 0%id, respectively. In contrast to this, a large value of %wa would e.g. point out a | |||
high disk activity which could be due to extensive swapping. | |||
== List a jobs status file == | |||
So as to get an idea about the current memory usage of your running job and the maximal amount of memory used so far, you can | |||
use the process id (PID) of a job and list its status file located in the standard <tt>/proc</tt> folder (see [http://man7.org/linux/man-pages/man5/proc.5.html here]) on the execution host. Within | |||
that folder, the status file can be found in a subfolder named after the PID, see: | |||
<nowiki> | |||
alxo9476@mpcs008:~$ cat /proc/10210/status | |||
Name: n5E4_0_4.out | |||
State: R (running) | |||
SleepAVG: 0% | |||
Tgid: 10210 | |||
Pid: 10210 | |||
PPid: 10191 | |||
TracerPid: 0 | |||
Uid: 20438 20438 20438 20438 | |||
Gid: 12400 12400 12400 12400 | |||
FDSize: 256 | |||
Groups: 12000 12400 12402 21000 50126 | |||
VmPeak: 54112 kB | |||
VmSize: 54112 kB | |||
VmLck: 0 kB | |||
VmHWM: 31860 kB | |||
VmRSS: 31832 kB | |||
VmData: 43592 kB | |||
VmStk: 88 kB | |||
VmExe: 36 kB | |||
VmLib: 2168 kB | |||
VmPTE: 108 kB | |||
StaBrk: 03066000 kB | |||
Brk: 03087000 kB | |||
StaStk: 7fffffffe140 kB | |||
Threads: 1 | |||
SigQ: 0/212992 | |||
SigPnd: 0000000000000000 | |||
ShdPnd: 0000000000000000 | |||
SigBlk: 0000000000000000 | |||
SigIgn: 0000000000000000 | |||
SigCgt: 00000000000040b6 | |||
CapInh: 0000000000000000 | |||
CapPrm: 0000000000000000 | |||
CapEff: 0000000000000000 | |||
Cpus_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00ffffff | |||
Mems_allowed: 00000000,00000003 | |||
</nowiki> | |||
This status file simply is a summary of <tt>stat</tt> (yielding the process status) and <tt>statm</tt> (yielding further information about the process status) in human readable form. An explanation | |||
of all the keywords listed therein is provided [http://wiki.directi.com/display/tu/Understanding+Processes+in+Linux here]. Most importantly, the peak memory usage and the current memory usage | |||
are summarized by the keywords <tt>VmPeak</tt> and <tt>VmSize</tt>. Here, the job uses approximately 54Mb, which is way less than the amount of 500Mb allocated during the submission procedure. | |||
However, the allocated amount of 500Mb is still smaller than the default value of 1.2Gb. A job as slim as this one barely blocks any resources that might be needed by other users. Hence, the observed discrepancy | |||
between allocated memory and actually used memory can be tolerated without complaints. | |||
Once you are done, you might simply log out from the execution host using <tt>exit</tt>. This will return you to the host you where directed to upon submission of the interactive session (here: <tt>mpcs125</tt>). Exiting again | |||
will return you to the submission host you initially started from. |
Latest revision as of 17:28, 22 January 2015
A brief introduction to the usage of the HPC facilities at the University of Oldenburg, especially targeted at new and inexperienced HERO and FLOW users, is given below. The introduction is based on various minimal examples that illustrate how to compile non-parallel and parallel programs as well as how to submit jobs via SGE and monitor the status of submitted jobs.
A simple serial program
Example using the GNU compiler collection (gcc)
Consider the following "Hello World!" C program, called myExample.c:
#include <stdio.h>

int main(int argc, char *argv[])
{
  fprintf(stdout,"Hello World!\n");
  return 0;
}
In brief: once compiled and invoked, it only prints the string "Hello World!" to the standard output stream.
Compiling the program
Once you log in to the system, you are on one of the two nodes hero01 and hero02 (for users of HERO) or flow01 and flow02 (for users of FLOW). This is where you should compile your programs and from where you should submit your jobs. After login, the GNU compiler collection is loaded by default. Nevertheless, we will explicitly go through the steps needed to load a certain compiler, so let's pretend the compiler we need is not loaded already. In order to use a certain compiler, one needs to load the respective user environment, specified by a particular module.
To get a list of the modules which are loaded currently, just type
module list
For me, this triggers the output
Currently Loaded Modulefiles:
  1) shared   2) sge/6.2u5p2
Further, to get a list of all available gcc related modules, you might type
module avail gcc
to obtain
------------------- /cm/shared/modulefiles --------------------
gcc/4.3.4 gcc/4.6.3 gcc/4.7.1
In the subsequent example we will use gcc/4.7.1 to compile the program above. More information about that particular module can be obtained by typing
module show gcc/4.7.1
Finally, to load the module just type
module load gcc/4.7.1
which creates the desired user environment. You can check whether the proper compiler is loaded by typing which gcc. This now yields
/cm/shared/apps/gcc/4.7.1/bin/gcc
Hence everything worked well and the stage is properly set in order to compile the example program by means of the statement
gcc myExample.c -o myExample
to yield the executable myExample
Submitting a job
In order to submit the job via SGE, you might use the following job submission script, called myProg.sge:
#!/bin/bash

####### which shell to use
#$ -S /bin/bash

####### change to directory where job was submitted from
#$ -cwd

####### maximum walltime of the job (hh:mm:ss)
#$ -l h_rt=0:10:0

####### memory per job slot
#$ -l h_vmem=300M

####### disk space
#$ -l h_fsize=100M

####### name of the job
#$ -N basic_test

####### merge stdout and stderr
#$ -j y

./myExample
The resource allocation statements that are used in the job submission script above are explained here. Now, typing
qsub myProg.sge
enqueues the job, assigning the jobId 704713 in this case.
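To get a feeling for the walltime format hh:mm:ss used by h_rt, the following minimal sketch converts such a string into seconds. The helper name hrt_to_seconds is made up for this illustration and is not part of SGE.

```shell
#!/bin/bash
# hrt_to_seconds is a hypothetical helper, not an SGE command.
hrt_to_seconds() {
    # Split the hh:mm:ss string on ":" and let awk do the arithmetic,
    # which also copes with leading zeros such as "08".
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# The limit requested in myProg.sge (ten minutes).
hrt_to_seconds 0:10:0    # prints 600
```

A job that runs longer than the requested number of seconds risks being terminated by the scheduler, so it can be worth checking this value against the expected runtime.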
Altering resource requirements
If you submitted a job and realize afterwards that you requested inadequate resources, you basically have two choices. You can either delete the job using the command qdel, amend your job submission script and submit the job again, or you can use the command qalter, which allows you to modify the resource list of your job using a statement similar to
qalter -l h_vmem=2G -l h_fsize=10G -l h_rt=1:00:0 <JobId>
where JobId refers to the unique id of your job. Note that qalter overwrites the full resource list, hence you need to specify all the resource keywords that also appear in your original job submission script.
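Since qalter replaces the whole resource list, it can help to first collect every resource keyword from the original script. The sketch below does this with sed; the helper name list_resources is hypothetical, and the sketch assumes that each request sits on its own line of the form "#$ -l key=value".

```shell
#!/bin/bash
# list_resources is a hypothetical helper, not an SGE command.
list_resources() {
    # Print the argument of every "#$ -l key=value" directive in a job script.
    sed -n 's/^#\$ -l[[:space:]]*//p' "$1"
}

# Demo on a shortened copy of myProg.sge written to a temporary file.
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/bash
#$ -cwd
#$ -l h_rt=0:10:0
#$ -l h_vmem=300M
#$ -l h_fsize=100M
./myExample
EOF
list_resources "$script"
rm -f "$script"
```

Each printed line would then reappear as a -l option of the qalter call, with the values adjusted as needed.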
Checking the status of the job
As soon as the job is enqueued one can check its status by typing qstat. Immediately after submission one might get the output
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 704713 0.00000 basic_test alxo9476     qw    05/15/2013 18:18:46                                    1
According to this, the job with ID 704713 has priority 0.00000 and resides in state qw, loosely translated as "enqueued and waiting". The above output also indicates that the job requires one slot. The column ja-task-ID, which holds the ID of a particular task of an array job, is empty since we submitted a single job rather than an array job. Soon after, the priority of the job will take a value between 0.5 and 1.0 (usually only slightly above 0.5), slightly increasing until the job starts. Here, after waiting a few seconds, qstat triggers the output
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 704713 0.50500 basic_test alxo9476     r     05/15/2013 18:19:15 mpc_std_shrt.q@mpcs001             1
From the name of the queue, here
mpc_std_shrt.q@mpcs001
one can already infer a lot. Guided by the resources specified in the job submission script, the scheduler assigned the job to the queue instance mpc_std_shrt.q on the host mpcs001, where the job is executed. Note that, depending on the load of the cluster, the submitted job might remain in the waiting state 'qw' for a longer time before it switches to the execution state 'r'.
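If only the state of one particular job is of interest, the fifth column of the qstat output can be picked out with awk. This is a minimal sketch: the helper name job_state is made up, and it assumes the column layout shown in the qstat output above.

```shell
#!/bin/bash
# job_state is a hypothetical helper, not an SGE command.
job_state() {
    # $1 = job ID; the qstat output is read from standard input.
    # Column 5 holds the state (e.g. qw = waiting, r = running).
    awk -v id="$1" '$1 == id { print $5 }'
}

# Sample line taken from the qstat output above. On the cluster one would
# pipe the live output instead:  qstat | job_state 704713
echo " 704713 0.00000 basic_test alxo9476 qw 05/15/2013 18:18:46 1" | job_state 704713
```

For an array job the helper prints one state per listed line, since all tasks share the same job ID.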
Details for finished jobs
After a job has finished one can obtain further information about the resources actually required by the job by using the qacct utility. The only thing one has to provide is the Id of the job. In the current example, the jobId provided by SGE was 704713 and typing
qacct -j 704713
yields a list of resources actually used by the application:
==============================================================
qname        mpc_std_shrt.q
hostname     mpcs001.mpinet.cluster
group        ifp
owner        alxo9476
project      NONE
department   defaultdepartment
jobname      basic_test
jobnumber    704713
taskid       undefined
account      sge
priority     0
qsub_time    Wed May 15 18:18:46 2013
start_time   Wed May 15 18:19:17 2013
end_time     Wed May 15 18:19:20 2013
granted_pe   NONE
slots        1
failed       0
exit_status  0
ru_wallclock 3
ru_utime     0.025
ru_stime     0.030
ru_maxrss    4136
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    7497
ru_majflt    14
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     404
ru_nivcsw    27
cpu          0.055
mem          0.000
io           0.000
iow          0.000
maxvmem      84.773M
arid         undefined
A detailed description of the keys can be found by looking up the man page of the SGE accounting tool via man accounting.
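Often only one or two of these keys matter, e.g. maxvmem when checking whether the requested h_vmem was reasonable. The following sketch extracts a single value from the key/value listing; the helper name acct_field is an assumption made for this illustration.

```shell
#!/bin/bash
# acct_field is a hypothetical helper, not an SGE command.
acct_field() {
    # $1 = key (e.g. maxvmem); "qacct -j <jobId>" output is read from stdin.
    awk -v key="$1" '$1 == key { print $2 }'
}

# Two lines copied from the qacct output above. On the cluster one would
# use:  qacct -j 704713 | acct_field maxvmem
printf 'ru_wallclock 3\nmaxvmem      84.773M\n' | acct_field maxvmem    # prints 84.773M
```

Here the job used roughly 85M, well below the 300M requested via h_vmem in the submission script.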
Using local (scratch) storage for I/O intense serial jobs (only for HERO users)
Consider a situation where your particular application is rather I/O intense so that the speed of your program suffers from the amount of I/O operations that strain the global file system. Examples might be irregular I/O patterns at a fast pace or an application that has to create, open, close and delete many files. As a remedy, you might benefit from using a local scratch disk of the execution host on which your program actually runs. This reduces the amount of network traffic and hence the strain on the global file system. The subsequent example illustrates how to access and use the local storage on a given host for the purpose of storing data during the runtime of the program. In the example, after the program terminates, the output data is copied to the working directory from which the job was submitted and the local file system on the host is cleaned out. To this end, consider the exemplary C program myExample_tempdir.c
#include <stdio.h>

int main(int argc, char *argv[])
{
  FILE *myFile;

  myFile=fopen("my_data/myData.out","w");
  /* fopen fails if the directory my_data does not exist */
  if (myFile==NULL) {
    perror("my_data/myData.out");
    return 1;
  }
  fprintf(myFile,"Test output to local scratch directory\n");
  fclose(myFile);
  return 0;
}
which, for the sake of argument (and to fully explain the job submission script below), is contained in the working directory
$HOME/wmwr/my_examples/tempdir_example/
The program assumes that there is a directory my_data in the current working directory to which the file myData.out with a certain content (here the sequence of characters Test output to local scratch directory) will be written.
In order to compile the program via the current gcc compiler, you could first set the stage by loading the proper modules, e.g.,
module clear
module load sge
module load gcc/4.7.1
and then compile via
gcc myExample_tempdir.c -o myExample_tempdir
to yield the binary myExample_tempdir.
At this point bear in mind that we do not want to execute the binary by hand right away! Instead, we would like to leave it to SGE to determine a proper queue instance (guided by the resources we subsequently will specify for the job) on a host with at least one free slot, where the job will be executed. A proper job submission script, here called myProg_tempdir.sge, that takes care of creating the folder my_data needed by the program myExample_tempdir in order to store its output in a temporary directory on the executing host reads
#!/bin/bash

####### which shell to use
#$ -S /bin/bash

####### change to directory where job was submitted from
#$ -cwd

####### maximum walltime of the job (hh:mm:ss)
#$ -l h_rt=0:10:0

####### memory per job slot
#$ -l h_vmem=100M

####### since working with local storage, no need to request disk space

####### name of the job
#$ -N tmpdir_test

####### merge stdout and stderr
#$ -j y

####### change current working directory to the local /scratch/<jobId>.<x>.<qInst>
####### directory, available as TMPDIR on the executing host with HOSTNAME
cd $TMPDIR

####### write details to <jobName>.o<jobId> output file
echo "HOSTNAME = " $HOSTNAME
echo "TMPDIR = " $TMPDIR

####### create output directory on executing host (parent folder is TMPDIR)
mkdir my_data

####### run program
$HOME/wmwr/my_examples/tempdir_example/myExample_tempdir

####### copy the output to the directory the job was submitted from
cp -a ./my_data $HOME/wmwr/my_examples/tempdir_example/
Note that in the above job submission script there is no need to request disk space by setting the resource h_fsize since we are working with local storage provided by the execution host. Submitting the script via
qsub myProg_tempdir.sge
enqueues the respective job, here having jobId 703914. After successful termination of the job, the folder my_data is copied to the working directory from which the job was originally submitted. Also, the two job status files tmpdir_test.e703914 and tmpdir_test.o703914 were created, which might contain further details associated with the job. The latter file should contain the name of the host on which the job actually ran and the name of the temporary directory. And indeed, cat tmpdir_test.o703914 reveals the file content
HOSTNAME = mpcs001
TMPDIR = /scratch/703914.1.mpc_std_shrt.q
Further, the file my_data/myData.out contains the line
Test output to local scratch directory
as expected. Note that the temporary directory $TMPDIR (here: /scratch/703914.1.mpc_std_shrt.q) on the execution host (here: mpcs001) is automatically cleaned out. Finally, note that since $TMPDIR is created on a single host, the procedure outlined above works well only if your application runs on a single host. I.e., it is feasible for jobs that either request only a single slot (i.e. non-parallel jobs) or for parallel jobs for which all requested slots fit onto the same host. However, due to the "fill up" allocation rule obeyed by SGE, this cannot be guaranteed in general.
Note for FLOW: FLOW has no local file system. So this example won't work on FLOW!
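The copy-back step is the critical part of this pattern, because everything left in $TMPDIR is removed once the job ends. The sketch below isolates that step so it can be tried on any machine: two temporary directories stand in for the scratch directory and the submission directory, and the helper name stage_out is made up for this illustration.

```shell
#!/bin/bash
# stage_out is a hypothetical helper, not an SGE command.
stage_out() {
    # $1 = result directory on local scratch, $2 = destination directory.
    # Returns a non-zero status if the destination cannot be created or
    # the copy fails, so that the caller can react to it.
    mkdir -p "$2" && cp -a "$1" "$2"/
}

scratch=$(mktemp -d)   # stands in for $TMPDIR on the execution host
dest=$(mktemp -d)      # stands in for the submission directory

mkdir "$scratch/my_data"
echo "Test output to local scratch directory" > "$scratch/my_data/myData.out"

stage_out "$scratch/my_data" "$dest" || echo "copy-back failed" >&2
cat "$dest/my_data/myData.out"

rm -rf "$scratch" "$dest"
```

In a real job script the first argument would be a folder below $TMPDIR and the second the directory the job was submitted from.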
Setting up array jobs
Consider a situation where you need to re-run your program several times, possibly for different sets of input data. Then you might benefit from setting up an array job. Here, we will work through a simple example that illustrates how to set up such an array job. To this end, consider the simple C program called myExample_jobArray.c:
#include <stdio.h>

int main(int argc, char *argv[])
{
  FILE *myFile;

  myFile=fopen("myData.out","a");
  fprintf(myFile,"%s\n",argv[1]);
  fflush(myFile);
  fclose(myFile);
  return 0;
}
In brief: once compiled and invoked, it appends a string (corresponding to the first command-line argument) to the file myData.out. For the sake of argument, say you just logged in to the system and the default gcc compiler is loaded (you can check this by typing: which gcc). You might then compile the above program via
gcc myExample_jobArray.c -o myExample_jobArray
to yield the executable myExample_jobArray. Further, say you would like to run the program ten times, considering the command line arguments myTask_01 through myTask_10. The first thing you might do in order to set up an array job that summarizes these ten individual jobs is to create an auxiliary file myArgList.txt with the content
myTask_01
myTask_02
myTask_03
myTask_04
myTask_05
myTask_06
myTask_07
myTask_08
myTask_09
myTask_10
Now, a proper job submission script that works through this file line-by-line is given by the file myProg_jobArray.sge:
#!/bin/bash

####### which shell to use
#$ -S /bin/bash

####### change to directory where job was submitted from
#$ -cwd

####### maximum walltime of the job (hh:mm:ss)
#$ -l h_rt=0:10:0

####### memory per job slot
#$ -l h_vmem=300M

####### disk space
#$ -l h_fsize=100M

####### name of the job
#$ -N jobArray_test

####### merge of stdout and stderr
#$ -j y

####### on FLOW you have to uncomment following line!!! Otherwise you block a complete node for a single job.
# #$ -l excl_flow=false

#$ -t 1-10:1
#$ -tc 2

./myExample_jobArray $(sed -n ${SGE_TASK_ID}'p' myArgList.txt)
Therein, the option -t iniVal-finVal:stepSize lets the variable SGE_TASK_ID successively take values from iniVal to finVal in steps of stepSize (here: from 1 to 10 in steps of 1). Further, the option -tc nJobs ensures that at most nJobs tasks run at a time. Note that FLOW users have to set
#$ -l excl_flow=false
by uncommenting the line in the example above. Finally, the array job can be submitted as
qsub myProg_jobArray.sge
this time assigning the jobId 704910. The state of the job can be checked by typing qstat which in this case yields the output
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 704910 0.50500 jobArray_t alxo9476     r     05/16/2013 11:35:03 mpc_std_shrt.q@mpcs001             1 3
 704910 0.50500 jobArray_t alxo9476     r     05/16/2013 11:35:03 mpc_std_shrt.q@mpcs002             1 4
 704910 0.00000 jobArray_t alxo9476     qw    05/16/2013 11:33:52                                    1 5-10:1
One can see that only two tasks are running, while the remaining tasks are still enqueued (due to the use of -tc 2 in the job submission script). Further, qstat lists the integer identifier associated with each job array task (shown in the rightmost column). Thus, from the above output it is evident that the tasks with task-ID 3 and 4 are currently running, while tasks 5 through 10 are still enqueued and waiting. Consequently, the tasks with task-ID 1 and 2 are already finished. Once all tasks summarized by the array job are finished, the data file myData.out contains
myTask_01
myTask_02
myTask_03
myTask_04
myTask_05
myTask_06
myTask_07
myTask_08
myTask_09
myTask_10
Note that if several tasks are processed at a time it might be that they don't finish "in order", i.e., it might happen that, say, task 4 finishes earlier than task 3. In such a (common) situation it might happen that the results listed in the myData.out file are not in the same order as the respective input arguments listed in the myArgList.txt file.
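The mapping from task ID to command-line argument, and the effect of tasks finishing out of order, can be tried without SGE. The sketch below is an illustration only: the helper name task_arg is made up, and the temporary files merely stand in for myArgList.txt and myData.out.

```shell
#!/bin/bash
# task_arg is a hypothetical helper mirroring the sed call in the job script.
task_arg() {
    # $1 = task ID (what SGE provides as SGE_TASK_ID), $2 = argument file.
    # sed -n "<N>p" prints only line N of the file.
    sed -n "${1}p" "$2"
}

args=$(mktemp)    # stands in for myArgList.txt
printf 'myTask_%02d\n' 1 2 3 4 5 6 7 8 9 10 > "$args"
task_arg 3 "$args"    # prints myTask_03

# Tasks may append to the output file in any order. For this particular
# naming scheme a plain sort restores the order of the input list.
out=$(mktemp)     # stands in for myData.out
printf 'myTask_02\nmyTask_01\nmyTask_03\n' > "$out"
sort "$out"

rm -f "$args" "$out"
```

If the results cannot be sorted by name, writing the task ID next to each result is a simple way to keep them attributable.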
As pointed out above, the utility qacct can be used to retrieve further information on finished jobs. This also holds for array jobs. The individual jobs summarized by an array job are enumerated by means of a job task-ID (as discussed above). Consequently, for a given jobId (here: 704910), qacct outputs a list of details for each task processed during the execution of the respective array job. Here, to save space, we only filter for the hostname, task-ID and exit status of the different tasks by writing
qacct -j 704910 | grep "hostname\|taskid\|exit_status\|="
This yields
==============================================================
hostname     mpcs001.mpinet.cluster
taskid       1
exit_status  0
==============================================================
hostname     mpcs002.mpinet.cluster
taskid       2
exit_status  0
==============================================================
hostname     mpcs001.mpinet.cluster
taskid       3
exit_status  0
==============================================================
hostname     mpcs002.mpinet.cluster
taskid       4
exit_status  0
==============================================================
hostname     mpcs001.mpinet.cluster
taskid       5
exit_status  0
==============================================================
hostname     mpcs002.mpinet.cluster
taskid       6
exit_status  0
==============================================================
hostname     mpcs001.mpinet.cluster
taskid       7
exit_status  0
==============================================================
hostname     mpcs002.mpinet.cluster
taskid       8
exit_status  0
==============================================================
hostname     mpcs001.mpinet.cluster
taskid       9
exit_status  0
==============================================================
hostname     mpcs002.mpinet.cluster
taskid       10
exit_status  0
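For larger array jobs it is more convenient to list only those tasks that did not exit cleanly. The following sketch does this with awk. The helper name failed_tasks is hypothetical, the sketch assumes that taskid precedes exit_status within each record (as in the qacct output shown earlier), and the failing task in the sample input is invented for demonstration.

```shell
#!/bin/bash
# failed_tasks is a hypothetical helper, not an SGE command.
failed_tasks() {
    # Reads "qacct -j <jobId>" output from stdin, remembers the most recent
    # taskid and prints it whenever the following exit_status is non-zero.
    awk '$1 == "taskid"                 { tid = $2 }
         $1 == "exit_status" && $2 != 0 { print tid }'
}

# Invented sample: task 2 is made to fail here for illustration only.
printf 'taskid 1\nexit_status 0\ntaskid 2\nexit_status 1\ntaskid 3\nexit_status 0\n' | failed_tasks
```

On the cluster the input would come from qacct -j 704910; for the run documented above the helper would print nothing, since every task exited with status 0.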
A simple parallel program
In order to illustrate how to compile, submit and monitor a basic parallel C program that uses the Message Passing Interface (MPI), consider the following code contained in the file myHelloWorld_mpi.c:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int numprocs, rank, namelen;
  char processor_name[MPI_MAX_PROCESSOR_NAME];

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(processor_name, &namelen);

  printf("Process %d (out of %d) on host %s\n",rank, numprocs, processor_name);

  MPI_Finalize();
}
To sum it up: once the program is compiled and run with a number of, say, M slots, each process writes a statement specifying its rank and the name of its execution host to the standard output stream.
Note that, in order to compile and submit the program, you might use one out of a variety of parallel environments (PEs). You can get an overview of all possible PEs by typing
qconf -spl
which, however, is not listed here. Out of the many possible choices (some of them fit rather special needs for certain applications which a typical HPC user does not even have to worry about) we will subsequently consider the PEs openmpi and impi in more detail.
Example using the openmpi parallel environment
So as to use the openmpi parallel environment, you need to load the proper modules prior to compiling the above program (see here). To see which openmpi-modules are available you might type module avail openmpi, yielding
------------------- /cm/shared/modulefiles --------------------
openmpi/1.4.3/gcc/64/4.3.4
openmpi/1.4.3/intel/ics/2011.0.013/64
openmpi/1.4.3/intel/ics/64/2011.0.013
openmpi/1.6.2/gcc/64/4.7.1
openmpi/1.6.2-debug/gcc/64/4.7.1
openmpi/gcc/64/1.4.2
openmpi/gcc/64/1.4.3
openmpi/intel/64/10.1.015/1.4.5
openmpi/intel/compiler/64/10.1.015/1.4.5
openmpi/intel/ics/64/2011.0.013/1.4.3
openmpi/open64/64/1.4.2
Subsequently we will use openmpi version 1.6.2, built with gcc version 4.7.1.
Compiling the program
To load the proper modules you might type:
module unload gcc
module load gcc/4.7.1
module load openmpi/1.6.2/gcc/64/4.7.1
Note that it might be necessary to first unload other modules that result in a conflict! More information on module details and possible conflicts can be found by typing, e.g.,
module show openmpi/1.6.2/gcc/64/4.7.1
Also, a list of all currently loaded modules is available by typing
module list
Once the proper modules are loaded you might proceed to compile the program via
mpicc myHelloWorld_mpi.c -o myHelloWorld_openMpi
You might first check whether mpicc indeed refers to the desired compiler by typing which mpicc, which in this case yields
/cm/shared/apps/openmpi/1.6.2/gcc/64/4.7.1/bin/mpicc
so everything is fine and the stage is properly set!
Submitting a job
In order to submit the job via SGE, specifying a parallel environment (PE) that fits your choice (here: openMpi), you might use the following job submission script, called myProg_openMpi.sge:
#!/bin/bash

####### which shell to use
#$ -S /bin/bash

####### change to directory where job was submitted from
#$ -cwd

####### maximum walltime of the job (hh:mm:ss)
#$ -l h_rt=0:10:0

####### memory per job slot
#$ -l h_vmem=1000M

####### disk space
#$ -l h_fsize=1G

####### which parallel environment to use, and number of slots
#$ -pe openmpi 12

####### enable resource reservation (to prevent starving of parallel jobs)
#$ -R y

####### name of the job
#$ -N openMpi_test

module unload gcc
module load gcc/4.7.1
module load openmpi/1.6.2/gcc/64/4.7.1

# for HERO users
mpirun --mca btl ^openib,ofud -machinefile $TMPDIR/machines -n $NSLOTS ./myHelloWorld_openMpi

# for FLOW users: use following line and please comment the line above out
# mpirun --mca btl openib,sm,self -machinefile $TMPDIR/machines -n $NSLOTS ./myHelloWorld_openMpi
Most of the resource allocation statements should look familiar to you (if not, see here). However, note that a few of them are required to ensure a proper submission of parallel jobs. E.g., you need to take care to use the proper PE: in the job submission script this is done by means of the statement
#$ -pe <parallel_environment> <num_slots>
wherein <parallel_environment> refers to the type of PE that fits your application and where <num_slots> specifies the number of desired slots for the parallel job. Here we decided to use openMpi, hence the proper PE reads openmpi. Further, in the above example a number of 12 slots is requested.
NOTE for FLOW: Please see the comments in the script for FLOW users. Additional hint: On FLOW, parallel jobs should always request a multiple of 12 slots in order to fill up the nodes, which have 12 cores each.
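The rounding implied by this hint is easy to get wrong by one node, so here is a minimal sketch of the arithmetic. The helper name round_up_slots is made up for this illustration; the node size of 12 cores is taken from the note above.

```shell
#!/bin/bash
# round_up_slots is a hypothetical helper, not an SGE command.
round_up_slots() {
    # $1 = number of slots the application needs, $2 = cores per node.
    # Integer division rounds down, so adding (cores - 1) first rounds up.
    echo $(( ( $1 + $2 - 1 ) / $2 * $2 ))
}

round_up_slots 30 12    # prints 36, i.e. three full nodes
```

The result is the value one would pass as <num_slots> in the -pe statement.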
Now, typing
qsub myProg_openMpi.sge
enqueues the job, assigning the jobId 704398 in my case.
Checking the status of the job
Once the job starts to run, it is possible to infer from which hosts the 12 requested slots are accumulated by typing qstat -g t, which in my case yields
job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID
------------------------------------------------------------------------------------------------------------------
 704398 0.50735 openMpi_te alxo9476     r     05/15/2013 09:54:23 mpc_std_shrt.q@mpcs002         MASTER
                                                                  mpc_std_shrt.q@mpcs002         SLAVE
                                                                  mpc_std_shrt.q@mpcs002         SLAVE
 704398 0.50735 openMpi_te alxo9476     r     05/15/2013 09:54:23 mpc_std_shrt.q@mpcs004         SLAVE
                                                                  mpc_std_shrt.q@mpcs004         SLAVE
                                                                  mpc_std_shrt.q@mpcs004         SLAVE
                                                                  mpc_std_shrt.q@mpcs004         SLAVE
                                                                  mpc_std_shrt.q@mpcs004         SLAVE
 704398 0.50735 openMpi_te alxo9476     r     05/15/2013 09:54:23 mpc_std_shrt.q@mpcs006         SLAVE
                                                                  mpc_std_shrt.q@mpcs006         SLAVE
 704398 0.50735 openMpi_te alxo9476     r     05/15/2013 09:54:23 mpc_std_shrt.q@mpcs008         SLAVE
                                                                  mpc_std_shrt.q@mpcs008         SLAVE
                                                                  mpc_std_shrt.q@mpcs008         SLAVE
After the job has terminated successfully, four files were created: openMpi_test.e704398, openMpi_test.o704398, openMpi_test.pe704398 and openMpi_test.po704398. In detail they contain:
- openMpi_test.po704398: the hostfile for the job which can be found in the spool directory for the MASTER process (which in this case is mpcs002), reading
-catch_rsh /cm/shared/apps/sge/current/default/spool/mpcs002/active_jobs/704398.1/pe_hostfile
mpcs002.mpinet.cluster
mpcs002.mpinet.cluster
mpcs004.mpinet.cluster
mpcs004.mpinet.cluster
mpcs004.mpinet.cluster
mpcs004.mpinet.cluster
mpcs004.mpinet.cluster
mpcs006.mpinet.cluster
mpcs006.mpinet.cluster
mpcs008.mpinet.cluster
mpcs008.mpinet.cluster
mpcs008.mpinet.cluster
- openMpi_test.pe704398: nothing (which is good!)
- openMpi_test.o704398: the (expected) program output, reading
Process 7 (out of 12) on host mpcs006
Process 8 (out of 12) on host mpcs006
Process 6 (out of 12) on host mpcs004
Process 3 (out of 12) on host mpcs004
Process 5 (out of 12) on host mpcs004
Process 2 (out of 12) on host mpcs004
Process 4 (out of 12) on host mpcs004
Process 10 (out of 12) on host mpcs008
Process 9 (out of 12) on host mpcs008
Process 11 (out of 12) on host mpcs008
Process 1 (out of 12) on host mpcs002
Process 0 (out of 12) on host mpcs002
- openMpi_test.e704398: if there are N hosts involved in running your application (here: N=4), there should be N-1 harmless error messages (each consisting of two lines) of the form
bash: module: line 1: syntax error: unexpected end of file
bash: error importing function definition for `module'
bash: module: line 1: syntax error: unexpected end of file
bash: error importing function definition for `module'
bash: module: line 1: syntax error: unexpected end of file
bash: error importing function definition for `module'
This is a harmless, well known and documented error for the SGE version (6.2u5) used on the local HPC facilities (see here) which you might safely ignore.
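To see at a glance how the slots were distributed over the hosts, the host names in the .po file can be counted. This is a minimal sketch: the helper name slots_per_host is made up, and it assumes one host name per line, preceded by a single option line starting with a dash as in the file listed above.

```shell
#!/bin/bash
# slots_per_host is a hypothetical helper, not an SGE command.
slots_per_host() {
    # Drop the leading "-catch_rsh ..." line, then count how often each
    # host name occurs and print "<host> <count>".
    grep -v '^-' | sort | uniq -c | awk '{ print $2, $1 }'
}

# Content of openMpi_test.po704398 as listed above.
slots_per_host <<'EOF'
-catch_rsh /cm/shared/apps/sge/current/default/spool/mpcs002/active_jobs/704398.1/pe_hostfile
mpcs002.mpinet.cluster
mpcs002.mpinet.cluster
mpcs004.mpinet.cluster
mpcs004.mpinet.cluster
mpcs004.mpinet.cluster
mpcs004.mpinet.cluster
mpcs004.mpinet.cluster
mpcs006.mpinet.cluster
mpcs006.mpinet.cluster
mpcs008.mpinet.cluster
mpcs008.mpinet.cluster
mpcs008.mpinet.cluster
EOF
```

The counts (2, 5, 2 and 3) add up to the 12 slots requested via the -pe statement and match the host names reported in the program output.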
Example using the impi parallel environment
So as to use the impi parallel environment, you need to load the proper modules prior to compiling the above program (see here). To see which intel mpi modules are available you might type module avail intel/impi, yielding
------------------- /cm/shared/modulefiles --------------------
intel/impi/32/4.0.1.007 intel/impi/4.0.1.007/64
intel/impi/32/4.1.0.024 intel/impi/64/4.0.1.007
intel/impi/4.0.1.007/32 intel/impi/64/4.1.0.024
Subsequently we will use the 64-bit Intel MPI library with the environment as specified by the module intel/impi/64/4.1.0.024.
Compiling the program
To load the proper modules you might type:
module unload openmpi
module load intel/impi/64/4.1.0.024
Note that it might be necessary to first unload other modules that would result in a conflict! More information on module details and possible conflicts can be found by typing, e.g.,
module show intel/impi/64/4.1.0.024
Also, a list of all currently loaded modules is available by typing
module list
Once the proper modules are loaded you might proceed to compile the program via
mpicc myHelloWorld_mpi.c -o myHelloWorld_impi
You might first check whether mpicc indeed refers to the desired compiler by typing which mpicc, which in this case yields
/cm/shared/apps/intel/ics/2013.0.028/impi/4.1.0.024/intel64/bin/mpicc
so everything is fine and the stage is properly set!
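This check can also be scripted, e.g. at the top of a build script. The sketch below is an illustration only: the function name resolves_under is hypothetical, and the prefix in the usage line is taken from the path shown above.

```bash
#!/bin/bash
# Hypothetical helper: succeeds if the command given as $1 resolves to a path
# that starts with the prefix given as $2.
resolves_under() {
    local path
    path="$(command -v "$1")" || return 1   # command not found at all
    case "$path" in
        "$2"*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Usage:
#   resolves_under mpicc /cm/shared/apps/intel || echo "unexpected mpicc in PATH" >&2
```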
Submitting the program
In order to submit the job via SGE, specifying a parallel environment (PE) that fits your choice (here: impi), you might use the following job submission script, called myProg_intelMpi.sge:
#!/bin/bash

####### which shell to use
#$ -S /bin/bash

####### change to directory where job was submitted from
#$ -cwd

####### maximum walltime of the job (hh:mm:ss)
#$ -l h_rt=0:10:0

####### memory per job slot
#$ -l h_vmem=1000M

####### disk space
#$ -l h_fsize=1G

####### which parallel environment to use, and number of slots
#$ -pe impi41 12

####### enable resource reservation (to prevent starving of parallel jobs)
#$ -R y

####### name of the job
#$ -N intelMpi_test

####### merge stdout and stderr
#$ -j y

module load intel/impi/64/4.1.0.024

# for HERO users
mpirun -bootstrap sge -machinefile $TMPDIR/machines -np $NSLOTS ./myHelloWorld_impi

# for FLOW users: use following line by uncommenting and please comment line above out
# mpirun -bootstrap sge -machinefile $TMPDIR/machines -np $NSLOTS -env I_MPI_FABRICS shm:ofa ./myHelloWorld_impi
Most of the resource allocation statements should look familiar to you (if not, see here). However, note that a few of them are required to ensure a proper submission of parallel jobs. E.g., you need to take care to use the proper PE: in the job submission script this is done by means of the statement
#$ -pe <parallel_environment> <num_slots>
wherein <parallel_environment> refers to the type of PE that fits your application and where <num_slots> specifies the number of desired slots for the parallel job. Here we decided to use Intel MPI, hence the proper PE reads impi41. Further, in the above example a number of 12 slots is requested.
Now, typing
qsub myProg_intelMpi.sge
enqueues the job, assigning the jobId 704648 in my case.
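Since the jobId is needed for later queries, you might want to capture it in a script. The following sketch assumes that qsub confirms a submission with a message of the form Your job 704648 ("intelMpi_test") has been submitted; please check the message printed on your system first. The function name job_id_from_qsub is made up for this illustration.

```bash
#!/bin/bash
# Hypothetical helper: extracts the numeric job ID from the confirmation
# message of qsub, read from stdin.
# ASSUMED message format: Your job 704648 ("intelMpi_test") has been submitted
job_id_from_qsub() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Usage:
#   jobId=$(qsub myProg_intelMpi.sge | job_id_from_qsub)
#   qstat -j "$jobId"
```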
The parallel environment memory issue
As pointed out above, after a job has finished one can obtain further information about the resources actually required by the job by using the qacct utility. The only thing one has to provide is the Id of the job. To point out one specific issue of using PEs (in particular on HERO), reconsider the openmpi example above. In that example, the jobId provided by SGE was 704398. Typing qacct -j 704398 yields a list of resources actually used by the application:
==============================================================
qname        mpc_std_shrt.q
hostname     mpcs002.mpinet.cluster
group        ifp
owner        alxo9476
project      NONE
department   defaultdepartment
jobname      openMpi_test
jobnumber    704398
taskid       undefined
account      sge
priority     0
qsub_time    Wed May 15 09:53:55 2013
start_time   Wed May 15 09:54:24 2013
end_time     Wed May 15 09:54:36 2013
granted_pe   impi41
slots        12
failed       0
exit_status  0
ru_wallclock 12
ru_utime     0.602
ru_stime     0.459
ru_maxrss    20976
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    51650
ru_majflt    523
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     14798
ru_nivcsw    1347
cpu          1.061
mem          0.030
io           0.002
iow          0.000
maxvmem      775.348M
arid         undefined
Consider the value of maxvmem used by the job. At first sight this seems a bit odd, given that the application was a simple "hello world" program! However, if the processes that belong to a job (here: 12 processes) are distributed over several hosts (here: 4 hosts), the MASTER process has to handle all the SLAVE processes. To this end it has to set up and maintain a connection to each of the remote hosts, which definitely costs some memory (easily 150 to 200M per host). Note that these memory requirements accumulate for the MASTER process only; the SLAVE processes need less memory. Hence, if one submits a large parallel job which might be executed on several hosts, one has to allocate sufficient memory so that the MASTER process does not run out of resources. Otherwise the job will be killed.
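To compare the memory actually used with the memory requested, you might pull the maxvmem entry out of the qacct listing. The sketch below simply filters the two-column output shown above; the function name maxvmem_from_qacct is made up for this illustration.

```bash
#!/bin/bash
# Hypothetical helper: prints the value of the maxvmem entry from the output
# of "qacct -j <jobId>", read from stdin.
maxvmem_from_qacct() {
    awk '$1 == "maxvmem" { print $2 }'
}

# Usage (jobId of the openmpi example above):
#   qacct -j 704398 | maxvmem_from_qacct
```

For the listing above this prints 775.348M, which can then be compared with the h_vmem value requested per slot.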
Example using the smp parallel environment
The smp parallel environment is somewhat special. It requires all the requested slots to be available on a single execution host, see here. Hence, as will be discussed below, special care has to be taken to properly specify the resources needed for a job (NOTE: this section was motivated by the user ruxi6902).
A simple OpenMP example program
In order to illustrate how to compile, submit and monitor a basic OpenMP program for use with the smp parallel environment, consider the following code contained in the file myHelloWorld_omp.c:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
  int nThreads, tId;

  /* START OF PARALLEL REGION */
  /* Fork a team of threads */
  #pragma omp parallel private(nThreads,tId)
  {
    /* Each thread has own, private tId variable */
    tId = omp_get_thread_num();
    printf("Hello World thread Id %d\n",tId);

    /* the following block of statements is only
     * executed by the master thread (which has
     * tId==0 by default) */
    if(tId==0){
      nThreads = omp_get_num_threads();
      printf("Number of threads = %d\n",nThreads);
    }
  }
  /* END OF PARALLEL REGION */

  return 0;
}
In summary:
Once compiled and run, the program starts off as serial code until the block marked as parallel region is met. If the program is invoked for use with, say, M slots, the master thread creates a team of M parallel threads. Further, every thread executes all the code listed in the parallel region. In the above code the OpenMP library, which implements several useful subroutines, is used to obtain the individual thread Ids (which are private to each thread) and the total number of threads (read from the master thread). Note that a fairly complete OpenMP tutorial, from which this example was adapted, is available here.
Compiling and submitting the program
Say you want to use the gcc compiler. Then you might first load the proper module, if it is not loaded by default, via
module load gcc/4.7.1
creating the desired user environment. The above OpenMP example program can then be compiled by means of
gcc -fopenmp -o myHelloWorld_smp myHelloWorld_omp.c
In order to submit the job via SGE, you might use the following job submission script, here called myProg_openMP.sge:
#!/bin/bash

####### which shell to use
#$ -S /bin/bash

####### change to directory where job was submitted from
#$ -cwd

####### maximum walltime of the job (hh:mm:ss)
#$ -l h_rt=0:10:0

####### memory per job slot
#$ -l h_vmem=1000M

####### disk space
#$ -l h_fsize=1G

####### which parallel environment to use, and number of slots
#$ -pe smp 5

####### enable resource reservation (to prevent starving of parallel jobs)
#$ -R y

####### name of the job
#$ -N openMp_test

module unload gcc
module load gcc/4.7.1

export OMP_NUM_THREADS=$NSLOTS
./myHelloWorld_smp
The resource allocation statements used in the job submission script are explained here. At this point, note that OpenMP uses environment variables to control the execution of parallel jobs at runtime. Setting the OpenMP environment variables is as easy as setting any other environment variable; the syntax depends on the shell you are actually using. Above we specified a bash shell, thus a value for the environment variable OMP_NUM_THREADS is set via
export OMP_NUM_THREADS=$NSLOTS
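Note that NSLOTS is only set by SGE inside a running job. If you also want to test the same lines on your local computer, a default value might be supplied. The following is a sketch under the assumption that a single thread is a sensible fallback outside of SGE:

```bash
#!/bin/bash
# Inside an SGE job NSLOTS is set by the scheduler. Outside of SGE it is
# unset, so fall back to one thread (ASSUMPTION: adjust to your needs).
export OMP_NUM_THREADS="${NSLOTS:-1}"
echo "running with ${OMP_NUM_THREADS} thread(s)"

# The program itself would then be started as in the job script above:
#   ./myHelloWorld_smp
```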
Now, typing
qsub myProg_openMP.sge
enqueues the job, assigning the jobId 749772 in this case.
As soon as the job is in state running, one can get an idea of where the parallel threads are running. In this regard the query qstat -g t yields
job-ID  prior   name       user         state submit/start at     queue                  master ja-task-ID
----------------------------------------------------------------------------------------------------------
 749772 0.50598 openMp_tes alxo9476     r     06/26/2013 16:14:17 mpc_std_shrt.q@mpcs105 MASTER
                                                                  mpc_std_shrt.q@mpcs105 SLAVE
                                                                  mpc_std_shrt.q@mpcs105 SLAVE
                                                                  mpc_std_shrt.q@mpcs105 SLAVE
                                                                  mpc_std_shrt.q@mpcs105 SLAVE
                                                                  mpc_std_shrt.q@mpcs105 SLAVE
As it appears, the job is executed on host mpcs105 with all parallel threads running on that single host (as it should be if one uses the smp parallel environment).
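For jobs with many slots, such a listing quickly becomes hard to read. As a sketch, the slots per host might be counted as shown below. It assumes the layout of the listings in this section, i.e. that every slot shows up as one line ending in SLAVE, preceded by the queue instance in the form queue@host. The function name slots_per_host is made up for this illustration.

```bash
#!/bin/bash
# Hypothetical helper: counts the SLAVE lines per host in a per-slot qstat
# listing read from stdin and prints "<host> <number of slots>" per host.
# ASSUMPTION: each slot is listed as "... queue@host SLAVE".
slots_per_host() {
    awk '$NF == "SLAVE" {
             n = split($(NF-1), parts, "@")   # queue instance: queue@host
             count[parts[n]]++
         }
         END { for (h in count) print h, count[h] }' | sort
}

# Usage:
#   qstat -g t | slots_per_host
```

For the smp job above this should report 5 slots on the single host mpcs105.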
After the job has terminated successfully, four files were created: openMp_test.pe749772, openMp_test.po749772, openMp_test.e749772, and openMp_test.o749772. The first two list the output related to the setup of the parallel environment and the latter two contain the direct output of the submitted program to the standard error and output stream, respectively. Only the very last file is non-empty, containing the lines
Hello World thread Id 2
Hello World thread Id 3
Hello World thread Id 4
Hello World thread Id 0
Hello World thread Id 1
Number of threads = 5
Further information about this job can, after termination of the job, of course be obtained using qacct. In this regard, the query
qacct -j 749772
yields the output
==============================================================
qname        mpc_std_shrt.q
hostname     mpcs105.mpinet.cluster
group        ifp
owner        alxo9476
project      NONE
department   defaultdepartment
jobname      openMp_test
jobnumber    749772
taskid       undefined
account      sge
priority     0
qsub_time    Wed Jun 26 16:13:21 2013
start_time   Wed Jun 26 16:14:18 2013
end_time     Wed Jun 26 16:14:19 2013
granted_pe   smp
slots        5
failed       0
exit_status  0
ru_wallclock 1
ru_utime     0.058
ru_stime     0.043
ru_maxrss    4124
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    9494
ru_majflt    2
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     423
ru_nivcsw    30
cpu          0.101
mem          0.001
io           0.000
iow          0.000
maxvmem      89.887M
arid         undefined
Again, note that by using the smp parallel environment, all slots are allocated on a single execution host (in this case the scheduler sent the job to the host mpcs105).
Further things to note
There are some things to note about the smp parallel environment:
- Memory requirement for the master thread: Since the job does not run across different execution hosts, the parallel environment memory issue illustrated above is not an issue here. In the previous paragraph you can see that by using 5 slots, the overall memory requirements amount to roughly 90MB (for this basic program).
- Usage of local scratch storage: Since, by using the smp parallel environment, all slots of a submitted job are allocated on a single host you might consider to use the local scratch storage if you intend to submit I/O intense jobs, as discussed here.
- Take care to properly request resources: A standard host on HERO (comprising the nodes mpcs001 through mpcs130) offers 12 slots and an overall memory of 24GB. So, if you require more than that you need to request one of the big nodes, see here.
An example regarding the latter point: Just for the sake of argument, say you submit a job using the resource requirements
#$ -l h_vmem=5G
#$ -pe smp 8
I.e., by default you request a standard HERO node with 24GB of memory. However, your memory requirements amount to 8 x 5GB = 40GB in total, exceeding by far what a standard node offers. In such a case, the last line of the query qstat -j JOB_ID | tail -1, requesting the status of your job with integer specifier JOB_ID, will read
cannot run in PE "smp" because it only offers 0 slots
which somehow describes what happens but does not hint at a solution to your problem. Now, as a remedy you could
- Check your memory requirements and request less than, say, 23GB memory so that the job can be executed on a standard node (bear in mind that not the full 24GB of a node can be requested for your job).
- Leave the memory requirements as they are and request one of the big nodes, offering 12 slots and 46GB of memory, see here. You might request such a big node using the additional option
#$ -l bignode=true
in your job submission script.
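The arithmetic behind this example can be checked before submitting. The sketch below multiplies the number of slots by the memory per slot and compares the result with the memory you may request on a node. It only handles whole GB values; the function name fits_on_node is made up for this illustration, and the limits in the usage lines (23GB for a standard node, 46GB for a big node) are the figures quoted above.

```bash
#!/bin/bash
# Hypothetical helper: succeeds if <slots> * <memory per slot in GB> does not
# exceed <requestable node memory in GB>. Integer values only.
fits_on_node() {
    local slots=$1 per_slot_gb=$2 node_gb=$3
    [ $(( slots * per_slot_gb )) -le "$node_gb" ]
}

# Usage, for the request "-l h_vmem=5G" with "-pe smp 8" (8 * 5 = 40):
#   fits_on_node 8 5 23 || echo "does not fit on a standard node"
#   fits_on_node 8 5 46 && echo "fits on a big node"
```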
Tracking memory issues and memory consumption using valgrind
Before you actually compile your program and submit it via SGE, you might perform some checks on your local computer. For the purpose of memory checking there are many programs available. E.g., if your program runs long enough so that you can determine its process ID, you might use the top command to get an idea about its actual (momentary) memory consumption. However, there are more powerful tools that also allow you to determine whether your program exhibits memory leaks, invalid pointer use or things like that. Here, I will illustrate some basic usage of valgrind, a program that offers tools that, e.g., check the memory management of your code.
To this end, consider the following example program (with annotated line numbers), called myExample_malloc.c:
 1  #include <stdio.h>
 2  #include <stdlib.h>
 3
 4  int main(int args, char *argv[]){
 5    int *myArray_a, *myArray_b, *myArray_c;
 6
 7    myArray_a=(int *) malloc(300*sizeof(int));
 8    myArray_b=(int *) malloc(1000*sizeof(int));
 9    free(myArray_b);
10    myArray_c=(int *) malloc(200*sizeof(int));
11    free(myArray_a);
12    free(myArray_c);
13
14    return 0;
15  }
In brief: once compiled and invoked, it allocates three portions of heap memory and also frees the associated memory before exiting. As it stands, the program has no memory related issues. However, subsequently we will introduce a couple of common errors into the above program so as to build some intuition on how the respective valgrind output might be interpreted.
Compile for elaborate debugging information
In order to be able to better interpret the information provided by valgrind it is good practice to compile your program so that it provides further debugging information. Using gcc this is done by adding the compiler option -g. For the above example you might type
gcc myExample_malloc.c -o myExample_malloc -g
to obtain the executable myExample_malloc, compiled using further debugging symbols.
Example 1: no memory issues
To cut a (rather) long story short, you might type
valgrind --tool=memcheck --leak-check=full ./myExample_malloc
to result in valgrind listing a summary of calls to malloc and free. Here, the output reads:
==3314== Memcheck, a memory error detector
==3314== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==3314== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==3314== Command: ./myExample_malloc
==3314==
==3314==
==3314== HEAP SUMMARY:
==3314==     in use at exit: 0 bytes in 0 blocks
==3314==   total heap usage: 3 allocs, 3 frees, 6,000 bytes allocated
==3314==
==3314== All heap blocks were freed -- no leaks are possible
==3314==
==3314== For counts of detected and suppressed errors, rerun with: -v
==3314== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 4 from 4)
The above listing reports no memory issues: the heap summary shows 3 allocs and 3 frees, and the error summary lists 0 errors from 0 contexts, which is good. In general, note that using valgrind will slow down the execution of the program considerably.
Example 2: non-freed memory
Let's intentionally introduce a particular memory leak into the program myExample_malloc.c. To this end, let's comment out line number 11, which is responsible for freeing the memory associated with the array myArray_a:
11 // free(myArray_a);
Recompiling via
gcc myExample_malloc.c -o myExample_malloc -g
and, again, using the memcheck tool provided by valgrind via
valgrind --tool=memcheck --leak-check=full --show-reachable=yes ./myExample_malloc
yields the output
==1983== Memcheck, a memory error detector
==1983== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==1983== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==1983== Command: ./myExample_malloc
==1983==
==1983==
==1983== HEAP SUMMARY:
==1983==     in use at exit: 1,200 bytes in 1 blocks
==1983==   total heap usage: 3 allocs, 2 frees, 6,000 bytes allocated
==1983==
==1983== 1,200 bytes in 1 blocks are definitely lost in loss record 1 of 1
==1983==    at 0x4A0610C: malloc (vg_replace_malloc.c:195)
==1983==    by 0x4005D4: main (myExample_malloc.c:7)
==1983==
==1983== LEAK SUMMARY:
==1983==    definitely lost: 1,200 bytes in 1 blocks
==1983==    indirectly lost: 0 bytes in 0 blocks
==1983==      possibly lost: 0 bytes in 0 blocks
==1983==    still reachable: 0 bytes in 0 blocks
==1983==         suppressed: 0 bytes in 0 blocks
==1983==
==1983== For counts of detected and suppressed errors, rerun with: -v
==1983== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 4 from 4)
Note that, in the heap summary, valgrind reports 3 allocations but only 2 frees, leaving 1,200 bytes in use at exit. In the leak summary below, they are marked as "definitely lost". Right before the leak summary, some details are given regarding which statement led to the lost memory. The respective snippet reads
==1983== 1,200 bytes in 1 blocks are definitely lost in loss record 1 of 1
==1983==    at 0x4A0610C: malloc (vg_replace_malloc.c:195)
==1983==    by 0x4005D4: main (myExample_malloc.c:7)
indicating that the lost memory was allocated by a call to malloc in line 7. And indeed, this is the line in the program myExample_malloc.c where the memory for the array myArray_a was allocated.
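If you run such checks regularly, you might extract the relevant number from the leak summary automatically. The following sketch parses the "definitely lost" line in the format shown above; the function name definitely_lost_bytes is made up for this illustration.

```bash
#!/bin/bash
# Hypothetical helper: reads valgrind memcheck output from stdin and prints
# the number of "definitely lost" bytes without the thousands separator.
# Prints nothing if the output contains no leak summary.
definitely_lost_bytes() {
    sed -n 's/.*definitely lost: \([0-9,]*\) bytes.*/\1/p' | tr -d ','
}

# Usage (valgrind writes its report to stderr, hence the redirection):
#   valgrind --tool=memcheck --leak-check=full ./myExample_malloc 2>&1 | definitely_lost_bytes
```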
Example 3: invalid pointer use
In the program myExample_malloc.c, the array myArray_a is designed to hold 300 values of type int, which can be indexed using myArray_a[i] with i=0...299. Now, to illustrate how valgrind responds to the use of invalid heap memory, let's introduce a further line after line 7 that tries to initialize a value at a location past the end of the respective array:
8 myArray_a[300]=7;
Recompiling via
gcc myExample_malloc.c -o myExample_malloc -g
and, again, using the memcheck tool provided by valgrind via
valgrind --tool=memcheck --leak-check=full --show-reachable=yes ./myExample_malloc
yields the output
==4184== Memcheck, a memory error detector
==4184== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==4184== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==4184== Command: ./myExample_malloc
==4184==
==4184== Invalid write of size 4
==4184==    at 0x4005E3: main (myExample_malloc.c:8)
==4184==  Address 0x4c1e4f0 is 0 bytes after a block of size 1,200 alloc'd
==4184==    at 0x4A0610C: malloc (vg_replace_malloc.c:195)
==4184==    by 0x4005D4: main (myExample_malloc.c:7)
==4184==
==4184==
==4184== HEAP SUMMARY:
==4184==     in use at exit: 0 bytes in 0 blocks
==4184==   total heap usage: 3 allocs, 3 frees, 6,000 bytes allocated
==4184==
==4184== All heap blocks were freed -- no leaks are possible
==4184==
==4184== For counts of detected and suppressed errors, rerun with: -v
==4184== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 4 from 4)
Note that, in the error summary, valgrind reports 1 error. Right before the error summary, some details are given regarding which statement led to the detected error:
==4184== Invalid write of size 4
==4184==    at 0x4005E3: main (myExample_malloc.c:8)
==4184==  Address 0x4c1e4f0 is 0 bytes after a block of size 1,200 alloc'd
==4184==    at 0x4A0610C: malloc (vg_replace_malloc.c:195)
==4184==    by 0x4005D4: main (myExample_malloc.c:7)
Valgrind reports that an invalid write operation of size 4 (corresponding to an integer on my desktop computer) occurred at line 8 (which is the erroneous line we intentionally added to our code). In this manner, valgrind reports on the use of invalid heap memory.
Monitoring heap memory via massif
Subsequently the valgrind tool massif, a heap profiler that shows how much heap memory a given program actually uses, is illustrated. Therefore, consider the above program myExample_malloc.c in its unmodified form and compile it for further debugging information. Then, you might use valgrind by specifying the massif heap profiler in the form (the possible command line options can, e.g., be found here)
valgrind --tool=massif --time-unit=B ./myExample_malloc
This generates a human readable output file, here named massif.out.4941, which might be postprocessed according to
ms_print massif.out.4941
to yield more details of the heap memory consumption of the input program. The full output reads
--------------------------------------------------------------------------------
Command:            ./myExample_malloc
Massif arguments:   --time-unit=B
ms_print arguments: massif.out.4941
--------------------------------------------------------------------------------

[ASCII graph of the heap usage: vertical axis in KB with the peak of 5.094 KB drawn as a bar of "#" symbols, horizontal axis in KB of bytes allocated/freed, ranging from 0 to 11.77]

Number of snapshots: 8
 Detailed snapshots: [3 (peak)]

--------------------------------------------------------------------------------
  n        time(B)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  0              0                0                0             0            0
  1          1,208            1,208            1,200             8            0
  2          5,216            5,216            5,200            16            0
  3          5,216            5,216            5,200            16            0
99.69% (5,200B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->76.69% (4,000B) 0x4005E1: main (myExample_malloc.c:8)
|
->23.01% (1,200B) 0x4005D3: main (myExample_malloc.c:7)

--------------------------------------------------------------------------------
  n        time(B)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  4          9,224            1,208            1,200             8            0
  5         10,032            2,016            2,000            16            0
  6         11,240              808              800             8            0
  7         12,048                0                0             0            0
In the first part, an ASCII histogram gives a visual account of the heap memory consumption. Therein, each vertical bar along the horizontal axis represents a measurement of the memory usage at a given point in time. The chosen time unit is bytes allocated/freed (set by the command line option --time-unit=B, which especially fits short running programs; note that other choices are possible, too). The memory peak is indicated by a bar composed of "#" symbols, and below the plot, in the list of detailed snapshots, snapshot number 3 is marked as peak. In the following list of snapshot details one can see that for snapshot 3 the "useful-heap" memory allocated amounts to 5,200B. This makes sense, since an integer needs 4B and in our program the memory peak is given by the arrays myArray_a (300*4B) and myArray_b (1000*4B) being allocated at once. Also, note that the whole sequence of "useful-heap" values might be reconstructed by reading through the program myExample_malloc.c.
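The reconstruction of the "useful-heap" values can be written down as a small calculation. The sketch below replays the calls to malloc and free of myExample_malloc.c in program order, assuming that an int occupies 4 bytes as on the machine used above. The function name heap_trace is made up for this illustration.

```bash
#!/bin/bash
int_size=4   # ASSUMPTION: sizeof(int) is 4 bytes, as on the machine used above

# Replays the heap operations of myExample_malloc.c in program order:
# malloc a (300 ints), malloc b (1000), free b, malloc c (200), free a, free c.
# Prints the heap size in bytes after each operation.
heap_trace() {
    local heap=0 step
    for step in +300 +1000 -1000 +200 -300 -200; do
        heap=$(( heap + step * int_size ))
        echo "$heap"
    done
}

heap_trace
```

The printed values 1200, 5200, 1200, 2000, 800 and 0 agree with the "useful-heap" column of snapshots 1 through 7 (the peak of 5,200B is listed twice there, as snapshots 2 and 3).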
Monitoring heap and stack memory via massif
Note that valgrind can be used to monitor not only the heap memory, i.e. the memory obtained by dynamic allocation, but also the stack memory, from which, e.g., the memory for local data is taken. To enable this, the command line option --stacks=yes needs to be set. For the above example this then reads
valgrind --tool=massif --stacks=yes --time-unit=B ./myExample_malloc
yielding a massif data file, here named massif.out.7450, which might be postprocessed according to
ms_print massif.out.7450
to result in more details of the heap and stack memory consumption of the input program. The full output is of course more detailed than the one shown in the preceding paragraph. Here, only the header and the summary of the memory consumption graph are shown for illustrative purposes:
--------------------------------------------------------------------------------
Command:            ./myExample_malloc
Massif arguments:   --stacks=yes --time-unit=B
ms_print arguments: massif.out.7450
--------------------------------------------------------------------------------

[ASCII graph of the heap and stack usage: vertical axis in KB with the peak of 6.078 KB drawn as a bar of "#" symbols near the right end, horizontal axis in KB of bytes allocated/freed, ranging from 0 to 216.2]

Number of snapshots: 66
 Detailed snapshots: [5, 8, 12, 37, 41, 51, 52, 53 (peak), 63]
Note that now the memory peak, again given by the arrays myArray_a (300*4B) and myArray_b (1000*4B) being allocated at once, corresponds to snapshot number 53. Also note that most stack memory operations occur before the actual allocation/deallocation processes in the main routine of the program are performed.
In this way, valgrind might be used to guide you in the process of requesting memory resources for the jobs you aim to submit.
Debugging
Details on how to debug malfunctioning programs on the HPC system can be found here.
Profiling
Details on how to profile your program using the GNU profiling tool gprof (and sprof in case you want to profile code using shared libraries) can be found here.
Monitoring the current resource usage for a running job
This section will illustrate how to use the SGE command qrsh to start an interactive session to determine the actual, current resource usage of a running job. Most likely the resources actually consumed by your application will differ from those that were allocated upon submission. The procedure outlined below might e.g. give you an idea how much memory your application really consumes. This in turn might help you to estimate more fitting resource requirements for future job submissions.
NOTE: The job I used to illustrate this procedure was not owned by me. Before writing this section I asked user diab3109 (i.e. the owner of the job) for permission.
Motivation
Consider a situation where you have a running job and you are interested in the precise status of the job. First you might use the SGE command qstat in order to obtain some details about the principal state of the job and the host it was scheduled to run on:
> qstat
job-ID  prior   name       user         state submit/start at     queue                  slots ja-task-ID
---------------------------------------------------------------------------------------------------------
1097375 0.50500 n5E4_0_4   diab3109     r     11/15/2013 11:27:33 mpc_std_long.q@mpcs008     1
Now, just for the sake of argument, say you are interested in the amount of memory your application currently consumes. For the subsequent steps it is important to first determine the execution host that handles your computing request. To this end, have a look at the queue instance that runs the job. Here, the queue instance reads mpc_std_long.q@mpcs008. In particular this tells you that the application is running on host mpcs008. The remaining steps can be summarized as follows:
- logon to the computing node that hosts your job
- use the top command to filter for the process ID that corresponds to your job
- list e.g. the status file indexed by this process ID
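As a sketch of the last step: on a Linux host the status file of a process resides at /proc/PID/status, and its memory related entries start with Vm (e.g. VmPeak, VmSize, VmHWM and VmRSS). The helper below prints just these entries; the function name vm_fields is made up for this illustration, and the PID in the usage line is only a placeholder for the one you obtain from top.

```bash
#!/bin/bash
# Hypothetical helper: prints the memory related entries of a process status
# file. $1 is the path to the file, e.g. /proc/<PID>/status on the execution host.
vm_fields() {
    grep -E '^Vm(Peak|Size|HWM|RSS):' "$1"
}

# Usage (replace 12345 by the process ID found via top):
#   vm_fields /proc/12345/status
```

Here VmPeak and VmSize refer to the (peak) virtual memory, while VmHWM and VmRSS refer to the (peak) resident memory of the process.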
Start interactive session to logon to execution host
In order to proceed you need to submit an interactive session to SGE via the command qrsh. This will direct you to one of the hosts that are feasible to run interactive sessions through the particular queue mpc_xtr_ctrl.q. To be more precise, the details of this queue read
> qconf -sq mpc_xtr_ctrl.q
qname                 mpc_xtr_ctrl.q
hostlist              @mpcx
seq_no                12700,[mpcs125.mpinet.cluster=12725], \
                      [mpcs126.mpinet.cluster=12726], \
                      [mpcs127.mpinet.cluster=12727], \
                      [mpcs128.mpinet.cluster=12728], \
                      [mpcs129.mpinet.cluster=12729], \
                      [mpcs130.mpinet.cluster=12730]
load_thresholds       NONE
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 INTERACTIVE
ckpt_list             NONE
pe_list               NONE
rerun                 FALSE
slots                 12
tmpdir                /scratch
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            herousers
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  00:10:00
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                50M
Hence, you will be directed to one of the hosts mpcs125 through mpcs130. From the keywords h_vmem and h_rt it is evident that such an interactive session will be limited in memory to 50MB and is only allowed to run for 10 minutes, respectively. To start the interactive session you simply need to type qrsh. Note that since the requested interactive session is handled by SGE like any other job, it might take a short while until the session is granted:
alxo9476@hero01:~$ qrsh
Last login: Tue Nov 26 12:39:01 2013 from hero01.mpinet.cluster
alxo9476@mpcs125:~$
Apparently I was directed to the particular node mpcs125. From any terminal opened on one of the submission hosts you can verify that the interactive session is recognized by SGE:
alxo9476@hero01:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                  slots ja-task-ID
---------------------------------------------------------------------------------------------------------
1105004 0.50500 QRLOGIN    alxo9476     r     11/26/2013 12:50:12 mpc_xtr_ctrl.q@mpcs125     1
Now, let's log on to the computing node that hosts the job we are interested in monitoring, i.e. node mpcs008. To this end you might simply type
alxo9476@mpcs125:~$ ssh mpcs008
Last login: Tue Nov 26 12:56:10 2013 from mpcs125.cm.cluster
alxo9476@mpcs008:~$
If you log on to the node for the very first time you will get a message similar to Warning: Permanently added 'mpcs008,10.141.3.8' (RSA) to the list of known hosts.
Use top to filter for a job's process ID
Once you are logged in to the execution host that processes the job you are interested in, you can use the common unix tool top to filter for that job. You can precisely filter for those processes that belong to a particular owner by specifying his/her user name using the -u option similar to:
alxo9476@mpcs008:~$ top -u "diab3109"
top - 13:06:52 up 84 days, 22:06,  1 user,  load average: 10.30, 9.90, 9.86
Tasks: 245 total,  11 running, 234 sleeping,   0 stopped,   0 zombie
Cpu(s): 78.8%us,  0.8%sy,  0.0%ni, 15.7%id,  4.4%wa,  0.1%hi,  0.2%si,  0.0%st
Mem:  24659208k total, 23530528k used,  1128680k free,    19364k buffers
Swap:  1998840k total,    22244k used,  1976596k free, 13396936k cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
10210 diab3109  25   0 54112  31m  828 R 100.1  0.1  15936:50 n5E4_0_4.out
10191 diab3109  22   0 84092 2492 1892 S   0.0  0.0   0:00.00 bash
You can get a more elaborate display of the COMMAND that corresponds to a given process by simply typing c. Here the list of processes then elaborates to
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 10210 diab3109 25 0 54112 31m 828 R 100.1 0.1 15938:57 ./n5E4_0_4.out 10191 diab3109 22 0 84092 2492 1892 S 0.0 0.0 0:00.00 -bash /cm/shared/apps/sge/current/default/spool/mpcs008/job_scripts/1097375
The first process (PID: 10210) refers to the binary that is executed, and the second process (PID: 10191) shows the job submission script (residing in the spool directory of the execution host) that was used to start the job. You might filter that submission script to see which resources were requested during the submission procedure. E.g. if you are interested in the precise values of resource requests such as h_rt and h_vmem, you might type
alxo9476@mpcs008:~$ cat /cm/shared/apps/sge/current/default/spool/mpcs008/job_scripts/1097375 | grep "#$ -l"
#$ -l longrun=true
#$ -l h_vmem=500M
#$ -l h_fsize=200M
So, apparently 500 MB were allocated for the job, and the resource option longrun was set to true (which explains why the job was scheduled on an instance of the queue mpc_std_long.q).
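If you need the requested values without the leading #$ -l, e.g. to compare them with the actual usage later on, a sed filter might help. This is a minimal sketch: the function name requested_resources is hypothetical, and it assumes one resource request per -l line, as in the script above (comma-separated requests on a single line would be printed unsplit).

```shell
# Hypothetical helper: list the "-l" resource requests embedded in an SGE
# job script, one per line, with the leading "#$ -l" stripped.
# $1 is the path to the job script.
requested_resources() {
    sed -n 's/^#\$[[:space:]]*-l[[:space:]]*//p' "$1"
}

# Usage on the execution host (path taken from the example):
#   requested_resources /cm/shared/apps/sge/current/default/spool/mpcs008/job_scripts/1097375
```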
At times it is also interesting to see on which CPU a particular process runs. To find out, within an active top session you might simply type the key sequence f j <space> to get an additional column (headed by the letter P) that specifies the CPU for each process. Additionally, by simply typing 1 you get an account of the current usage of all available CPUs. The result might look as listed below:
top - 13:19:07 up 84 days, 22:18,  1 user,  load average: 10.11, 10.17, 10.04
Tasks: 245 total,  11 running, 234 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.3%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 97.3%us,  2.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  : 97.7%us,  2.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  : 97.0%us,  3.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  0.0%us,  0.7%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  24659208k total, 23519928k used,  1139280k free,    16752k buffers
Swap:  1998840k total,    22244k used,  1976596k free, 12603656k cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+   P COMMAND
10210 diab3109  25   0 54112  31m  828 R 100.1  0.1  15949:05  10 n5E4_0_4.out
10191 diab3109  22   0 84092 2492 1892 S   0.0  0.0   0:00.00   7 bash
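The CPU a process last ran on can also be read without an interactive top session, since the kernel exposes it as field 39 (processor) of /proc/<PID>/stat. The following is only a sketch of that alternative, not the procedure used above; the function name last_cpu is made up for this illustration.

```shell
# Hypothetical helper: print the CPU number a process last executed on.
# $1 is the path to a /proc/<PID>/stat file.
# The command name (field 2) is enclosed in parentheses and may itself
# contain spaces, so everything up to the last ") " is dropped first.
# Field 39 of the full line is then field 37 of the remainder.
last_cpu() {
    sed 's/^.*) //' "$1" | awk '{ print $37 }'
}

# Usage on the execution host (PID taken from the example):
#   last_cpu /proc/10210/stat
```

Note that the value is a snapshot: unless the process is pinned to a core, the scheduler may move it to another CPU at any time.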
Let's briefly clarify the different fields of the CPU usage lines. E.g., CPU 10 (the one hosting process 10210) currently shows
Cpu10 :100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
From left to right the individual fields read:
- %us = CPU usage by processes from user
- %sy = CPU usage by system and kernel processes
- %ni = CPU usage by processes without standard priority (ni stands for nice)
- %id = CPU is idle
- %wa = time during which CPU had to wait due to I/O access
- %hi = CPU usage due to hardware interrupts
- %si = CPU usage due to software interrupts
- %st = CPU time stolen from this machine by a hypervisor (only relevant on virtual machines)
To support intuition: a high value of %us and a small value of %id are good. For the job we are interested in, the values read 100%us and 0%id, respectively. In contrast, a large value of %wa would e.g. indicate high disk activity, which could be due to extensive swapping.
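If you want to pick out a single field from such a line, e.g. to check %wa in a script, a sed expression is sufficient. This is a minimal sketch under the assumption that the line has the format shown above (value directly followed by %us, %wa, and so on; more recent versions of top print the fields differently). The function name cpu_field is hypothetical.

```shell
# Hypothetical helper: extract one field from a per-CPU line printed by top.
# $1 is the line, $2 is the field suffix (us, sy, ni, id, wa, hi, si or st).
cpu_field() {
    printf '%s\n' "$1" | sed -n "s/.*[ :]\([0-9.]*\)%$2.*/\1/p"
}

# The line for CPU 10 from the listing in this section:
line='Cpu10 :100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st'
echo "user: $(cpu_field "$line" us)  idle: $(cpu_field "$line" id)  iowait: $(cpu_field "$line" wa)"
```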
List a job's status file
To get an idea about the current memory usage of your running job and the maximal amount of memory used so far, you can use the process ID (PID) of a job and list its status file located in the standard /proc folder (see here) on the execution host. Within that folder, the status file can be found in a subfolder named after the PID, see:
alxo9476@mpcs008:~$ cat /proc/10210/status
Name:   n5E4_0_4.out
State:  R (running)
SleepAVG:       0%
Tgid:   10210
Pid:    10210
PPid:   10191
TracerPid:      0
Uid:    20438   20438   20438   20438
Gid:    12400   12400   12400   12400
FDSize: 256
Groups: 12000 12400 12402 21000 50126
VmPeak:    54112 kB
VmSize:    54112 kB
VmLck:         0 kB
VmHWM:     31860 kB
VmRSS:     31832 kB
VmData:    43592 kB
VmStk:        88 kB
VmExe:        36 kB
VmLib:      2168 kB
VmPTE:       108 kB
StaBrk: 03066000 kB
Brk:    03087000 kB
StaStk: 7fffffffe140 kB
Threads:        1
SigQ:   0/212992
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 00000000000040b6
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
Cpus_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00ffffff
Mems_allowed:   00000000,00000003
This status file simply is a summary of stat (yielding the process status) and statm (yielding further information about the process status) in human-readable form. An explanation of all the keywords listed therein is provided here. Most importantly, the peak and the current (virtual) memory usage are given by the keywords VmPeak and VmSize. Here, the job uses approximately 54 MB, which is far less than the 500 MB allocated during the submission procedure. However, the allocated amount of 500 MB is still smaller than the default value of 1.2 GB. A job as slim as this one barely blocks any resources that might be needed by other users. Hence, the observed discrepancy between allocated memory and actually used memory can be tolerated without complaints.
Once you are done, you might simply log out from the execution host using exit. This will return you to the host you were directed to upon submission of the interactive session (here: mpcs125). Exiting again will return you to the submission host you initially started from.
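The comparison between used and allocated memory can be scripted as well. The following sketch uses two hypothetical helpers, status_kb and percent_of. It assumes that the suffix M in h_vmem=500M denotes 1024*1024 bytes, so that 500M corresponds to 500*1024 kB.

```shell
# Hypothetical helper: print the value (in kB) of one key, e.g. VmPeak,
# from a /proc/<PID>/status file. $1 is the file, $2 is the key.
status_kb() {
    awk -v key="$2:" '$1 == key { print $2 }' "$1"
}

# Hypothetical helper: percentage (rounded down) that a value $1 makes up
# of a limit $2; both are given in kB.
percent_of() {
    echo $(( $1 * 100 / $2 ))
}

# Usage on the execution host (PID and limit taken from the example):
#   peak=$(status_kb /proc/10210/status VmPeak)
#   percent_of "$peak" $((500 * 1024))
```

For the job considered here (VmPeak of 54112 kB), this amounts to roughly a tenth of the allocated 500 MB.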