Difference between revisions of "Brief Introduction to HPC Computing"
Revision as of 16:35, 15 May 2013
A brief introduction to the usage of the HPC facilities, targeted at new and inexperienced HPC users, is given below. The introduction is based on various minimal examples that illustrate how to compile serial and parallel programs as well as how to submit and monitor actual jobs using SGE.
A simple serial program
Example using the GNU compiler collection (gcc)
Consider the following "Hello World!" C program, called myExample.c:
#include <stdio.h>
int main(int argc, char *argv[]) {
  fprintf(stdout,"Hello World!\n");
  return 0;
}
In brief: once compiled and invoked, it only prints the string "Hello World!" to the standard output stream.
Compiling the program
Once you log in to the system, you are on one of the two nodes hero01 and hero02. This is where you should compile your programs and from where you should submit your jobs. After login, the GNU compiler collection is loaded by default. Nevertheless, we will explicitly go through the steps needed to load a certain compiler, so let's pretend the compiler we need is not loaded already. In order to use a certain compiler, one needs to load the respective user environment, specified by a particular module.
To get a list of the modules which are loaded currently, just type
module list
For me, this yields the output
Currently Loaded Modulefiles:
  1) shared   2) sge/6.2u5p2
Further, to get a list of all available gcc related modules, you might type
module avail gcc
to obtain
------------------- /cm/shared/modulefiles --------------------
gcc/4.3.4 gcc/4.6.3 gcc/4.7.1
In the subsequent example we will use gcc/4.7.1 to compile the program above. More information about that particular module can be obtained by typing
module show gcc/4.7.1
Finally, to load the module just type
module load gcc/4.7.1
which creates the desired user environment. You can check whether the proper compiler is loaded by typing which gcc. This now yields
/cm/shared/apps/gcc/4.7.1/bin/gcc
Hence everything worked well and the stage is properly set in order to compile the example program by means of the statement
gcc myExample.c -o myExample
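The manual check with which gcc can also be scripted, for instance at the top of a build script. The following is a minimal sketch: tool_under_prefix is a hypothetical helper name (not part of the module system), and the prefix is simply the path reported by which gcc above.

```bash
#!/bin/bash
# Hypothetical helper (not part of the original tutorial): succeed only if
# the first match for a command on PATH lives below an expected prefix.
tool_under_prefix() {
    local tool="$1" prefix="$2" path
    # command -v prints the resolved path of an executable found on PATH
    path="$(command -v "$tool")" || return 1
    case "$path" in
        "$prefix"/*) return 0 ;;
        *)           return 1 ;;
    esac
}

# Example: the prefix below is the one reported by "which gcc" above.
if tool_under_prefix gcc /cm/shared/apps/gcc/4.7.1; then
    echo "gcc 4.7.1 is active"
else
    echo "gcc 4.7.1 is NOT active; try: module load gcc/4.7.1"
fi
```

On a machine where the module is not loaded (or where gcc is installed elsewhere), the sketch prints the second message instead of compiling with an unintended compiler.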
Submitting a job
In order to submit the job via SGE, you might use the following job submission script, called myProg.sge:
#!/bin/bash
####### which shell to use
#$ -S /bin/bash
####### change to directory where job was submitted from
#$ -cwd
####### maximum walltime of the job (hh:mm:ss)
#$ -l h_rt=0:10:0
####### memory per job slot
#$ -l h_vmem=300M
####### disk space
#$ -l h_fsize=100M
####### name of the job
#$ -N basic_test
./myExample
The resource allocation statements that are used in the job submission script above are explained here. The script is submitted by means of
qsub myProg.sge
which enqueues the job.
Using local (scratch) storage for I/O intense serial jobs
Consider a situation where your application is so I/O intense that its speed suffers from the number of I/O operations that strain the global file system. Examples might be irregular I/O patterns at a fast pace or an application that has to create, open, close and delete many files. As a remedy, you might benefit from using a local scratch disk of the execution host on which your program actually runs. This reduces the amount of network traffic and hence the strain on the global file system. The subsequent example illustrates how to access and use the local storage on a given host for the purpose of storing data during the runtime of the program. In the example, after the program terminates, the output data is copied to the working directory from which the job was submitted and the local file system on the host is cleaned out. To this end, consider the exemplary C program myExample_tempdir.c
#include <stdio.h>
int main(int argc, char *argv[]) {
  FILE *myFile;
  /* the directory my_data must already exist */
  myFile=fopen("my_data/myData.out","w");
  if (myFile==NULL) {
    perror("my_data/myData.out");
    return 1;
  }
  fprintf(myFile,"Test output to local scratch directory\n");
  fclose(myFile);
  return 0;
}
which, just for the sake of argument (and to fully explain the job submission script below), is contained in the working directory
$HOME/wmwr/my_examples/tempdir_example/
The program assumes that there is a directory my_data in the current working directory to which the file myData.out with a certain content (here the sequence of characters Test output to local scratch directory) will be written.
In order to compile the program via the current gcc compiler, you could first set the stage by loading the proper modules, e.g.,
module clear
module load sge
module load gcc/4.7.1
and then compile via
gcc myExample_tempdir.c -o myExample_tempdir
to yield the binary myExample_tempdir.
At this point bear in mind that we do not want to execute the binary by hand right away! Instead, we would like to leave it to SGE to determine a proper queue instance (guided by the resources we subsequently will specify for the job) on a host with at least one free slot, where the job will be executed. A proper job submission script, here called myProg_tempdir.sge, that takes care of creating the folder my_data needed by the program myExample_tempdir in order to store its output in a temporary directory on the executing host reads
#!/bin/bash
####### which shell to use
#$ -S /bin/bash
####### change to directory where job was submitted from
#$ -cwd
####### maximum walltime of the job (hh:mm:ss)
#$ -l h_rt=0:10:0
####### memory per job slot
#$ -l h_vmem=100M
####### since working with local storage, no need to request disk space
####### name of the job
#$ -N tmpdir_test
####### change current working directory to the local /scratch/<jobId>.<x>.<qInst>
####### directory, available as TMPDIR on the executing host with HOSTNAME
cd $TMPDIR
####### write details to <jobName>.o<jobId> output file
echo "HOSTNAME = " $HOSTNAME
echo "TMPDIR = " $TMPDIR
####### create output directory on executing host (parent folder is TMPDIR)
mkdir my_data
####### run program
$HOME/wmwr/my_examples/tempdir_example/myExample_tempdir
####### copy the output to the directory the job was submitted from
cp -a ./my_data $HOME/wmwr/my_examples/tempdir_example/
Note that in the above job submission script there is no need to request disk space by setting the resource h_fsize since we are working with local storage provided by the execution host. Submitting the script via
qsub myProg_tempdir.sge
enqueues the respective job, here having jobId 703914. After successful termination of the job, the folder my_data is copied to the working directory from which the job was originally submitted. Also, the two job status files tmpdir_test.e703914 and tmpdir_test.o703914 were created, which might contain further details associated with the job. The latter file should contain the name of the host on which the job actually ran and the name of the temporary directory. And indeed, cat tmpdir_test.o703914 reveals the file content
HOSTNAME = mpcs001
TMPDIR = /scratch/703914.1.mpc_std_shrt.q
Further, the file my_data/myData.out contains the line
Test output to local scratch directory
as expected. Note that the temporary directory $TMPDIR (here: /scratch/703914.1.mpc_std_shrt.q) on the execution host (here: mpcs001) is automatically cleaned out. Finally, note that since $TMPDIR is created on a single host, the procedure outlined above works well only if your application runs on a single host. I.e., it is feasible for jobs that either request only a single slot (i.e. non-parallel jobs) or for parallel jobs for which all requested slots fit onto the same host. However, due to the "fill up" allocation rule obeyed by SGE, this cannot be guaranteed in general.
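The stage-in/stage-out pattern described in this section can be wrapped in a small shell function, so that the results are copied back even if the program fails halfway. The following is a sketch under stated assumptions: run_on_scratch is a hypothetical helper name, and the fallbacks to /tmp and to the current directory are only there so that the sketch also runs outside of SGE. Within a job, TMPDIR and SGE_O_WORKDIR are provided by SGE.

```bash
#!/bin/bash
# Sketch only. Inside an SGE job, TMPDIR and SGE_O_WORKDIR are set by the
# scheduler; the fallbacks (/tmp and the current directory) are assumptions
# that merely let the sketch run in an ordinary shell as well.
# run_on_scratch is a hypothetical helper name.
run_on_scratch() (
    # The function body is a subshell, so "cd" and the trap stay local to it.
    workdir="${SGE_O_WORKDIR:-$PWD}"
    jobdir="$(mktemp -d "${TMPDIR:-/tmp}/stage.XXXXXX")" || exit 1

    # On exit of the subshell (success or failure): copy results, clean up.
    trap 'cp -a "$jobdir/my_data" "$workdir/"; rm -rf "$jobdir"' EXIT

    cd "$jobdir" || exit 1
    mkdir my_data
    "$@"        # run the actual program with its arguments
    exit $?     # propagate its exit status; the trap runs afterwards
)
```

In a job submission script the call would then read run_on_scratch $HOME/wmwr/my_examples/tempdir_example/myExample_tempdir. The private subdirectory created by mktemp is a design choice of this sketch; it keeps the cleanup step from touching anything else in the scratch directory.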
A simple parallel program
In order to illustrate how to compile, submit and monitor a basic parallel C program that uses the Message Passing Interface (MPI), consider the following code contained in the file myHelloWorld_mpi.c:
#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[]) {
  int numprocs, rank, namelen;
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(processor_name, &namelen);
  printf("Process %d (out of %d) on host %s\n",rank, numprocs, processor_name);
  MPI_Finalize();
  return 0;
}
To sum it up: once the program is compiled, submitted with a request for, say, M slots, and finally runs, each process writes a statement that specifies its rank and the name of its execution host to the standard output stream.
Note that, in order to compile and submit the program, you might use one out of a variety of parallel environments (PEs). You can get an overview of all possible PEs by typing
qconf -spl
the output of which is not listed here. Out of the many possible choices (some of them fit rather special needs of certain applications, which a typical HPC user does not have to worry about) we will subsequently consider the PEs openmpi and impi in more detail.
Example using the openmpi parallel environment
To use the openmpi parallel environment, you need to load the proper modules prior to compiling the above program (see here). To see which openmpi modules are available you might type module avail openmpi, yielding
------------------- /cm/shared/modulefiles --------------------
openmpi/1.4.3/gcc/64/4.3.4
openmpi/1.4.3/intel/ics/2011.0.013/64
openmpi/1.4.3/intel/ics/64/2011.0.013
openmpi/1.6.2/gcc/64/4.7.1
openmpi/1.6.2-debug/gcc/64/4.7.1
openmpi/gcc/64/1.4.2
openmpi/gcc/64/1.4.3
openmpi/intel/64/10.1.015/1.4.5
openmpi/intel/compiler/64/10.1.015/1.4.5
openmpi/intel/ics/64/2011.0.013/1.4.3
openmpi/open64/64/1.4.2
Subsequently we will use openmpi version 1.6.2, built with gcc version 4.7.1.
Compiling the program
To load the proper modules you might type:
module unload gcc
module load gcc/4.7.1
module load openmpi/1.6.2/gcc/64/4.7.1
Note that it might be necessary to first unload other modules that would cause a conflict! More information on module details and possible conflicts can be found by typing, e.g.,
module show openmpi/1.6.2/gcc/64/4.7.1
Also, a list of all currently loaded modules is available by typing
module list
Once the proper modules are loaded you might proceed to compile the program via
mpicc myHelloWorld_mpi.c -o myHelloWorld_openMpi
You might first check whether mpicc indeed refers to the desired compiler by typing which mpicc, which in this case yields
/cm/shared/apps/openmpi/1.6.2/gcc/64/4.7.1/bin/mpicc
so everything is fine and the stage is properly set!
Submitting a job
In order to submit the job via SGE, specifying a parallel environment (PE) that fits your choice (here: openMpi), you might use the following job submission script, called myProg_openMpi.sge:
#!/bin/bash
####### which shell to use
#$ -S /bin/bash
####### change to directory where job was submitted from
#$ -cwd
####### maximum walltime of the job (hh:mm:ss)
#$ -l h_rt=0:10:0
####### memory per job slot
#$ -l h_vmem=1000M
####### disk space
#$ -l h_fsize=1G
####### which parallel environment to use, and number of slots
#$ -pe openmpi 13
####### enable resource reservation (to prevent starving of parallel jobs)
#$ -R y
####### name of the job
#$ -N openMpi_test
module unload gcc
module load gcc/4.7.1
module load openmpi/1.6.2/gcc/64/4.7.1
mpirun --mca btl ^openib,ofud -machinefile $TMPDIR/machines -n $NSLOTS ./myHelloWorld_openMpi
Most of the resource allocation statements should look familiar to you (if not, see here). However, note that a few of them are required to ensure a proper submission of parallel jobs. E.g., you need to take care to use the proper PE: in the job submission script this is done by means of the statement
#$ -pe <parallel_environment> <num_slots>
wherein <parallel_environment> refers to the type of PE that fits your application and where <num_slots> specifies the number of desired slots for the parallel job. Here we decided to use openMpi, hence the proper PE reads openmpi. Further, in the above example 13 slots are requested.
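The PE and the number of slots can also be passed to qsub on the command line instead of editing the script for every run. To my understanding, options given on the command line take precedence over the corresponding #$ lines in the script, but please verify this for the local SGE version. The sketch below uses a hypothetical wrapper called submit_parallel. The DRY_RUN switch is an assumption introduced here so that the command can be inspected without actually submitting anything.

```bash
#!/bin/bash
# submit_parallel is a hypothetical wrapper, not an SGE command.
# With DRY_RUN set, the qsub command is printed instead of executed.
submit_parallel() {
    local pe="$1" slots="$2" script="$3"
    # reject anything that is not a positive integer
    case "$slots" in
        ''|*[!0-9]*|0) echo "slots must be a positive integer" >&2; return 1 ;;
    esac
    ${DRY_RUN:+echo} qsub -pe "$pe" "$slots" "$script"
}

# Dry run: only prints the command that would be executed.
DRY_RUN=1 submit_parallel openmpi 13 myProg_openMpi.sge
```

The dry run prints qsub -pe openmpi 13 myProg_openMpi.sge. Omitting DRY_RUN=1 submits the job.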
Now, typing
qsub myProg_openMpi.sge
enqueues the job, assigning the jobId 704398 in my case.
Checking the status of the job
Once the job starts to run, it is possible to infer from which hosts the 13 requested slots are accumulated by typing qstat -g t, which in my case yields
job-ID  prior   name       user     state submit/start at     queue                  master ja-task-ID
----------------------------------------------------------------------------------------------------------
 704398 0.50735 openMpi_te alxo9476 r     05/15/2013 09:54:23 mpc_std_shrt.q@mpcs002 MASTER
                                                              mpc_std_shrt.q@mpcs002 SLAVE
                                                              mpc_std_shrt.q@mpcs002 SLAVE
 704398 0.50735 openMpi_te alxo9476 r     05/15/2013 09:54:23 mpc_std_shrt.q@mpcs004 SLAVE
                                                              mpc_std_shrt.q@mpcs004 SLAVE
                                                              mpc_std_shrt.q@mpcs004 SLAVE
                                                              mpc_std_shrt.q@mpcs004 SLAVE
                                                              mpc_std_shrt.q@mpcs004 SLAVE
 704398 0.50735 openMpi_te alxo9476 r     05/15/2013 09:54:23 mpc_std_shrt.q@mpcs006 SLAVE
                                                              mpc_std_shrt.q@mpcs006 SLAVE
 704398 0.50735 openMpi_te alxo9476 r     05/15/2013 09:54:23 mpc_std_shrt.q@mpcs008 SLAVE
                                                              mpc_std_shrt.q@mpcs008 SLAVE
                                                              mpc_std_shrt.q@mpcs008 SLAVE
                                                              mpc_std_shrt.q@mpcs008 SLAVE
After the job has terminated successfully, four files were created: openMpi_test.e704398, openMpi_test.o704398, openMpi_test.pe704398 and openMpi_test.po704398. In detail they contain:
- openMpi_test.po704398: the hostfile for the job which can be found in the spool directory for the MASTER process (which in this case is mpcs002), reading
-catch_rsh /cm/shared/apps/sge/current/default/spool/mpcs002/active_jobs/704398.1/pe_hostfile
mpcs002.mpinet.cluster
mpcs002.mpinet.cluster
mpcs004.mpinet.cluster
mpcs004.mpinet.cluster
mpcs004.mpinet.cluster
mpcs004.mpinet.cluster
mpcs004.mpinet.cluster
mpcs006.mpinet.cluster
mpcs006.mpinet.cluster
mpcs008.mpinet.cluster
mpcs008.mpinet.cluster
mpcs008.mpinet.cluster
mpcs008.mpinet.cluster
- openMpi_test.pe704398: nothing (which is good!)
- openMpi_test.o704398: the (expected) program output, reading
Process 7 (out of 13) on host mpcs006
Process 8 (out of 13) on host mpcs006
Process 6 (out of 13) on host mpcs004
Process 3 (out of 13) on host mpcs004
Process 5 (out of 13) on host mpcs004
Process 2 (out of 13) on host mpcs004
Process 4 (out of 13) on host mpcs004
Process 10 (out of 13) on host mpcs008
Process 12 (out of 13) on host mpcs008
Process 9 (out of 13) on host mpcs008
Process 11 (out of 13) on host mpcs008
Process 1 (out of 13) on host mpcs002
Process 0 (out of 13) on host mpcs002
- openMpi_test.e704398: if there are N hosts involved in running your application (here: N=4), there should be N-1 harmless error messages (each consisting of two lines) of the form
bash: module: line 1: syntax error: unexpected end of file
bash: error importing function definition for `module'
bash: module: line 1: syntax error: unexpected end of file
bash: error importing function definition for `module'
bash: module: line 1: syntax error: unexpected end of file
bash: error importing function definition for `module'
This is a harmless, well known and documented error for the SGE version (6.2u5) used on the local HPC facilities (see here) which you might safely ignore.
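To verify how the processes were actually distributed over the hosts, the lines of the output file can be tallied per host. The following is a minimal sketch; count_per_host is a hypothetical helper name, and the sample input consists of a few lines copied from the program output listed above.

```bash
#!/bin/bash
# Tally MPI processes per host from lines of the form
# "Process <rank> (out of <n>) on host <name>".
# count_per_host is a hypothetical helper name.
count_per_host() {
    awk '$1 == "Process" && $(NF-1) == "host" { n[$NF]++ }
         END { for (h in n) print h, n[h] }' "$@" | sort
}

# Sample input: a few lines copied from the output shown above.
count_per_host <<'EOF'
Process 7 (out of 13) on host mpcs006
Process 8 (out of 13) on host mpcs006
Process 6 (out of 13) on host mpcs004
Process 1 (out of 13) on host mpcs002
Process 0 (out of 13) on host mpcs002
EOF
```

For the sample input this prints mpcs002 2, mpcs004 1 and mpcs006 2, one pair per line. Applied to the full file via count_per_host openMpi_test.o704398, the counts should add up to the 13 requested slots.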
Example using the impi parallel environment
To use the impi parallel environment, you need to load the proper modules prior to compiling the above program (see here). To see which Intel MPI modules are available you might type module avail intel/impi, yielding
------------------- /cm/shared/modulefiles --------------------
intel/impi/32/4.0.1.007
intel/impi/4.0.1.007/64
intel/impi/32/4.1.0.024
intel/impi/64/4.0.1.007
intel/impi/4.0.1.007/32
intel/impi/64/4.1.0.024
Subsequently we will use the 64-bit Intel MPI library with the environment as specified by the module intel/impi/64/4.1.0.024.
Compiling the program
To load the proper modules you might type:
module unload gcc
module unload openmpi
module load intel/impi/64/4.1.0.024
Note that it might be necessary to first unload other modules that would cause a conflict! More information on module details and possible conflicts can be found by typing, e.g.,
module show intel/impi/64/4.1.0.024
Also, a list of all currently loaded modules is available by typing
module list
Once the proper modules are loaded you might proceed to compile the program via
mpicc myHelloWorld_mpi.c -o myHelloWorld_intelMpi
You might first check whether mpicc indeed refers to the desired compiler by typing which mpicc, which in this case yields
/cm/shared/apps/intel/ics/2013.0.028/impi/4.1.0.024/intel64/bin/mpicc
so everything is fine and the stage is properly set!
Submitting the program
In order to submit the job via SGE, specifying a parallel environment (PE) that fits your choice (here: impi), you might use the following job submission script, called myProg_intelMpi.sge:
#!/bin/bash
####### which shell to use
#$ -S /bin/bash
####### change to directory where job was submitted from
#$ -cwd
####### maximum walltime of the job (hh:mm:ss)
#$ -l h_rt=0:10:0
####### memory per job slot
#$ -l h_vmem=1000M
####### disk space
#$ -l h_fsize=1G
####### which parallel environment to use, and number of slots
#$ -pe impi 13
####### enable resource reservation (to prevent starving of parallel jobs)
#$ -R y
####### name of the job
#$ -N intelMpi_test
module unload gcc
module unload openmpi
module load intel/impi/64/4.1.0.024
mpirun -machinefile $TMPDIR/machines -n $NSLOTS ./myHelloWorld_intelMpi
Most of the resource allocation statements should look familiar to you (if not, see here). However, note that a few of them are required to ensure a proper submission of parallel jobs. E.g., you need to take care to use the proper PE: in the job submission script this is done by means of the statement
#$ -pe <parallel_environment> <num_slots>
wherein <parallel_environment> refers to the type of PE that fits your application and where <num_slots> specifies the number of desired slots for the parallel job. Here we decided to use impi, hence the proper PE reads impi. Further, in the above example 13 slots are requested.
Now, typing
qsub myProg_intelMpi.sge
enqueues the job, assigning the jobId 704648 in my case.
The parallel environment memory issue
As pointed out above, after a job has finished one can obtain further information about the resources actually required by the job by using the qacct utility. The only thing one has to provide is the Id of the job. To point out one specific issue of using PEs (in particular on HERO), reconsider the openmpi example above. In that example, the jobId provided by SGE was 704398. Typing qacct -j 704398 yields a list of resources actually used by the application:
==============================================================
qname        mpc_std_shrt.q
hostname     mpcs002.mpinet.cluster
group        ifp
owner        alxo9476
project      NONE
department   defaultdepartment
jobname      openMpi_test
jobnumber    704398
taskid       undefined
account      sge
priority     0
qsub_time    Wed May 15 09:53:55 2013
start_time   Wed May 15 09:54:24 2013
end_time     Wed May 15 09:54:36 2013
granted_pe   openmpi
slots        13
failed       0
exit_status  0
ru_wallclock 12
ru_utime     0.602
ru_stime     0.459
ru_maxrss    20976
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    51650
ru_majflt    523
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     14798
ru_nivcsw    1347
cpu          1.061
mem          0.030
io           0.002
iow          0.000
maxvmem      775.348M
arid         undefined
Consider the value of maxvmem used by the job. At first sight this seems a bit odd, given that the application was a simple "hello world" program! However, if the processes that belong to a job (here: 13 processes) are distributed over several hosts (here: 4 hosts), the MASTER process has to handle all the SLAVE processes. Therefore it has to set up and maintain a connection to all the remote hosts, which costs memory (easily 150 to 200M per host). Note that these memory requirements accumulate for the MASTER process only; the SLAVE processes need less memory. Hence, if one submits a large parallel job that might be executed on several hosts, one has to allocate sufficient memory so that the MASTER process does not run out of resources. Otherwise the job will be killed.
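The rule of thumb above can be turned into a quick back-of-the-envelope estimate. The sketch below is not an exact model of the memory consumption; it simply multiplies the number of hosts by the per-host overhead of 150 to 200M quoted in the text. The helper names estimate_master_mem and maxvmem_of are hypothetical. The second helper extracts the maxvmem entry from qacct output, so that the estimate can be compared with the measured value.

```bash
#!/bin/bash
# Rough estimate only: the 150 to 200M per host is the rule of thumb from
# the text, not a measured constant. Both helper names are hypothetical.
estimate_master_mem() {
    local hosts="$1" per_host_mb="$2"
    echo $(( hosts * per_host_mb ))
}

# Print the value of the maxvmem line from "qacct -j <jobId>" output (stdin).
maxvmem_of() {
    awk '$1 == "maxvmem" { print $2 }'
}

hosts=4   # number of hosts in the openmpi example
echo "low: $(estimate_master_mem "$hosts" 150)M"
echo "high: $(estimate_master_mem "$hosts" 200)M"
```

For the four hosts of the openmpi example this gives a range of 600M to 800M, which brackets the reported maxvmem of 775.348M. On the cluster, the measured value can be obtained via qacct -j 704398 | maxvmem_of.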