MATLAB Distributed Computing Server
Benefits of MDCS
With MDCS, you can submit both serial and parallel jobs to one of the central HPC clusters from within your local MATLAB session. There is no need to deal with SGE, or even to log on to the HPC systems. The internal structure of the clusters is hidden from the end user; they merely act as a "black box" connected to your local machine that provides you with powerful computational resources. On the other hand, jobs submitted via MDCS are fully integrated into SGE and have the same rights and privileges as any other SGE batch job. Thus there is no conflict between MATLAB jobs submitted via MDCS and standard SGE batch jobs.
Using MDCS for MATLAB computations on the central HPC facilities has a number of advantages, e.g.:
- Simplified workflow for those who exclusively do their numerics with MATLAB: they can do development, production of results, and post-processing within a unified environment (the MATLAB desktop).
- A "worker" (= MATLAB session without a user interface) does not check out any "regular" MATLAB license or, what is even more, any Toolbox license even if functions or utilities of the Toolbox are used by the worker; all Toolboxes to which the client which the job was submitted from has access to (regardless whether they are actually checked out by the client or not) can be used by the workers.Considering that the University has only 200 MATLAB licenses, but there are 224 MDCS worker licenses, it immediately becomes clear that the total number of MATLAB licenses for all users in the University is effectively more than doubled! The effect is even more pronounced for the Toolboxes: e.g., there are only 50 licenses for the Statistics Toolbox, but with MDCS an additional "effective" 224 licenses for this Toolbox become available (analogous for all other toolboxes).
To allow for a fair sharing of resources, the number of worker licenses a single user can check out at a given instance has been limited to 36. This should be compared to the situation before the introduction of MDCS: how often could one user get access to 36, say, Statistics Toolbox or Signal Processing Toolbox licenses at a time? At peak times, all such licenses are usually checked out.) - The Parallel Computing Toolbox on your local machine only allows you to use a maximum of 12 workers simultaneously. With MDCS, you can define parallel jobs using more than 12 workers and running across different hosts (compute nodes). Moreover, the Parallel Computing Toolbox provides utilities which simplify the parallelization of MATLAB code. E.g., for "embarrassingly parallel" (aka task-parallel) problems, which are quite common in practice (parameter sweeps, data analyses where the same operations are performed on a set of data, etc.), the parfor loop and other tools allow for an easy and rather efficient parallelization. Similarly, for communicating (aka data-parallel) jobs, which are the MATLAB analogue of MPI jobs (implementing the "single program, multiple data" paradigm), tools for "automated" parallelization are available, too, in particular for Linear Algebra operations (distributed and co-distributed arrays). In conjunction with MDCS and the possibility to use the powerful cluster resources, this can lead to a significant speed-up of your MATLAB computations.
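As an illustration of how little code a task-parallel construct requires, here is a minimal parfor sketch (the example computation and variable names are made up for this illustration and are not part of the MDCS setup):

% Hypothetical example: largest real eigenvalue part of many random matrices.
% With a matlabpool open (or inside a batch job), the iterations of the
% parfor loop are distributed automatically across the available workers.
nSamples = 200;
maxEig   = zeros(nSamples, 1);
parfor i = 1:nSamples
    A = randn(500);                  % independent work per iteration
    maxEig(i) = max(real(eig(A)));   % one scalar result per iteration
end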
Therefore, all MATLAB computations on the clusters should by default be done via MDCS. Of course, it is still possible to submit MATLAB jobs as regular SGE jobs (by writing the command that starts MATLAB in non-interactive mode into the submission script and supplying the necessary input arguments and files), but this strategy is deprecated. Due to the limited number of available MATLAB and Toolbox licenses, it leads to strong competition between HPC users and other MATLAB users across the University. Moreover, such cluster jobs often fail immediately after they start running because no free licenses are available (a reliable, full license integration with SGE is non-trivial to implement). With MDCS, this "license availability problem" does not exist: SGE keeps track of the total number of workers currently in use, and if the required resources are not available, the job simply stays in the queue like any other batch job.
To use any of the functionalities of MDCS, you need to have the Parallel Computing Toolbox (PCT) installed on your local system.
Prerequisites
Being able to use MDCS and to submit jobs to the cluster from within a local MATLAB session requires a few preparations, both on a system-wide (host) basis and on the per-user level:
- On your local machine (PC, workstation, notebook, ...), you must have one of the following MATLAB releases installed:
- R2010b
- R2011a
- R2011b
Unfortunately, our current license status does not allow us to install more recent releases of MDCS. It is important that your local installation includes all Toolboxes, in particular the Parallel Computing Toolbox (PCT). It is recommended to install MATLAB from the media provided by the IT Services on their website and to follow the instructions there; in that case, all required components (including the PCT) are installed automatically.
- Your local machine must be able to connect to the login nodes of the clusters (flow.uni-oldenburg.de or hero.uni-oldenburg.de) via ssh.
- The system administrator responsible for your local machine must install a number of files into the directory
matlabroot/toolbox/local
on a Unix/Linux system, or the analogous location for Windows clients. These are the SGE integration files provided by MDCS, which have been customized for our clusters. The names of the files are:
createSubmitScript.m
destroyJobFcn.m
distributedJobWrapper.sh
distributedSubmitFcn.m
extractJobId.m
getJobStateFcn.m
getRemoteConnection.m
getSubmitString.m
parallelJobWrapper.sh
parallelSubmitFcn.m
startup.m
The required SGE integration files can be downloaded as a single zipped archive. The only file that needs modification is startup.m; there, the fully-qualified hostname of your local machine must be substituted.
- The following steps must be completed by each user. Start a new MATLAB session and bring up the Parallel Configurations Manager by selecting Parallel -> Manage Configurations. Create a new "Generic Scheduler" configuration (File -> New -> Generic):
The following fields and boxes must be filled out in the "Scheduler" tab:
- Configuration Name: MATLAB name of the configuration (e.g., HERO or FLOW, but could be any name)
- Root folder of MATLAB installation for workers: E.g., /cm/shared/apps/matlab/r2011b if you have R2011b installed on your local machine
- Folder where job data is stored: A directory on your local machine, typically inside your home directory, where you have read and write access (e.g., /home/myaccount/MATLAB/R2011b/jobData)
- Function called when submitting parallel jobs: Here you must specify a cell array containing a function handle, the submit host on the cluster, and the location on the remote system where job data are stored (analogous to the "local" job data directory). If you are a user of HERO, the entry could look as follows:
{@parallelSubmitFcn,'hero.hpc.uni-oldenburg.de','/user/hrz/abcd1234/MATLAB/R2011b/jobData'}
- Function called when submitting distributed jobs: Same as above, but with the function handle distributedSubmitFcn
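For the HERO example above, the entry would then look analogous to the parallel case, e.g.:
{@distributedSubmitFcn,'hero.hpc.uni-oldenburg.de','/user/hrz/abcd1234/MATLAB/R2011b/jobData'}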
- Cluster nodes' OS: Select "unix"
- Function called when destroying a job: Enter destroyJobFcn
- Function called when getting the job state: Enter getJobStateFcn
- Job data location is accessible from both client and cluster nodes: Select "False"
After these preparations, you can validate your setup by pressing the Start Validation button in the Configurations Manager.
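Once the validation has passed, the new configuration can be used like any other parallel configuration. As a sketch (the configuration name HERO is the one chosen above, and myscript is a placeholder for your own script), you can either make it the default or select it explicitly per job:

defaultParallelConfig('HERO');                      % make HERO the default configuration
job = batch('myscript', 'configuration', 'HERO');   % or select the configuration explicitly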
Basic MDCS usage (example for submitting a task-parallel job)
Typical example of an "embarrassingly parallel" problem (a "task-parallel" job in MATLAB terminology): a parameter sweep of a 2nd-order ODE (damped harmonic oscillator).
The ODE is defined in odesystem.m.
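The exact contents of odesystem.m are not reproduced here; a minimal sketch of such a function (assuming mass m, damping b, and stiffness k as parameters) could look as follows:

function dy = odesystem(t, y, m, b, k)
% Damped harmonic oscillator m*x'' + b*x' + k*x = 0,
% rewritten as a first-order system with y(1) = x and y(2) = x'.
dy = [ y(2);
       -(b/m)*y(2) - (k/m)*y(1) ];
end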
The parameter sweep is done in paramSweep_batch.m. The independent loop iterations are automatically distributed across 16 workers (one worker runs the script itself, the other 15 form the matlabpool):
job = batch('paramSweep_batch', 'matlabpool', 15, 'FileDependencies', {'odesystem.m'});
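For orientation, the sweep script could look roughly like the following sketch (the mass, the parameter ranges, and the variable names bVals, kVals, peakVals, and t1 are assumptions, chosen to match the analysis and visualization steps below):

tic;                                   % measure the runtime of the sweep
m = 5;                                 % fixed mass
bVals = 0.1:0.05:5;                    % damping values to sweep
kVals = 1.5:0.05:5;                    % stiffness values to sweep
[bGrid, kGrid] = meshgrid(bVals, kVals);
peakVals = nan(size(bGrid));
parfor idx = 1:numel(bGrid)            % independent iterations -> distributed over the workers
    [~, Y] = ode45(@(t,y) odesystem(t, y, m, bGrid(idx), kGrid(idx)), [0 25], [0 1]);
    peakVals(idx) = max(Y(:,1));       % peak displacement for this (b,k) pair
end
t1 = toc;                              % total runtime, retrieved later via jobData.t1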
Check the state of the job via the Job Monitor or in the command window:
job.State
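If you prefer to block your local session until the job has finished before retrieving the results, you can wait for it explicitly (in the releases listed above this is done with waitForState; newer releases use wait instead):

waitForState(job, 'finished');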
Analyze results:
jobData = load(job);
Runtime:
jobData.t1
Visualization:
figure;
f = surf(jobData.bVals, jobData.kVals, jobData.peakVals);
set(f, 'LineStyle', 'none');
set(f, 'FaceAlpha', 0.5);
xlabel('Damping'); ylabel('Stiffness'); zlabel('Peak Displacement');
view(50, 30);
Clean up (in the MATLAB releases listed above, jobs are removed with destroy; newer releases use delete instead):
destroy(job);
All files can be downloaded from ...
Advanced usage: Specifying resources
Old (non-MDCS) MATLAB usage
To submit a MATLAB job, you must first load the environment module in your submission script:
module load matlab
This automatically loads the newest version, if several versions are installed. After that, invoke MATLAB in batch mode:
matlab -nosplash -nodesktop -r mymatlab_input
where mymatlab_input.m (a so-called ".m" file) is a plain-text file containing the sequence of MATLAB commands that you would normally enter in an interactive session.
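A complete submission script built around these two commands could look roughly as follows (the job name, the resource request, and the name of the .m-file are placeholders; adjust them to your needs, and make sure mymatlab_input.m ends with an exit or quit statement so that the MATLAB session terminates when the commands are done):

#!/bin/bash
#$ -N matlab_job            # job name
#$ -cwd                     # run in the current working directory
#$ -l h_rt=01:00:00         # requested wall-clock time
#$ -j y                     # merge stdout and stderr

module load matlab
matlab -nosplash -nodesktop -r mymatlab_input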
Slides and links from the last MATLAB workshop at the University of Oldenburg (19.02.2013)
Slides
Links
- Parallel Computing Toolbox
- Distributed Computing Server
- Distributed Computing Server Webinars
- Material for Academia