MDCS Basic Example

From HPC users
Revision as of 14:50, 10 April 2015 by Harfst (talk | contribs) (Created page with "== Basic MDCS usage: Example of a task-parallel job == The following example illustrates how to submit a job to the cluster (FLOW or HERO). After you have entered the <tt>bat...")
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigationJump to search

Basic MDCS usage: Example of a task-parallel job

The following example illustrates how to submit a job to the cluster (FLOW or HERO). After you have entered the batch command and the command returned with no error, you could, e.g., do some other MATLAB computations, close your MATLAB session, or even power down your machine. In the meantime, your MDCS job gets executed "in the background". After it has finished, you can analyse the results, visualize them, etc.

As an elementary example of an "embarrassingly parallel" (or "task-parallel" in MATLAB terminology) problem we take a parameter sweep of a 2nd order ODE (the damped Harmonic Oscillator). The parameters varied are the spring constant (or equivalently, the eigenfrequency) and the damping. For each pair of parameter values and fixed initial condition, the solution of the ODE (up to a certain maximum time) is calculated and the peak value of the amplitude is calculated and stored in an array.

The parameter sweep is achieved in a for loop. This can easily be parallelized using parfor instead. The total execution time of the parfor loop is measured with the tic and toc commands.

Please download the files containing the definition of the ODE system and the MATLAB script for the parameter sweep.

One way to run this code in parallel is to open a Matlabpool of the cluster and then executing the script that performs the parameter sweep. MATLAB then automatically distributes the independent loop iterations across the available workers (or "labs" in this case). Define a "scheduler object" sched which describes your cluster configuration (in the above example setup, the configuration named HERO), and submit the job to the cluster using the batch command:

sched = parcluster('HERO');
job = batch(sched, 'paramSweep_batch', 'Pool', 7, 'AttachedFiles', {'odesystem.m'});

The first time you submit a job to the cluster in a MATLAB session, you will be prompted for your credentials (username, password). Please enter your usual cluster account data. The number of "labs" is 1 + the number specified after the 'Pool' keyword. Thus in the above example, the job would run on eight workers. The specification of the file dependencies by the 'AttachedFiles' keyword is necessary since the script paramSweep_batch.m depends on odesystem.m and thus the latter must be copied to the cluster such that the script can run there (that is one of the purposes of the local and remote "job data" directories that must be specified in the configuration of the scheduler).

Check the state of the job with the Job Monitor (from the Main menue: Parallel -> Job Monitor), or in the command window:

job.State

By typing

job

you get additional useful information like, e.g., the start time of the job (if it has already started running), the current runtime, etc. Another useful command is

sched.getJobClusterData(job)

which will return, among other information, the job-ID from the cluster manager which can be used in turn to get more information later on (with the qacct command).

In order to analyze results after the job has finished, load the job data into the workspace:

jobData=load(job);

Check the runtime of the job:

jobData.t1

You can visualize the results by:

figure;
f=surf(jobData.bVals, jobData.kVals, jobData.peakVals);
set(f,'LineStyle','none');
set(f,'FaceAlpha',0.5);
xlabel('Damping'); ylabel('Stiffness'); zlabel('Peak Displacement');
view(50, 30);

If you no longer need the job, clean up (includes deletion of all files related to the job in the local job data directory):

delete(job);

The following table shows the runtime measured on the HERO cluster as a function of the number of workers:

Number of workers Runtime
1 (serial job) 220s
2 196s
4 69s
8 35s
16 17s
32 8.5s

Obviously, the simple parallelization strategy using the parfor loop leads to a significant speed-up in this case.