STATA is a complete statistical software package, offering tools for data analysis, data management and graphics. On the local HPC system, the multiprocessor variant STATA/MP 13 is available, licensed for up to 12 cores.

== Logging in to the HPC System ==

Advice on how to log in to the HPC system, from either within or outside the University, can be found here.

== Loading the STATA module ==

On the HPC system, the STATA/MP 13 software package is available as a software module. In order to load the respective module just type

 module load stata

== Using STATA on the HPC system ==

To facilitate bookkeeping, a good first step towards using STATA on the HPC system is to create a directory in which all STATA-related computations are carried out. Using the command

 mkdir stata

you might thus create the folder stata in the top level of your home directory for this purpose (you might even go further and create a subdirectory mp13 specifying the precise version of STATA).
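
A minimal sketch that creates the suggested directory together with the version subdirectory in one step (assuming it should live at the top level of your home directory) is

 # create the STATA working directory and the version subdirectory in one go
 mkdir -p ~/stata/mp13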

== Using STATA in batch mode ==

On the local HPC system the convention is to use applications in batch mode rather than in interactive mode, as you would on your local workstation. This requires you to list the commands you would otherwise type in STATA's interactive mode in a file, called a do-file in STATA jargon, and to call STATA with the -b option on that do-file. To illustrate how to use STATA in batch mode on the HPC system, consider the basic linear regression example contained in Chapter 1 of the STATA Web Book Regression with STATA. For this example you might further create the subdirectory linear_regression and put the data sets you would like to work on and all further supplementary files and scripts there. A do-file corresponding to the basic linear regression example, here called linReg.do, reads:

 
 * load the example data set (the file elemapi.dta has to reside in the working directory)
 use elemapi
 * basic linear regression of api00 on acs_k3, meals and full
 regress api00 acs_k3 meals full
  

For the do-file to run properly, the data file available at http://www.ats.ucla.edu/stat/stata/webbooks/reg/elemapi needs to be stored in the directory linear_regression. Further, if you have not yet loaded the STATA module, you need to load it via

 module load stata

before you attempt to use the STATA application. In principle you could now call STATA in batch mode by typing

 stata -b linReg.do

While this is perfectly fine for small test programs that consume only few resources (in terms of running time and memory), the convention on the HPC system is to submit your job to the scheduler (here the Sun Grid Engine (SGE) is used), which assigns it to a suitable execution host on which the actual computations are carried out. To do so, you have to set up a job submission script by means of which you request certain resources for your job. This is common practice on HPC systems, where multiple users access the available resources at the same time. Examples of such job submission scripts for both single-slot and multi-slot usage are detailed below.

=== Submitting a job: Single-slot variant ===

You might submit your STATA do-file using a job submission script similar to the script mySubmissionScript.sge listed below (with annotated line-numbers):

 
  1 #!/bin/bash
  2 
  3 #$ -S /bin/bash
  4 #$ -cwd
  5 
  6 #$ -l h_rt=0:10:0
  7 #$ -l h_vmem=300M
  8 #$ -l h_fsize=100M
  9 #$ -N stata_linReg_test
 10 
 11 module load stata
 12 /cm/shared/apps/stata/13/stata -b linReg.do
 13 mv linReg.log ${JOB_NAME}_jobId${JOB_ID}_linReg.log
  

Therein, lines 6 through 8 request the resources for the job, namely running time (h_rt), memory (h_vmem) and scratch space (h_fsize). In line 9, a name for the job is set. The module containing the STATA software package is loaded in line 11; you need to load this module in every job submission script used to submit STATA jobs. In line 12, the STATA program is called in batch mode and the do-file is supplied (here the linear regression example do-file linReg.do set up previously). As a detail, note that the absolute path to the STATA executable is provided (you can verify the path by simply typing which stata on the command line after the respective module is loaded). By default, STATA creates a log file with a standardized name; here, for the do-file linReg.do, STATA will create the default log file linReg.log. If you call the underlying do-file several times, this log file is overwritten each time, so it is useful to rename it to include the actual name of the job and the unique job ID assigned by the scheduler, as is done in line 13. You can submit the script by simply typing

 qsub mySubmissionScript.sge

As soon as the job is enqueued, you can check its status by typing qstat on the command line. Immediately after submission you might obtain output like

 
job-ID  prior   name       user         state submit/start at     queue                  slots ja-task-ID 
---------------------------------------------------------------------------------------------------------
 909537 0.00000 stata_linR alxo9476     qw    09/02/2013 12:45:41                            1        
  

According to this, the job with ID 909537 has priority 0.00000 and resides in state qw, loosely translated as "enqueued and waiting". The output also indicates that the job requires 1 slot. The column ja-task-ID, referring to the ID of a particular task stemming from the execution of a job array, is empty here since a single job was submitted rather than a job array. Soon after, the priority of the job will take a value between 0.5 and 1.0 (usually only slightly above 0.5), slightly increasing until the job starts. In case the job has already finished, information about it can be retrieved using the qacct command line tool, see here.

After the job has terminated successfully, the STATA log file stata_linReg_test_jobId909537_linReg.log is available in the directory from which the job was submitted. It contains a log of all the commands used in the STATA session and a summary of the linear regression carried out therein. Further, the directory contains the two files stata_linReg_test.o909537 and stata_linReg_test.e909537, which hold the additional output written to the standard output stream and error stream, respectively.
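
For a quick check that the run went through without problems, you might, for example, glance at the error stream file and the renamed log file (the file names below are those from this example run):

 # the error stream file should usually be empty
 cat stata_linReg_test.e909537
 # page through the STATA log of the regression
 less stata_linReg_test_jobId909537_linReg.log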

=== Submitting a job: Multi-slot variant ===

In order to benefit from the parallel capabilities offered by many modern computers and HPC systems and to speed up computations, STATA/MP uses the paradigm of [http://en.wikipedia.org/wiki/Symmetric_multiprocessing symmetric multiprocessing] (SMP). A performance report for a multitude of commands implemented in the STATA software package, highlighting the benefit of multiprocessing, can be found [http://www.stata.com/statamp/statamp.pdf here]. As pointed out above, the HPC system offers STATA/MP 13, licensed for up to 12 cores. On the local HPC system, the concept of ''slots'' is used rather than "cores", and hence the title of this subsection refers to the "multi-slot" variant of using STATA.
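
Compared to the single-slot script above, a multi-slot submission additionally has to request a parallel environment with the desired number of slots. The following is only a minimal sketch: the name of the parallel environment (here smp) is an assumption and has to be replaced by the name configured on the local cluster (the available names can be listed via qconf -spl), and whether h_vmem is interpreted per slot or per job also depends on the local configuration.

 #!/bin/bash
 #$ -S /bin/bash
 #$ -cwd
 #$ -l h_rt=0:10:0
 #$ -l h_vmem=300M
 #$ -l h_fsize=100M
 #$ -N stata_linReg_mp_test
 # request 12 slots in a parallel environment; the PE name "smp" is an assumption
 #$ -pe smp 12
 module load stata
 /cm/shared/apps/stata/13/stata -b linReg.do
 mv linReg.log ${JOB_NAME}_jobId${JOB_ID}_linReg.log

Inside the do-file you might additionally limit STATA/MP to the requested number of cores via the STATA command set processors 12, so that the job does not use more processors than slots were granted.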

== Checking the status of a job ==

After you have submitted a job, the scheduler assigns it a unique job ID. You might then use the qstat tool in conjunction with the job ID to check the current status of the respective job. Details on how to check the status of a job can be found here. In case the job has already finished, information about it can be retrieved using the qacct tool, see here.
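
For example, for the job ID 909537 used above, the two commands would read:

 # detailed information on a pending or running job
 qstat -j 909537
 # accounting information once the job has finished
 qacct -j 909537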

== Mounting your home directory on Hero ==

Consider a situation in which you would like to transfer a large amount of data to the HPC system in order to analyze it with STATA, or, conversely, in which you would like to transfer lots of already processed data from your HPC account to your local workstation. In such cases it is useful to mount your home directory on the HPC system, so that you can conveniently cope with the transfer. Details about how to mount your HPC home directory can be found here.
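
One common way to do this from a Linux workstation is sshfs; the following is only a rough sketch, in which the host name hero.example.uni and the user name abcd1234 are placeholders, and the procedure recommended for the local system (see the details linked above) may differ.

 # create a local mount point and mount the remote home directory via sshfs
 mkdir -p ~/hero_home
 sshfs abcd1234@hero.example.uni: ~/hero_home
 # unmount again when done
 fusermount -u ~/hero_home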