STATA

From HPC users

STATA is a complete software package offering statistical tools for data analysis, data management and graphics. On the local HPC system we offer the multiprocessor variant STATA/MP 13, licensed for up to 12 cores; the license allows up to 5 users to work with STATA at the same time. STATA/MP uses the paradigm of symmetric multiprocessing (SMP) to benefit from the parallel capabilities offered by many modern computers and HPC systems and thereby speed up computations.

Accessing the HPC System

If you are not used to working in such an environment, it might at first feel somewhat intricate and long-winded. However, experience tells that the learning curve for such environments is generally quite steep, and you will soon become more efficient in using them.

Logging in to the HPC System

Advice on how to log in to the HPC system from either within or outside the University can be found here. If you use a Windows desktop machine, a comfortable way to log in to the HPC system is the PuTTY client. A guide on where to download the freely available PuTTY software and how to use it to access a Unix environment can be found here. Once you have downloaded the software (basically you will only need the binary for the secure shell, called PuTTY.exe), you might follow the steps discussed in the paragraph "Launch PuTTY and configure for the target system" in Sect. "1. Installation and config" found under the above link. Then, in the section where you are asked to specify the destination you want to connect to, provide the following information:

 Host Name: hero.hpc.uni-oldenburg.de
 Port: 22
 Connection type: SSH

Below the PuTTY Client interface for the login procedure is shown:

[Screenshot: PuTTY configuration dialog for the login procedure]

Commonly used Unix commands

On the HPC system you need to use a command-line interface to navigate within your home directory and to submit your jobs. A list of useful Unix commands and a brief tutorial on how to use them can be found here.
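As a minimal illustration (these are generic Unix commands, not specific to the HPC system), a typical sequence of creating, inspecting and removing files might look like this:

```shell
# Create a directory, enter it, create a file, inspect, and clean up again
mkdir -p demo            # -p: no error if the directory already exists
cd demo
pwd                      # print the current working directory
touch results.log        # create an empty file
ls -l                    # list the directory contents in long format
cd ..
rm -r demo               # remove the directory and its contents
```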

Editing text files

From time to time you might want to edit a file located somewhere in your HPC home directory. For this purpose you need a text editor. A simple, easy-to-learn editor available on the HPC system is, e.g., nano.

Loading the STATA module

On the HPC system, the STATA/MP 13 software package is available as a software module. In order to load the respective module just type

 module load stata

Then you can find the following STATA variants in your user environment:

  • stata: a version of STATA that handles small datasets
  • stata-se: a version of STATA for large datasets
  • stata-mp: a fast version of STATA for multicore/multiprocessor machines

More details on the different versions can be found here.

Using STATA on the HPC system

To facilitate bookkeeping, a good first step towards using STATA on the HPC system is to create a directory in which all STATA related computations are carried out. Using the command

 mkdir stata

you might thus create the folder stata in the top level of your home directory for this purpose (you might even go further and create a subdirectory mp13 specifying the precise version of STATA).
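The two steps just suggested can be combined: with the -p option, mkdir creates nested directories in one go (the names stata and mp13 follow the suggestion above):

```shell
# Create the STATA working directory, including the version-specific
# subdirectory, in a single step; then change into it
mkdir -p ~/stata/mp13
cd ~/stata/mp13
pwd
```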

Submitting a job: Single-slot variant

You might submit your STATA do-file using a job submission script similar to the script mySubmissionScript.sge listed below (with annotated line-numbers):

 
  1 #!/bin/bash
  2 
  3 #$ -S /bin/bash
  4 #$ -cwd
  5 
  6 #$ -l h_rt=0:10:0
  7 #$ -l h_vmem=300M
  8 #$ -l h_fsize=100M
  9 #$ -N stata_linReg_test
 10 
 11 module load stata
 12 /cm/shared/apps/stata/13/stata -b linReg.do
 13 mv linReg.log ${JOB_NAME}_jobId${JOB_ID}_linReg.log
  

Therein, lines 6 through 8 allocate the job requirements in terms of the resources running time (h_rt), memory (h_vmem) and scratch space (h_fsize). These and further resource allocation statements used in the job submission script are explained here. Line 9 sets a name for the job. The module containing the STATA software package is loaded in line 11; you need to load this module in every job submission script used to submit STATA jobs. In line 12 the STATA program is called in batch mode and the do-file is supplied (here the linear regression example do-file linReg.do set up previously). As a detail, note that the absolute path to the STATA executable is provided (you can verify the path by simply typing which stata on the command line after the respective module is loaded). By default, STATA creates a log file with a standardized name; here, for the do-file linReg.do, STATA will create the default log file linReg.log. If you call the underlying do-file several times, your results will be overwritten each time. It might therefore be useful to change the standard log file name to include the actual name of the job and the unique job ID assigned by the scheduler, as is done in line 13. You can submit the script by simply typing

 qsub mySubmissionScript.sge
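The renaming in line 13 of the script can be illustrated in isolation; here the scheduler variables JOB_NAME and JOB_ID are set by hand to mimic the values the scheduler would provide at runtime:

```shell
# Mimic the environment the scheduler sets up for the job
JOB_NAME=stata_linReg_test
JOB_ID=909537

# STATA would create linReg.log; simulate that file and rename it as in line 13
touch linReg.log
mv linReg.log "${JOB_NAME}_jobId${JOB_ID}_linReg.log"
ls "${JOB_NAME}_jobId${JOB_ID}_linReg.log"   # → stata_linReg_test_jobId909537_linReg.log
```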

As soon as the job is enqueued you can check its status by typing qstat on the command line. Immediately after submission you might obtain the output

 
job-ID  prior   name       user         state submit/start at     queue                  slots ja-task-ID 
---------------------------------------------------------------------------------------------------------
 909537 0.00000 stata_linR alxo9476     qw    09/02/2013 12:45:41                            1        
  

According to this, the job with ID 909537 has priority 0.00000 and resides in state qw, loosely translated as "enqueued and waiting". The above output also indicates that the job requires 1 slot. The column for the ja-task-ID, referring to the ID of a particular task stemming from the execution of a job array, is empty (we did not submit a job array but a single job). Soon after, the priority of the job will take a value between 0.5 and 1.0 (usually only slightly above 0.5), slightly increasing until the job starts. In case the job has already finished, it is possible to retrieve information about it by using the qacct command-line tool, see here.

After the job has terminated successfully, the STATA log file stata_linReg_test_jobId909537_linReg.log is available in the directory from which the job was submitted. It contains a log of all the commands used in the STATA session and a summary of the linear regression carried out therein. Further, the directory contains the two files stata_linReg_test.o909537 and stata_linReg_test.e909537, containing additional output sent to the standard output and error streams, respectively.

Submitting a job: Multi-slot variant

On the local HPC system, the concept of "slots" is used rather than "cores"; hence the title of this subsection refers to the "multi-slot" variant of using STATA. To benefit from the parallel capabilities offered by many modern computers and HPC systems and to speed up computations, STATA/MP uses the paradigm of symmetric multiprocessing (SMP). A performance report for a multitude of commands implemented in the STATA software package, highlighting the benefit of multiprocessing, can be found here. As pointed out above, the HPC system offers STATA/MP 13, licensed for up to 12 cores.

A proper job submission script by means of which you can use the multiprocessing capabilities of STATA/MP, here called mySubmissionScript_mp.sge, is listed below (with annotated line numbers):

 
  1 #!/bin/bash
  2 
  3 #$ -S /bin/bash
  4 #$ -cwd
  5 
  6 #$ -l h_rt=0:10:0
  7 #$ -l h_vmem=500M
  8 #$ -l h_fsize=1G
  9 #$ -N stata_linReg_test_smp
 10 
 11 #$ -pe smp 3
 12 #$ -R y
 13 
 14 export OMP_NUM_THREADS=$NSLOTS
 15 module load stata
 16 /cm/shared/apps/stata/13/stata-mp -b linReg.do
 17 mv linReg.log ${JOB_NAME}_jobId${JOB_ID}_linReg.log 
  

Note that, in comparison to the single-slot submission script listed in the preceding subsection, you need to specify a few more things. First, in line 11 the SMP parallel environment (PE) is requested and 3 slots are reserved. The option -R y in line 12 enables resource reservation in order to avoid starvation of parallel jobs by serial jobs which "block" required slots on specific hosts. All further resource allocation statements used in the job submission script are explained here. At this point, note that the SMP parallel environment uses environment variables to control the execution of parallel jobs at runtime. Setting these is as easy as setting other environment variables, and the precise syntax depends on which shell you are actually using. Above we specified a bash shell; thus, setting a value for the environment variable OMP_NUM_THREADS is done via

export OMP_NUM_THREADS=$NSLOTS

as can be seen in line 14. Also note that here, the multiprocessing variant stata-mp of the STATA program is called in line 16. Now, typing

qsub mySubmissionScript_mp.sge

enqueues the job, assigning the job ID 909618 in this case. As soon as the job is in state running, one can get an idea of where the parallel threads are running. In this case the query qstat -g t (which lists where exactly the job is running) yields

 
job-ID  prior   name       user         state submit/start at     queue                  master ja-task-ID 
----------------------------------------------------------------------------------------------------------
 909618 0.50535 stata_linR alxo9476     r     09/02/2013 16:27:29 mpc_std_shrt.q@mpcs101 MASTER        
                                                                  mpc_std_shrt.q@mpcs101 SLAVE         
                                                                  mpc_std_shrt.q@mpcs101 SLAVE         
                                                                  mpc_std_shrt.q@mpcs101 SLAVE  
  

Note that it is no coincidence that all subprocesses are running on the same execution host (here, the host mpcs101). The SMP parallel environment is somewhat special: it requires all the requested slots to be available on a single execution host, see here. A single execution host offers 12 slots, which perfectly fits the 12-core version of STATA/MP the HPC system offers. However, note that there are many users working on HERO at the same time. Upon submission you hand the job over to the scheduler, which determines an execution host on which your job will be executed. In cases where the cluster is used extensively, it might take some time until the resources you specified are available and a proper host for the execution of your job is free. Until then, the job will reside in the queue. Typically, the more slots you request, the longer it will take until your job finally starts. Hence, it might be a good idea not to always request 12 slots (i.e. an entire execution host), but a lower number that still lets you benefit from the multiprocessing capabilities of STATA/MP.

During the execution of your program, a STATA log file (here: stata_linReg_test_smp_jobId909618_linReg.log) and four other log files are created: stata_linReg_test_smp.o909618 and stata_linReg_test_smp.e909618, which store job-related data sent to the standard output and error streams, as well as stata_linReg_test_smp.po909618 and stata_linReg_test_smp.pe909618, specifying the output and error logs for the startup of the parallel environment. The latter two files are of importance only if the job fails to start and the reason is related to the startup of the parallel environment.

Note that, in principle, there is a small time overhead related to the startup of the parallel environment. Hence, for a job that finishes fast anyway, such as the basic linear regression example above, the use of the multiprocessing variant of STATA might not pay off.

Satisfying extended resource requirements

If you submit a STATA job using the job submission scripts from the preceding subsections, the scheduler assigns the job to one of the standard nodes on HERO. The total amount of memory available on these nodes is limited to 23G. Note that by using one of the job submission scripts shown above your job will only run if

  • the specified running time h_rt is smaller than 192h, i.e. 8 days. If your job needs to run longer, you need to explicitly specify this, see here.
  • the overall amount of memory, i.e. the number of requested slots times h_vmem (which is requested per slot), is smaller than 23G. If your job needs an overall amount of memory which exceeds 23G, you need to request one of the big nodes as execution host, see here.
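The second condition is simple arithmetic: the overall request is the number of slots times h_vmem. A quick sanity check before submitting might look as follows (the 23G limit is the one quoted above; the slot and memory values are just examples):

```shell
# Overall memory = number of slots x h_vmem (requested per slot);
# compare against the 23G available on a standard node
SLOTS=6
H_VMEM_G=5                      # h_vmem in gigabytes
TOTAL=$((SLOTS * H_VMEM_G))
echo "total: ${TOTAL}G"         # → total: 30G
if [ "$TOTAL" -gt 23 ]; then
    echo "exceeds 23G: request a big node (bignode=True)"
fi
```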

In principle, all possible resource allocation statements are summarized here. How your job submission script (either the single-slot or multi-slot variant) should look in order to cope with extended resource requirements for the particular cases of long running times and high memory consumption is outlined subsequently.

Extensive running time

If your job is expected to run longer than 192h (i.e. 8 days) you need to tell the scheduler that it is an especially long-running job. For this you need to set the flag longrun=True. The corresponding single-slot variant of the job submission script, here called mySubmissionScript_longrun.sge, should then look similar to the one listed below (with annotated line numbers):

  
  1 #!/bin/bash
  2 
  3 #$ -S /bin/bash
  4 #$ -cwd
  5 
  6 #$ -l longrun=True
  7 #$ -l h_rt=200:0:0
  8 #$ -l h_vmem=300M
  9 #$ -l h_fsize=100M
 10 #$ -N stata_linReg_test_long
 11 
 12 module load stata
 13 /cm/shared/apps/stata/13/stata -b linReg.do
 14 mv linReg.log ${JOB_NAME}_jobId${JOB_ID}_linReg.log
  

Note that, in order to indicate a job with an extended running time, the longrun flag is set to True in line 6.

Correspondingly, the multi-slot variant of the job submission script, here called mySubmissionScript_longrun_smp.sge, should look similar to the following (with annotated line numbers):

 
  1 #!/bin/bash
  2 
  3 #$ -S /bin/bash
  4 #$ -cwd
  5 
  6 #$ -l longrun=True
  7 #$ -l h_rt=200:0:0
  8 #$ -l h_vmem=500M
  9 #$ -l h_fsize=1G
 10 #$ -N stata_linReg_test_longrun_smp
 11 
 12 #$ -pe smp_long 3
 13 #$ -R y
 14 
 15 export OMP_NUM_THREADS=$NSLOTS
 16 module load stata
 17 /cm/shared/apps/stata/13/stata-mp -b linReg.do
 18 mv linReg.log ${JOB_NAME}_jobId${JOB_ID}_linReg.log
  

Again, note the longrun flag in line 6 and further, note that the proper name of the parallel environment in line 12 now reads smp_long.

High memory consumption

If the overall amount of memory your job is expected to consume, i.e. the number of requested slots times h_vmem (which is requested per slot), exceeds 23G, or if your job requires more than 800G of scratch space, your job cannot run on a standard node on HERO. In such a situation you need to request a big node as execution host. For this you need to set the flag bignode=True. The corresponding single-slot variant of the job submission script, here called mySubmissionScript_bignode.sge, should then look similar to the one listed below (with annotated line numbers):

  
  1 #!/bin/bash
  2 
  3 #$ -S /bin/bash
  4 #$ -cwd
  5 
  6 #$ -l bignode=True
  7 #$ -l h_rt=0:10:0
  8 #$ -l h_vmem=30G
  9 #$ -l h_fsize=100M
 10 #$ -N stata_linReg_test_big
 11 
 12 module load stata
 13 /cm/shared/apps/stata/13/stata -b linReg.do
 14 mv linReg.log ${JOB_NAME}_jobId${JOB_ID}_linReg.log  

Note that, in order to indicate a job with an extended memory requirement (here a single-slot job requesting 30G memory), the bignode flag is set to True in line 6.

Correspondingly, the multi-slot variant of the job submission script, here called mySubmissionScript_bignode_smp.sge, should look similar to the following (with annotated line numbers):

 
  1 #!/bin/bash
  2 
  3 #$ -S /bin/bash
  4 #$ -cwd
  5 
  6 #$ -l bignode=True
  7 #$ -l h_rt=0:10:0
  8 #$ -l h_vmem=5G
  9 #$ -l h_fsize=10G
 10 #$ -N stata_linReg_test_big_smp
 11 
 12 #$ -pe smp 6
 13 #$ -R y
 14 
 15 export OMP_NUM_THREADS=$NSLOTS
 16 module load stata
 17 /cm/shared/apps/stata/13/stata-mp -b linReg.do
 18 mv linReg.log ${JOB_NAME}_jobId${JOB_ID}_linReg.log  

Again, note the bignode flag in line 6. Here, a 6-slot job with 5G per slot was requested. Finally, note that you can also combine the resource requirements bignode and longrun. In that case, keep in mind that for the latter the proper name of the parallel environment is smp_long.
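A sketch of such a combined submission script (both flags set, parallel environment smp_long, and example resource values carried over from the scripts above; the file name mySubmissionScript_bignode_longrun_smp.sge is just a suggestion) might read:

```
#!/bin/bash

#$ -S /bin/bash
#$ -cwd

#$ -l longrun=True
#$ -l bignode=True
#$ -l h_rt=200:0:0
#$ -l h_vmem=5G
#$ -l h_fsize=10G
#$ -N stata_linReg_test_big_long_smp

#$ -pe smp_long 6
#$ -R y

export OMP_NUM_THREADS=$NSLOTS
module load stata
/cm/shared/apps/stata/13/stata-mp -b linReg.do
mv linReg.log ${JOB_NAME}_jobId${JOB_ID}_linReg.log
```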

Checking the status of a job

After you have submitted a job, the scheduler assigns it a unique job ID. You might then use the qstat tool in conjunction with the job ID to check the current status of the respective job. Details on how to check the status of a job can be found here. In case the job has already finished, it is possible to retrieve information about it by using the qacct tool, see here.

Mounting your home directory on Hero

Consider a situation where you would like to transfer a large amount of data to the HPC system in order to analyze it with STATA, or, conversely, to transfer lots of already processed data from your HPC account to your local workstation. In such cases it is useful to mount your home directory on the HPC system in order to cope conveniently with the task. Details about how to mount your HPC home directory can be found here.

Benchmark for the particular command xtgls

A very nice feature of the STATA/MP software package is that it is very well documented. In this regard, a document exhaustively addressing how well the running time of a particular STATA command scales with an increasing number of slots (or cores, for that matter) can be found here. Below, a benchmark test of the STATA command xtgls, used to fit panel-data linear models by feasible generalized least squares (for further explanation, see here), carried out on the local HPC system for a big set of input data, is presented.

To estimate the running time of the single-slot variant of the corresponding job, the STATA variant stata-se is used (since the problem instance is rather large). For all multi-slot jobs, the variant stata-mp is used. For comparison, consider the documentation of the behavior of the xtgls command under parallel usage on page 33, illustrated as Fig. 416 on page 139 of the above-mentioned document. Note that therein a different data set was considered for the benchmark. Also note that not all of the commands implemented in STATA are parallelizable to the same extent. A measure for this is the percentage of parallelization, which, according to the above-mentioned document, amounts to 64% for the xtgls command. This sets an upper bound on the speedup that can theoretically be gained by using multiple cores; in case of the xtgls command, this theoretical maximum speedup is about 2.8. However, note that the percentage of parallelization is an experimentally observed quantity, assessed for the particular data set used by the STATA staff.
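The quoted upper bound follows from Amdahl's law: with a parallel fraction p = 0.64, the speedup on n cores is S(n) = 1/((1-p) + p/n), which tends to 1/(1-p) ≈ 2.78 as n grows. A small sketch (awk serves here merely as a convenient calculator):

```shell
# Amdahl's law for the xtgls parallel fraction p = 0.64
awk 'BEGIN {
    p = 0.64
    for (n = 1; n <= 8; n *= 2)
        printf "n=%d  S=%.2f\n", n, 1 / ((1 - p) + p / n)
    printf "limit  S=%.2f\n", 1 / (1 - p)   # → limit  S=2.78
}'
```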

Below, the running times (in seconds) of the xtgls command as a function of the number of slots used are tabulated. Note that these numbers were observed on the local HPC system for a different data set than that used by the STATA staff. This should be taken into consideration when comparing the speedup to the theoretical upper limit.

Number of slots Running time (sec.) Speedup
1 5798 (96 min) 1.0
2 4882 (81 min) 1.19
3 4500 (75 min) 1.29
4 3615 (60 min) 1.6
5 3622 (60 min) 1.6
6 3258 (54 min) 1.78
7 3309 (55 min) 1.75
8 3047 (50 min) 1.9
9 2959 (49 min) 1.96
10 3003 (50 min) 1.93
11 2967 (49 min) 1.95
12 2870 (47 min) 2.02
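The speedup column is simply the single-slot running time divided by the N-slot running time; e.g. for N = 12 (again using awk as calculator):

```shell
# Speedup for N = 12 slots, using the measured times from the table above
T1=5798      # running time with 1 slot (seconds)
T12=2870     # running time with 12 slots (seconds)
awk -v t1="$T1" -v tn="$T12" 'BEGIN { printf "speedup: %.2f\n", t1 / tn }'
# → speedup: 2.02
```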

Further, the running time and the speedup relative to the single slot job are illustrated in the subsequent figures:

[Figure: wall-clock running time and speedup of xtgls as a function of the number of slots]

The speedup for the xtgls command relative to the single-slot job attains its maximum value of approximately 2 for the maximally allowed number of 12 slots. Although this seems inferior compared to the ideal speedup, where doubling the number of slots would double the speedup, it is characteristic for applications that can only be parallelized up to some extent. Here, as quoted in the documentation of the STATA/MP commands, the maximum speedup that might theoretically be expected in the limit of an infinite number of slots is about 2.8. Note that there are other commands in the STATA software package that behave much better under parallelization than xtgls; however, there are also ones that parallelize much worse.

One of the users observed that, using STATA/MP with 2 cores on his laptop, the running time of the above application amounts to only 50 minutes, as compared to 81 minutes on a computing node of the local HPC system (using 2 slots). A natural guess is that this is due to the age of the hardware components: if the architecture of the laptop is newer than that of the HPC components, the difference in running time might be explained by this. To assess how the running time of the above STATA/MP application depends on the underlying hardware architecture, we ran several benchmark tests on four different hardware components. Incidentally, we had a test system from the vendor pro-com GmbH available, which allowed us to do so. The four hardware components we considered for the benchmark tests were:

Short name Hardware type Approx. release date
GOLEM Quad-Core AMD Opteron Proc. 2376, 2.2GHz (2x4 cores), 16GB 2009
HERO (Westmere) Intel Xeon X5650, 2.66GHz (2x6 cores), 48GB 2010
Sandy-Bridge Intel Xeon E5-2670, 2.60GHz (2x8 cores), 64GB 2011
Ivy-Bridge Intel Xeon E5-2680 v2, 2.80Ghz (2x10 cores), 64GB 2012

Restricting the STATA/MP software to use only 2 slots (and 4 slots), we found the following running times for the above application:

N (# slots) GOLEM HERO Sandy-Bridge Ivy-Bridge
2 77m42.196s 71m33.623s 32m59.799s 34m17.908s
4 70m52.00s 65m22.00s 31m11.718s 32m44.422s

Hence, the observed difference in running times can be explained well by the difference in hardware. A dramatic decrease in running time (at fixed N) can be observed for the newer architectures (GOLEM is the oldest hardware component and the Ivy-Bridge architecture the most recent one). In particular, between 2010 (Westmere architecture; used by the current HPC system) and 2011 (Sandy-Bridge architecture) there was an improvement of the underlying microarchitecture, see here.

Note that we also performed a further benchmark, based solely on one of STATA's default datasets (so the test is less specialized than the above application and might be reproduced by any STATA user on basically any system on which STATA is available). The details of this benchmark can be found below.

A more general benchmark

We also performed a further benchmark, based solely on one of STATA's default datasets. Hence, the test is less specialized than the above application and might be reproduced by any STATA user on basically any system on which STATA is available. The considered dataset is called auto (an example dataset containing 1978 automobile data; further default datasets can be listed via the command sysuse dir), and the STATA do-file, here called benchmark_auto.do, used for the test reads:

 
clear all
args nProc                  // number of processors, passed on the command line
set processors `nProc'
set memory 1024m
sysuse auto                 // load STATA's shipped example dataset "auto"
expand 10000                // enlarge the dataset (74 -> 740,000 observations)

creturn list                // list system parameters to verify the setup

bootstrap: regress mpg weight gear foreign
  

As an input parameter, the do-file accepts the number of processors that will be used by STATA to perform the computation. Initially the dataset is rather small; to be precise, it contains only 74 observations. Therefore, the overhead of parallelizing the process for a given number of CPUs is rather large compared to the problem size. Thus, so as to put emphasis on the benefits obtained via parallelization, we expanded the size of the dataset. The command creturn list was used to list the system parameters (and further settings) to check whether STATA was invoked correctly during the benchmarks.

To submit the do-file we used a job submission script similar to the following (called myBenchmark_HERO.sge):

 
#!/bin/bash

#$ -S /bin/bash
#$ -cwd

#$ -l h_rt=01:00:0
#$ -l h_vmem=1.5G
#$ -l h_fsize=1G
#$ -N stata_benchmark_hero

#$ -pe smp 2
#$ -R y

module load stata

time /cm/shared/apps/stata/13/stata-mp -b  benchmark_auto.do $NSLOTS 
mv benchmark_auto.log benchmark_auto_HERO_N${NSLOTS}.log.${JOB_ID}
  

After successful termination of the job there are two important files that detail the running time and the output of the job, i.e. stata_benchmark_hero.e<JobId> and benchmark_auto_HERO_N2.log.<JobId>, respectively.

E.g., considering N=2, the latter file contains (among other things)

 
[…]
System values
-------------

    ---------------------------------------------------------------------------
        c(current_date) = "18 Oct 2013"
        c(current_time) = "11:27:39"
           c(rmsg_time) = 0                          (seconds, from set rmsg)
    ---------------------------------------------------------------------------
       c(stata_version) = 13
             c(version) = 13                         (version)
    ---------------------------------------------------------------------------
           c(born_date) = "17 Jun 2013"
              c(flavor) = "IC"
                 c(bit) = 64
                  c(SE) = 1
                  c(MP) = 1
          c(processors) = 2                          (Stata/MP, set processors)
      c(processors_lic) = 12
     c(processors_mach) = 12
      c(processors_max) = 12
                c(mode) = "batch"
             c(console) = "console"
    ---------------------------------------------------------------------------
                  c(os) = "Unix"
               c(osdtl) = ""
            c(hostname) = "mpcs035"
        c(machine_type) = "PC (64-bit x86-64)"
           c(byteorder) = "lohi"
            c(username) = "alxo9476"
    ---------------------------------------------------------------------------
[...]
Current dataset
---------------

    ---------------------------------------------------------------------------
                   c(N) = 740000
                   c(k) = 12
               c(width) = 43
             c(changed) = 1
            c(filename) = "/cm/shared/apps/s.."
            c(filedate) = "13 Apr 2013 17:45"
    ---------------------------------------------------------------------------

[...]
  

To highlight just a few of these entries: it lists the number of processors used by STATA (keyword c(processors)), the name of the host on which the computations are carried out (keyword c(hostname)), and the number of entries in the currently considered dataset (keyword c(N)).
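To check such entries without wading through the whole log, grep comes in handy. The following sketch works on a small sample file in place of the real log (the keyword names are those shown above; the sample values are made up for the example):

```shell
# Create a small sample file in the format of the creturn output above
cat > sample.log <<'EOF'
          c(processors) = 2
            c(hostname) = "mpcs035"
                   c(N) = 740000
             c(changed) = 1
EOF

# Extract only the entries of interest
grep -E 'c\((processors|hostname|N)\) =' sample.log
```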

The running times resulting from running the above do-file on the four different hardware types, considering 1 through 12 CPUs (slots), are listed below (note that the GOLEM quad-core nodes offer a maximum of 8 slots only):

N (# of slots) GOLEM HERO Sandy-Bridge Ivy-Bridge
1 2m1.685s 1m22.892s 0m52.818s 0m51.799s
2 2m1.182s 1m8.083s 0m45.075s 0m44.326s
3 1m40.085s 1m5.990s 0m35.493s 0m36.416s
4 1m29.342s 0m55.530s 0m34.676s 0m33.026s
5 1m15.151s 0m56.992s 0m29.904s 0m30.762s
6 1m15.533s 0m44.624s 0m30.888s 0m29.175s
7 1m7.784s 0m39.090s 0m28.292s 0m28.945s
8 1m8.152s 0m38.694s 0m26.873s 0m27.332s
9 - 0m39.084s 0m26.486s 0m27.148s
10 - 0m35.500s 0m26.109s 0m26.318s
11 - 0m39.989s 0m25.690s 0m25.314s
12 - 0m37.644s 0m27.689s 0m25.332s

Again, a dramatic decrease in running time (at fixed N) can be observed for the newer architectures:

[Figure: running times of the auto benchmark on the four hardware architectures]

Troubleshooting

If you experience difficulties running your job on the HPC system, the reason might be one of the following issues:

  • Your job submission script was written on a Windows-based system and copied to a Unix-based system (such as the local HPC system): you might encounter problems related to the circumstance that the character sequence representing the end-of-line in a text file differs between operating systems, see here. A remedy is, e.g., to convert your job submission script (on HERO) using the Unix tool dos2unix, a DOS/MAC to UNIX text file format converter, which is available on HERO. You can find more detailed information by typing man dos2unix on the command line on HERO.
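The effect can be demonstrated even without the actual tool: tr -d '\r' strips the carriage returns that DOS line endings add, mimicking what dos2unix does (the file names are made up for the example):

```shell
# Create a script with DOS line endings (\r\n), as a Windows editor would
printf '#!/bin/bash\r\necho hello\r\n' > script_dos.sge

# Strip the carriage returns; on HERO you would simply run:
#   dos2unix script_dos.sge
tr -d '\r' < script_dos.sge > script_unix.sge

# Verify: od shows the raw bytes; no \r remains in the converted file
od -c script_unix.sge | head -n 2
```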