Rbatchtools

From HPC users

Revision as of 17:19, 13 July 2021

Introduction

"As a successor of the packages BatchJobs and BatchExperiments, batchtools provides a parallel implementation of Map for high performance computing systems managed by schedulers like Slurm, Sun Grid Engine, OpenLava, TORQUE/OpenPBS, Load Sharing Facility (LSF) or Docker Swarm (see the setup section in the vignette)."[1]

One advantage of batchtools is that you can use it within the familiar R-environment; there is no need to learn about job scripts. In addition, it allows simple parallelization of independent tasks.

How to use batchtools

First of all, you need to load a recent R-module on one of the login nodes (batchtools should be available in all R-installations on the cluster, but using a recent one is recommended):

$ module load hpc-env/8.3
$ module load R

Next you start up R (still on the login node), so the following commands are all R-commands (the R-prompt is omitted here for easy cut'n'paste). The package is loaded with:

library(batchtools)

We also need to create a directory in which we can store a so-called registry (that we will create in the next step). We can do this with the following two commands:

myRegistryDir <- paste0(Sys.getenv("WORK"), "/R/Registries")
system(paste("mkdir -pv", myRegistryDir))

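Equivalently, the directory can be created with R's built-in dir.create, which avoids shelling out to mkdir:

```r
# Create the registry directory (and any missing parent directories)
# directly from R; this is equivalent to "mkdir -p".
myRegistryDir <- paste0(Sys.getenv("WORK"), "/R/Registries")
dir.create(myRegistryDir, recursive = TRUE, showWarnings = FALSE)
```

Both variants are idempotent: rerunning them when the directory already exists does no harm.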
which will create the directory $WORK/R/Registries. Using $WORK is recommended, but you can of course name the directory anything you want. Note that you should not delete any subdirectories or files in this directory, otherwise you may lose the results of completed jobs. You can also add the first line, starting with myRegistryDir, to your $HOME/.Rprofile-file so that the variable myRegistryDir is always set.

Next, we create a registry to contain the tasks that are later submitted as individual jobs to the cluster (you can of course modify the directory in td to your needs):

td  <- tempfile(pattern="test", tmpdir=paste0(Sys.getenv("WORK"),"/R"))
reg = makeRegistry(file.dir = td, seed = 1)

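As mentioned above, the registry directory variable can be made available in every R session by adding it to your $HOME/.Rprofile. A minimal sketch (adjust the path if you chose a different directory):

```r
# $HOME/.Rprofile -- sourced at the start of every R session,
# so myRegistryDir is always defined.
myRegistryDir <- paste0(Sys.getenv("WORK"), "/R/Registries")
```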
Now we define a function for the task we want to solve, in our case the approximation of Pi. The function draws n points uniformly from the unit square; the fraction of points that falls inside the quarter circle of radius 1 approximates Pi/4, so multiplying by 4 yields the estimate:

piApprox = function(n) {
  nums = matrix(runif(2 * n), ncol = 2)
  d = sqrt(nums[, 1]^2 + nums[, 2]^2)
  4 * mean(d <= 1)
}
set.seed(42)
piApprox(1000)

In the last two lines above, we test the function with a fixed seed and 2 × 1000 random numbers. The output should be close to 3.14. Next, we create a list of ten jobs, each of which will evaluate the function with n=1e5:

batchMap(fun = piApprox, n = rep(1e5, 10))

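For comparison, the same ten evaluations could be run sequentially on a single machine with sapply; batchMap performs the same mapping but turns each function call into a separate cluster job:

```r
# Sequential, single-node equivalent of the batchMap() call above.
piApprox <- function(n) {
  nums <- matrix(runif(2 * n), ncol = 2)
  d <- sqrt(nums[, 1]^2 + nums[, 2]^2)
  4 * mean(d <= 1)
}

set.seed(1)
estimates <- sapply(rep(1e5, 10), piApprox)
mean(estimates)  # close to pi
```

With batchtools, the ten calls run as independent jobs instead, so they can execute in parallel on different nodes.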
Once the jobs have been created, we can submit them:

submitJobs(resources = list(walltime = 3600, memory = 1024))

Setting the resources is optional; if omitted, defaults will be used. In the example, the walltime is set to 3600 seconds (1 hour) and the memory to 1024 MB. batchtools now creates the job scripts from a template and submits them to the cluster. We can check the status and wait for the jobs to complete:

getStatus()
waitForJobs()

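The template from which the job scripts are generated is selected via the batchtools cluster configuration. A minimal sketch of a $HOME/.batchtools.conf.R for Slurm might look like the following (the template file name slurm.tmpl is an assumption here; on the cluster a suitable template is typically already provided, so this step is usually not needed):

```r
# $HOME/.batchtools.conf.R -- read by makeRegistry().
# "slurm.tmpl" is a placeholder for the site-specific job template.
cluster.functions = makeClusterFunctionsSlurm(template = "slurm.tmpl")
```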
Finally, when the jobs are finished, you can load the results. In our case, we calculate the mean of the ten approximations:

mean(sapply(1:10, loadResult))
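Alternatively, batchtools provides reduceResults() to combine the results while loading them, instead of collecting them one by one. Assuming the registry from above is still loaded, the same mean can be obtained with:

```r
# Sum the ten approximations as they are loaded, then average.
reduceResults(function(aggr, res) aggr + res, init = 0) / 10
```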