Difference between revisions of "R 2016"


Latest revision as of 10:07, 29 November 2022

Introduction

R is a programming language and environment commonly used in statistical computing, data analytics and scientific research.

Installed version

The currently installed versions of R are

on environment hpc-env/8.3:

R/4.0.2-foss-2019b
R/4.1.0-foss-2019b-2021.06
R-core/4.1.0-foss-2019b
R-core/4.2.2-intel-2019b

on environment hpc-env/6.4:

R/3.4.4-intel-2018a
R/3.5.2-intel-2018a
R/3.6.1-intel-2018a

on environment hpc-uniol-env:

R/3.3.1

To find all installed versions you can use the command

$ module -r spider ^R/

and to show the available versions in the currently loaded environment use

$ module -r available ^R/

Note that starting with version 4.1.0 we have changed how we provide the almost 1,000 R packages that we install by default. With any of the modules listed above you will get a similar experience when using R. However, modules with an extra suffix of the form yyyy.mm indicate the year and month of the installation, which allows us to provide package updates without touching the original installation.

Additional installed packages

The R release contains a lot of additional packages. After loading and starting R ("module load R" and simply "R" on the command line), you can generate a list of all of them with the following commands:

ip <- as.data.frame(installed.packages()[,c(1,3:4)])
rownames(ip) <- NULL
ip <- ip[is.na(ip$Priority),1:2,drop=FALSE]
print(ip, row.names=FALSE)

You will receive a list of every package and its related version. It should look like this:

       Package     Version
           abc         2.1
      abc.data         1.0
         abind       1.4-3
       acepack     1.3-3.3
        adabag         4.1

Additional installed R-Modules

When new R packages are requested, the original R module (e.g. R/4.0.2-foss-2019b) is usually not modified. Instead, the HPC team installs a new loadable module which contains the requested package as well as all necessary dependency packages. The advantage of this approach is that the original R module stays intact (no change of the individual package versions, etc.). The disadvantage, however, is that packages installed this way are not visible to users who load only the R module.

So, if you do not find the package you need, it is often worthwhile to search for R packages that are already installed as separate modules, either by searching specifically for the required package name (ml av <desired_package>) or by searching for the string R-.
The following command can be used to find all modules that are based on R:

ml av R-

Among others, this will print the module R-bundle-Bioconductor/3.12-foss-2019b-R-4.0.2, which itself provides a variety of other R packages.

Installing your own packages

If you are missing an R package, you can contact Scientific Computing or, alternatively, install the package in your own HOME directory. To do so, you should first create a directory on the cluster, e.g. with

$ mkdir -p $HOME/R/lib

This creates a directory 'R' with the subdirectory 'lib' in your HOME folder. Next, R needs to be told about this directory through a line in the file .Renviron in your HOME folder. To make this as easy as possible, you can just copy the following line into your console:

echo -en "R_LIBS_USER=\"~/R/lib\"\n" >> $HOME/.Renviron

Afterwards, the file .Renviron should include the following line (if the file did not exist before, it will only contain this line):

$ cat $HOME/.Renviron
...
R_LIBS_USER="~/R/lib"
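
Because the echo command above appends to the file, running it a second time leaves two identical lines in .Renviron. The following is a minimal sketch of a repeatable variant; the function name add_r_libs_user is an illustrative assumption and not part of R or the cluster setup.

```shell
# Hypothetical helper: append the R_LIBS_USER line to the given file
# only if the file does not already define R_LIBS_USER.
add_r_libs_user() {
    renviron="$1"
    touch "$renviron"
    if ! grep -q '^R_LIBS_USER=' "$renviron"; then
        echo 'R_LIBS_USER="~/R/lib"' >> "$renviron"
    fi
}

# Create the library directory and register it in the default file.
mkdir -p "$HOME/R/lib"
add_r_libs_user "$HOME/.Renviron"
```

If you already set R_LIBS_USER to a different location, the sketch leaves that line untouched.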

You can choose a different location for installing your R libraries if you wish (by setting R_LIBS_USER in .Renviron differently). There are also alternative mirrors; you can set your preferred one in .Rprofile.

Once this is done, you can start R on the login node and begin installing packages:

$ R
> install.packages("lme4", repos = "https://ftp.gwdg.de/pub/misc/cran", lib = "~/R/lib")
> library(lme4)
> library(car)

In this example, the package lme4 will be downloaded from the GWDG CRAN mirror (if you omit the option repos, you will be able to select a mirror from a list) and installed in the directory given by the lib option (if you omit that option, you will be asked if you want to install in a personal folder). Please note that R will not check if a package is already installed and will always reinstall a package by overwriting the previous install (you can program a logic for that into your R programs). You may want to separate package installation from the execution of jobs.
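
One way to avoid repeated reinstalls is to check the personal library before R is asked to install anything. The sketch below relies on the fact that an installed package is a directory of the same name containing a DESCRIPTION file. The function name r_pkg_in_lib and the message texts are illustrative assumptions, and the check only looks at the personal library, not at the global installation.

```shell
# Hypothetical helper: succeeds if package $1 is present in library directory $2.
r_pkg_in_lib() {
    [ -f "$2/$1/DESCRIPTION" ]
}

lib="$HOME/R/lib"
if r_pkg_in_lib lme4 "$lib"; then
    echo "lme4 is already installed in $lib"
else
    echo "lme4 not found in $lib, install it once before submitting jobs"
fi
```

Running such a check on the login node keeps the installation step out of the job scripts themselves.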

The package lme4 is already installed in the global R installation (version 1.1-12), whereas the installation above will install a newer version (1.1-13 or newer). When you load the package with library(lme4), the installation in your $HOME folder will be used. You can verify this with the R command

> sessionInfo()
  ... lme4_1.1-13 ...

The next call library(car) will load the globally installed package car. So you do not need to install every package in your $HOME, only those that are missing or for which you require an updated version.

If you encounter problems when installing an R package in your $HOME, for example because a non-R dependency is missing, then please contact Scientific Computing. Also note that some packages may require using the installer from the BioConductor package.

Using multiple R versions

If you are planning to use multiple R versions or want to migrate from one version to the next, you may have to reinstall the packages in your own personal library. For example, going from R/3.5.2 to R/3.6.1 or R/4.0.2 probably requires reinstalling all packages (an update to R/3.5.3, i.e. a bugfix release, should not be a problem). If required, you can also keep packages for multiple versions, in which case you should create directories of the form $HOME/R/x.y/lib, e.g.

$ mkdir -pv $HOME/R/4.0/lib

and then set

R_LIBS_USER="~/R/%v/lib"

in $HOME/.Renviron. The %v will be expanded to the version (x.y, without the patch level) of the R that you are using, which allows you to have multiple lib directories.
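
The directory name x.y can be derived from the name of the R module you intend to load. This is only a sketch: the variable names are illustrative assumptions, and the module name is one of the versions listed above.

```shell
# Example module name taken from the list of installed versions.
r_module="R/4.0.2-foss-2019b"

full="${r_module#R/}"     # strip the leading "R/"      -> 4.0.2-foss-2019b
full="${full%%-*}"        # strip the toolchain suffix  -> 4.0.2
minor="${full%.*}"        # drop the patch level        -> 4.0

# Create the matching personal library directory and show its path.
mkdir -p "$HOME/R/$minor/lib"
echo "$HOME/R/$minor/lib"
```

The resulting directory is the one that the R_LIBS_USER setting with %v points to for that R version.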

Using R on the HPC cluster

If you want to use R on the HPC cluster, you will have to load its module. You can do that by using the command

module load R

Since R is installed in multiple environments and in different versions, you may have to change the environment and specify the version:

module load hpc-env/6.4
module load R/3.5.2-intel-2018a

Basic Job Script for R

Suppose you want to create 100 random numbers and calculate their mean and standard deviation. In R the commands for that would be:

x <- runif(100,0.0,1.0)
mean(x)
sd(x)

If you want to do the calculation on the cluster, store the above commands in a file named e.g. Rtest.R. Then create a job script Rtest.sh with the content:

#!/bin/bash

#SBATCH --job-name=Rtest
#SBATCH --partition=carl.p
#SBATCH --time=24:00:0
#SBATCH --mem=5000M
   
# load modules
module load R

# run R 
Rscript ./Rtest.R

and submit a job with the command:

sbatch Rtest.sh

The output of R will appear in a file called slurm-<jobid>.out once the job has been completed. Instead of the command

Rscript ./Rtest.R

you can also use

R CMD BATCH ./Rtest.R

in which case you will find the (slightly different) output in a file called Rtest.Rout. Try out the different commands (and maybe also additional options that can be passed) to see which serves your needs best.

Using batchtools

In some situations you may need to run the same R program multiple times. This can be achieved with the approach described below using foreach and doMPI. Another option is the R package batchtools, as described on this page.

Usage of R and MPI

For parallelization the package doMPI is installed. To launch a parallel R script inside a SLURM script, please use the command line

  mpirun R --slave -f SCRIPTNAME SCRIPT_CMDLINE_OPTIONS

to enable SLURM to control all processes of your script. Please do not use the batch starting sequence R CMD BATCH!

The number of MPI tasks is requested in the SLURM submission script with

 #SBATCH --ntasks=NUMBER_OF_TASKS

Note for doMPI:

  • Before you start R with the mpirun command, you have to unset the environment variable R_PROFILE in your SLURM script. Otherwise the MPI processes will not be spawned. Please add the following line to your job script:
  unset R_PROFILE
  • Please use mpi.quit() at the end of your script. Otherwise it will not end.
  • Here is a small example R script for doMPI (each row of b holds three random values close to the MPI rank that computed it):
#!/usr/bin/env Rscript
#
# file name: test_dompi.R
#

library("doMPI")

# doMPI start
cl <- startMPIcluster()
registerDoMPI(cl)

# parallel foreach due to %dopar% using the MPI cluster
# note that one MPI process is the master (rank 0) and
# distributes the work (iterations) among the other 
# processes (ranks 1 to (mpi.comm.size(0)-1))
# rnorm returns a vector with three elements, the
# option .combine="rbind" makes a table with 10 rows 
b<-foreach(i=1:10, .combine="rbind") %dopar% {
   my_rank<-as.integer(mpi.comm.rank(0))
   rnorm(3, my_rank, 0.01)  # return three random values near my_rank
}
closeCluster(cl)
print(b)

mpi.quit()

and the corresponding SLURM script:

#!/bin/bash

#SBATCH --job-name=test_dompi
#SBATCH --time=24:00:0
#SBATCH --mem-per-cpu=2G
#SBATCH --output=dompi-test.%j.out        
#SBATCH --error=dompi-test.%j.err 
#SBATCH --ntasks=4
   
# load modules
module load hpc-env/8.3
module load R

# unset the environment variable which is needed for Rmpi
# but causes problems with doMPI
unset R_PROFILE
  
# run R in parallel (mpirun knows the number of tasks requested)
mpirun R --slave -f ./test_dompi.R

Usage of NetCDF and R

A package for NetCDF has been installed together with R. In order to use it, please add the command

module load netCDF

to your job script before starting R. Your R-script should include a line

library(ncdf)

to load the NetCDF library. Please refer to the documentation of NetCDF and R for more information.


Documentation

You can look up anything about R on their