Using R on Savio¶
We provide R and a variety of commonly-used R packages via the Savio module system.
Caution
Unfortunately, R is configured by default to use slow versions of the BLAS and LAPACK linear algebra libraries (unlike R under the previous Savio SL7 operating system, which used the Intel MKL library for fast and parallel linear algebra operations). As of July 2024, we are working to see if we can configure R to use Intel MKL. See below for more details about parallel linear algebra. Please let us know if this is adversely affecting your work.
Loading R and accessing R packages¶
To access R from the terminal or in a Slurm job, you need to load the R module:
module load r
Specific versions can be loaded explicitly, e.g., `module load r/4.4.0`. You can see the available versions with `module avail`. (At the moment only R 4.4.0 is available.)
Many standard R packages are already provided on the system, such as `Rcpp`, `ggplot2`, `future`, and `dplyr` (formerly these were in the `r-packages` module, but they are now directly available via the `r` module itself). The available packages can be found by looking in the `/global/software/rocky-8.x86_64/manual/modules/langs/r-packages/r4.4.0` directory.
To load a standard set of packages for spatial data:
module load r-spatial
Installing additional R packages¶
You can also install additional R packages, such as those available on CRAN, that are not already available on the system. You'll need to install them into your home directory or your scratch directory.
Tip
Use the system-provided packages when possible. Many R packages have dependencies on R packages already provided on the system, such as `Rcpp`, `ggplot2`, and `dplyr`. If you see that packages available on the system are being installed locally in your own directory, please feel free to get in touch with us to diagnose the problem. For packages related to spatial analysis, it's good practice to stop the installation, load the `r-spatial` module, and then install the package of interest. This avoids installing a second copy of the dependency.
In the following example, we'll install the `fields` package for spatial statistics, which needs to compile some Fortran code as well as pull in some dependency packages. You can either set the directory in which to put the package(s) via the `lib` argument or follow the prompts provided by R to accept the default location (generally `~/R/x86_64-pc-linux-gnu-library/4.4`). (If you've already installed packages for this version of R, the default location should already exist.) Here we'll use the default:
install.packages('fields')
Note that if you install packages somewhere other than the default location, e.g., via:
install.packages('fields', lib = '/global/scratch/users/yourusername/R')
you will probably need to set the environment variable `R_LIBS_USER` to include the non-default location so that R can find the packages. You can set `R_LIBS_USER` in your `.bashrc` file or, perhaps better, in your `~/.Renviron` file. You can use the `.libPaths()` function in R to see where it looks for installed packages, and the `searchpaths()` function to see where the packages loaded in your R session are installed.
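For example, if you installed packages into a scratch directory as above, a line like the following in `~/.Renviron` would let R find them in addition to the default user library (the scratch path here is illustrative; substitute your own):

```
R_LIBS_USER=/global/scratch/users/yourusername/R:~/R/x86_64-pc-linux-gnu-library/4.4
```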
Tip
In some cases an R package will require an external non-R package as a dependency. If it's available on the system, you may need to load in the relevant Savio module via `module load`. If it's not available on the system, you may be able to install the dependency yourself from its source code, or you can ask us for help.
Running R interactively¶
Using `srun` to run R on the command line¶
To use R interactively on Savio's compute nodes, you can use `srun` to submit an interactive job.
Once you're working on a compute node, you can then load the R module and start R.
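For example, an interactive session might look like this (the partition, account, and resource values here are illustrative; substitute your own):

```shell
# Request an interactive session on a compute node:
srun --partition=savio3_htc --account=fc_yourproject --ntasks=1 \
     --cpus-per-task=2 --time=01:00:00 --pty bash
# Once on the compute node:
module load r
R
```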
Using RStudio for interactive use via Open OnDemand¶
We now provide access to RStudio via Open OnDemand. This allows you to interact with RStudio from your web browser, but with RStudio running on Savio.
RStudio sessions can be run on compute nodes in any of the Savio partitions. In many cases you may want to use the `savio3_htc` or `savio4_htc` partitions so that you can use one or a few cores and not be charged for use of a full node.
Parallel processing in R¶
R provides several ways of parallelizing your computations. We describe them briefly here and outline their use below:
- Threaded linear algebra. In the future, R on Savio may be set up to use Intel's MKL package for linear algebra. MKL can automatically use multiple cores on a single machine, as described below.
- Multi-process parallelization on a single node. You can use functions provided in R packages such as `future`, `foreach`, and `parallel` to run independent calculations across multiple cores on a single node.
- Multiple nodes. You can use functions provided in R packages such as `future`, `foreach`, and `parallel` to run independent calculations across multiple cores on multiple nodes.
1. Threaded linear algebra¶
In the future you may be able to make use of threaded (i.e., parallelized) linear algebra simply by running R code that uses R's linear algebra functions.
Warning
Since threaded linear algebra only works on a single node, you shouldn't request multiple nodes and, to be safe, should avoid using `--ntasks` if using the HTC partitions (`savio3_htc` or `savio4_htc`), as it would be possible to end up with multiple cores spread across multiple nodes.
Tip
To verify that R is using MKL, you can run `sessionInfo()` in R. You should see a line like this:
BLAS/LAPACK: /global/software/sl-7.x86_64/modules/langs/intel/2016.4.072/compilers_and_libraries_2016.4.258/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
pointing to an MKL shared object file.
Here's an example Slurm job script for a job that uses threaded linear algebra. Basically all you need to do is specify the number of threads you want to use as the environment variable `MKL_NUM_THREADS`. (In fact, by default the linear algebra should use as many threads as possible without you even specifying `MKL_NUM_THREADS`.) Then linear algebra operations done in R will use that many cores automatically.
#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Partition:
#SBATCH --partition=partition_name
#
# Request one node:
#SBATCH --nodes=1
#
# Specify one task:
#SBATCH --ntasks-per-node=1
#
# Number of processors for threading:
#SBATCH --cpus-per-task=32
#
# Wall clock limit:
#SBATCH --time=00:00:30
#
## Command(s) to run (example):
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
module load r
R CMD BATCH --no-save job.R job.Rout
Note that here we make use of all the cores on the node (32 here, assuming use of the `savio3` partition, which contains 32-core nodes) for the threaded linear algebra. In some cases using too many cores might actually decrease performance, so it may be worth experimenting with your code to determine the best number of cores. You can also simply set `MKL_NUM_THREADS` to a fixed number.
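As a sketch of what `job.R` might contain, the following (with an illustrative matrix size) exercises threaded linear algebra; the crossproduct will use multiple threads automatically if R is linked to a threaded BLAS such as MKL:

```r
n <- 2000
x <- matrix(rnorm(n * n), nrow = n)
# crossprod(x) computes t(x) %*% x via BLAS; with a threaded BLAS this
# runs on multiple cores automatically. Compare timings with
# MKL_NUM_THREADS=1 vs. larger values to see the effect of threading.
print(system.time(z <- crossprod(x)))
```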
If you want to use a small number of threads and not have your job be charged for unused cores, you may want to run your job on one of Savio's High Throughput Computing (HTC) nodes (e.g., by selecting the `savio3_htc` partition). Here is an example job script for this kind of parallelization on an HTC node, using two cores:
#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Partition:
#SBATCH --partition=savio3_htc
#
# Specify one task:
#SBATCH --ntasks=1
#
# Number of processors for threading:
#SBATCH --cpus-per-task=2
#
# Wall clock limit:
#SBATCH --time=00:00:30
#
## Command(s) to run (example):
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
module load r
R CMD BATCH --no-save job.R job.Rout
2. Multi-process parallelization on a single node¶
Using the future package on a single node¶
The `future` package provides an elegant interface to run parallel computation across a variety of hardware resources, including a single node or multiple nodes.
As discussed in this tutorial and this vignette, the future package allows one to write one's computational code without hard-coding whether or how parallelization would be done. Instead one writes the code in a generic way and at the top of one's code sets the plan for how the parallel computation should be done given the computational resources available. Simply changing the plan changes how parallelization is done for any given run of the code.
More concisely, the key ideas are:
- Separate what to parallelize from how and where the parallelization is actually carried out.
- Run the same code on different computational resources (without touching the actual code that does the computation).
Here we'll discuss its use on a single node. In this case one can use either the `multisession` or the `multicore` plan.
Tip
The `multicore` backend forks the main R process, creating R worker processes with the same state as the original process. All objects in the workers point back to the original objects in the main process, so no copying is involved and no additional memory is used. However, if a worker process modifies an object, a copy of that object is made.
Here's the basic syntax for using the future package with a parallel sapply (this also uses the `future.apply` package):
library(future)
plan(multicore)
future.apply::future_sapply(1:100, function(i) return(i))
`plan(multicore)` will use `parallelly::availableCores()` to determine the number of workers to start, based on the number of cores requested in your Slurm job submission.
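You can check this yourself before setting the plan; the `parallelly` package is installed as a dependency of `future`. Inside a Slurm job, `availableCores()` consults Slurm environment variables (e.g., `SLURM_CPUS_PER_TASK`) rather than simply counting the machine's physical cores:

```r
library(parallelly)
availableCores()  # the number of cores your Slurm job was allocated
```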
One can also use the future package as the backend for `foreach` by using `doFuture::registerDoFuture()`.
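A minimal sketch of this approach (assuming the `doFuture`, `foreach`, and `future` packages are installed):

```r
library(doFuture)
library(foreach)
registerDoFuture()                # have %dopar% dispatch via the future framework
plan(multisession, workers = 2)   # or plan(multicore) in a batch job on a compute node

result <- foreach(i = 1:4, .combine = c) %dopar% sqrt(i)
```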
Using other R parallelization tools on a single node¶
Other functions in R that provide parallelization across multiple cores on a node include `parLapply`, `mclapply`, and `foreach` (using the `doParallel` backend).
Tip
`mclapply` uses forking to start up the R workers. This saves memory and time because the R objects on the workers point back to the objects in the original R process, unless those objects are modified by the workers. You can also have `parLapply` and `foreach` use forking by using the (non-default) `parallel::makeForkCluster` to start the workers, as discussed in Section 3.1.3 of this tutorial.
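Here's a brief sketch of the fork-cluster approach (forking is available on Linux, including Savio's nodes, but not on Windows):

```r
library(parallel)
cl <- makeForkCluster(2)   # forked workers share the parent's objects (copy-on-write)
result <- parSapply(cl, 1:4, function(i) i^2)
stopCluster(cl)
result  # 1 4 9 16
```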
Here are the setup steps in R for using the `foreach` function:
library(doParallel)
ncores <- as.numeric(Sys.getenv('SLURM_CPUS_ON_NODE'))
registerDoParallel(ncores)
result <- foreach(i = 1:nIts) %dopar% {
# body of loop
}
You can also use the `parLapply` and `mclapply` functions, available in the `parallel` package. With `parLapply` (or `parSapply`), you first create a cluster of workers:
library(parallel)
ncores <- as.numeric(Sys.getenv('SLURM_CPUS_ON_NODE'))
cl <- makeCluster(ncores)
result <- parSapply(cl, X, FUN)
stopCluster(cl)
See `help(clusterApply)` for more information.
Using `mclapply` would look like this:
ncores <- as.numeric(Sys.getenv('SLURM_CPUS_ON_NODE'))
result <- mclapply(X, FUN, ..., mc.cores = ncores)
In some cases the R commands that set up parallelization may recognize the number of cores available on the machine automatically. In many cases, however, you will need to read an environment variable such as `SLURM_CPUS_ON_NODE` into R and pass it as an argument to the relevant R functions, as shown above.
3. Parallelization on multiple nodes¶
Danger
If in the future R on Savio is configured to use the MKL library for linear algebra, the MKL module must be loaded on all nodes on which R workers are running. Unfortunately, the only real way to achieve this is to add a line containing `module load r` to your `.bashrc` file, which will load all modules that R needs, including MKL. This is awkward in that you may not want the R module and related modules loaded in all your Savio sessions, so you may need to comment/uncomment the line in your `.bashrc` depending on whether you are using multi-node R parallelization at any given time.
Using the future package on multiple nodes¶
In your Slurm submission, make sure to request as many tasks (using `--ntasks` or `--ntasks-per-node`) as the number of R workers you want to use.
Then, when using the future package, use the `cluster` plan. `plan(cluster)` will use `parallelly::availableWorkers()` to determine the number of workers to start, based on the resources requested in your Slurm job submission.
plan(cluster)
Alternatively, you could specify the workers manually. Here we use `srun` (note this is being done within our original `sbatch` or `srun`) to run `hostname` once per Slurm task, returning the name of the node each task is assigned to.
workerNodes <- system('srun hostname', intern = TRUE)
plan(cluster, workers = workerNodes)
In either case, we can verify that the workers are running on the various nodes by checking the nodename of each of the workers:
future.apply::future_sapply(seq_len(nbrOfWorkers()), function(i) Sys.info()[["nodename"]])
Using other R parallelization tools on multiple nodes¶
You can run parallel apply statements and foreach across the cores on multiple nodes, provided you set things up so the workers can start on all the nodes.
In your Slurm submission, make sure to request as many tasks (using `--ntasks` or `--ntasks-per-node`) as the number of R workers you want to use. Then the key step in R is to give `makeCluster` the information about the nodes available. Here we use `srun` (note this is being done within our original `sbatch` or `srun`) to run `hostname` once per Slurm task, returning the name of the node each task is assigned to.
workerNodes <- system('srun hostname', intern = TRUE)
cl <- parallel::makeCluster(workerNodes)
Now use the cluster object, `cl`, in your call to `parLapply`, `registerDoSNOW`, or similar commands.
We recommend using `doSNOW` rather than `doMPI`, as avoiding the use of MPI can simplify things.
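A brief sketch of the `doSNOW` pattern (assuming the `doSNOW` and `foreach` packages are installed):

```r
library(doSNOW)
library(foreach)
# On Savio you would use the cluster created from the worker node names, i.e.,
# parallel::makeCluster(workerNodes); here we make a small local cluster
# just to illustrate the pattern.
cl <- parallel::makeCluster(2)
registerDoSNOW(cl)
result <- foreach(i = 1:20, .combine = c) %dopar% i^2
parallel::stopCluster(cl)
```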
Running R jobs on Savio's GPU nodes¶
Savio does not provide any R packages that take advantage of GPUs at the system level. However, there are a variety of R packages that allow you to make use of GPUs from within R, including many available on CRAN, as described in the GPU section of this Task View. You'll need to write, adapt, or use R code that has been written for GPU access based on these packages. To install such packages you'll generally need to load in the CUDA module via `module load cuda` on a GPU node.
To run R jobs on one or more GPUs, you'll need to request access to the GPU(s) by including the `--gres=gpu:x` flag to `sbatch` or `srun`, where `x` is the number of GPUs you need, following our example GPU job script.
Using non-ASCII (non-English) characters and UTF-8¶
If you need to be able to display characters from other languages, and more generally a broader array of characters, you can modify the "locale" used for handling characters by setting the `LC_CTYPE` shell environment variable before starting R, like this:
export LC_CTYPE=en_US.UTF-8
If you then use UTF-8 characters in R, they should display like this:
'Pe\u00f1a 3\u00f72'
# [1] "Peña 3÷2"