Scheduler Examples

Here we show some example job scripts that allow for various kinds of parallelization, jobs that use fewer cores than available on a node, GPU jobs, low-priority condo jobs, and long-running FCA jobs.

1. Threaded/OpenMP job script

#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Account:
#SBATCH --account=account_name
#
# Partition:
#SBATCH --partition=partition_name
#
# Request one node:
#SBATCH --nodes=1
#
# Specify one task:
#SBATCH --ntasks-per-node=1
#
# Number of processors for single task needed for use case (example):
#SBATCH --cpus-per-task=4
#
# Wall clock limit:
#SBATCH --time=00:00:30
#
## Command(s) to run (example):
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./a.out

Here --cpus-per-task should be no more than the number of cores on a Savio node in the partition you request. You may want to experiment with the number of threads for your job to determine the optimal number, as computational speed does not always increase with more threads. Note that if --cpus-per-task is fewer than the number of cores on a node, your job will not make full use of the node. Strictly speaking, the --nodes and --ntasks-per-node arguments are optional here because they both default to 1.
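
If your program is a C/OpenMP code, you would typically compile it with OpenMP enabled before submitting the job, along these lines (a minimal sketch; the source file name code.c is just a placeholder):

module load gcc
gcc -fopenmp -O2 -o a.out code.c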

2. Simple multi-core job script (multiple processes on one node)

#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Account:
#SBATCH --account=account_name
#
# Partition:
#SBATCH --partition=partition_name
#
# Request one node:
#SBATCH --nodes=1
#
# Specify number of tasks for use case (example):
#SBATCH --ntasks-per-node=20
#
# Processors per task:
#SBATCH --cpus-per-task=1
#
# Wall clock limit:
#SBATCH --time=00:00:30
#
## Command(s) to run (example):
./a.out

This job script would be appropriate for multi-core R, Python, or MATLAB jobs. In the commands that launch your code and/or within your code itself, you can reference the SLURM_NTASKS environment variable to dynamically identify how many tasks (i.e., processing units) are available to you.
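
For example, you might pass the task count to your program on the command line (the script name myscript.py and its --workers flag are hypothetical; adapt this to however your code accepts a worker or core count):

module load python
python myscript.py --workers $SLURM_NTASKS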

Here the number of CPUs used by your code at any given time should be no more than the number of cores on a Savio node.

For a way to run many individual jobs on one or more nodes (more jobs than cores), see this information on using GNU parallel.

3. MPI job script

#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Account:
#SBATCH --account=account_name
#
# Partition:
#SBATCH --partition=partition_name
#
# Number of MPI tasks needed for use case (example):
#SBATCH --ntasks=40
#
# Processors per task:
#SBATCH --cpus-per-task=1
#
# Wall clock limit:
#SBATCH --time=00:00:30
#
## Command(s) to run (example):
module load gcc openmpi
mpirun ./a.out

As noted in the introduction, for partitions in Savio2 and Savio3 scheduled on a per-node basis, you probably want to set the number of tasks to be a multiple of the number of cores per node in that partition, thereby making use of all the cores on the node(s) to which your job is assigned.

This example assumes that each task will use a single core; otherwise there could be resource contention amongst the tasks assigned to a node.
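
If you want to check which nodes SLURM assigned to satisfy the task count (an optional diagnostic), the node list is exposed as an environment variable that you can print from within the job script:

echo "Allocated nodes: $SLURM_JOB_NODELIST"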

Optimizing MPI on savio4_htc using UCX

savio4 nodes use an HDR InfiniBand interconnect, for which optimal MPI performance will generally be obtained by using UCX. To do so, make sure to load the ucx module and an openmpi module that uses UCX. You'll need to do this both when building MPI-based software and when running it. At the moment, you'll need to use these (non-default) modules:

module load gcc/11.3.0 ucx/1.14.0 openmpi/5.0.0-ucx
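
For example, to build and then run an MPI program against these modules (a minimal sketch; hello.c is a placeholder source file):

module load gcc/11.3.0 ucx/1.14.0 openmpi/5.0.0-ucx
mpicc -o hello hello.c
mpirun ./hello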

4. Alternative MPI job script

#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Account:
#SBATCH --account=account_name
#
# Partition:
#SBATCH --partition=partition_name
#
# Number of nodes needed for use case:
#SBATCH --nodes=2
#
# Tasks per node based on number of cores per node (example):
#SBATCH --ntasks-per-node=20
#
# Processors per task:
#SBATCH --cpus-per-task=1
#
# Wall clock limit:
#SBATCH --time=00:00:30
#
## Command(s) to run (example):
module load gcc openmpi
mpirun ./a.out

This alternative explicitly specifies the number of nodes, tasks per node, and CPUs per task rather than simply specifying the number of tasks and having SLURM determine the resources needed. As before, one would generally want the number of tasks per node to equal the number of cores per node, assuming only one CPU per task.
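
To confirm how the tasks were distributed across the allocated nodes, you could add a quick diagnostic like the following to the job script (purely optional):

srun hostname | sort | uniq -c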

5. Hybrid OpenMP+MPI job script

#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Account:
#SBATCH --account=account_name
#
# Partition:
#SBATCH --partition=partition_name
#
# Number of nodes needed for use case (example):
#SBATCH --nodes=2
#
# Tasks per node based on --cpus-per-task below and number of cores
# per node (example):
#SBATCH --ntasks-per-node=4
#
# Processors per task needed for use case (example):
#SBATCH --cpus-per-task=5
#
# Wall clock limit:
#SBATCH --time=00:00:30
#
## Command(s) to run (example):
module load gcc openmpi
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpirun ./a.out

Here we request a total of 8 (= 2 x 4) MPI tasks, with 5 cores per task. When using partitions scheduled on a per-node basis, one would generally want to use all the cores on each node (i.e., --ntasks-per-node multiplied by --cpus-per-task should equal the number of cores on a node).
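
As a quick sanity check, you can have the job script print the geometry SLURM actually assigned, using standard SLURM environment variables (an optional diagnostic):

echo "nodes: $SLURM_JOB_NUM_NODES, tasks: $SLURM_NTASKS, cpus per task: $SLURM_CPUS_PER_TASK"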

6. Jobs scheduled on a per-core basis (jobs that use fewer cores than available on a node)

#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Account:
#SBATCH --account=account_name
#
# Partition:
#SBATCH --partition=savio3_htc
#
# Number of tasks needed for use case (example):
#SBATCH --ntasks=4
#
# Processors per task:
#SBATCH --cpus-per-task=1
#
# Wall clock limit:
#SBATCH --time=00:00:30
#
## Command(s) to run (example):
./a.out

In the HTC and GPU partitions you are only charged for the actual number of cores used, so the notion of making best use of resources by saturating a node is not relevant.

7. GPU job script

#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Account:
#SBATCH --account=account_name
#
# Partition:
#SBATCH --partition=savio2_gpu
#
# Number of nodes:
#SBATCH --nodes=1
#
# Number of tasks (one for each GPU desired for use case) (example):
#SBATCH --ntasks=1
#
# Processors per task:
# Always at least twice the number of GPUs (savio2_gpu and GTX2080TI in savio3_gpu)
# Four times the number for TITAN and V100 in savio3_gpu and A5000 in savio4_gpu
# Eight times the number for A40 in savio3_gpu
#SBATCH --cpus-per-task=2
#
# Number of GPUs, this can be in the format of "gpu:[1-4]", or "gpu:K80:[1-4]" with the type included
#SBATCH --gres=gpu:1
#
# Wall clock limit:
#SBATCH --time=00:00:30
#
## Command(s) to run (example):
./a.out

Requesting a GPU type in savio3_gpu and savio4_gpu

savio3_gpu regular condo jobs (those not using the low priority queue) should request the specific type of GPU bought for the condo as detailed here.

savio3_gpu and savio4_gpu regular FCA jobs (those not using the low-priority queue) should request a specific GPU type: either GTX2080TI, A40, or V100 for savio3_gpu (e.g., --gres=gpu:GTX2080TI:1) and A5000 for savio4_gpu. Also, if requesting an A40 or V100 GPU, note that you need to explicitly specify the QoS, via -q a40_gpu3_normal or -q v100_gpu3_normal, respectively.
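
For example, an FCA job requesting a single A40 GPU on savio3_gpu might include directives along these lines (illustrative; the eight CPUs follow from the A40 ratio noted in the script comments above):

#SBATCH --partition=savio3_gpu
#SBATCH --qos=a40_gpu3_normal
#SBATCH --gres=gpu:A40:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8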

To help the job scheduler effectively manage the use of GPUs, your job submission script must request multiple CPU cores for each GPU you use. Jobs submitted that do not request sufficient CPUs for every GPU will be rejected by the scheduler. Please see the table here for the ratio of CPU cores to GPUs.

To see how to request the right number of CPUs for each GPU, note that the total number of CPUs requested is the product of two settings: the number of tasks (--ntasks=) and the number of CPUs per task (--cpus-per-task=).

For instance, in the above example, one GPU was requested via --gres=gpu:1, and the required total of two CPUs was thus requested via the combination of --ntasks=1 and --cpus-per-task=2. Similarly, if your job script requests four GPUs via --gres=gpu:4 and uses --ntasks=8, it should also include --cpus-per-task=1 in order to request the required total of eight CPUs.
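
As another illustration, a two-GPU request on savio2_gpu (which requires two CPUs per GPU) could be written as follows; other combinations of --ntasks and --cpus-per-task whose product is four would also satisfy the requirement:

#SBATCH --gres=gpu:2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2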

Note that in the --gres=gpu:n specification, n must be between 1 and the number of GPUs on a single node (which is provided here for the various GPU types). This is because the --gres specification applies per node: it tells SLURM how many GPUs to allocate on each node requested. If you wish to use more GPUs than are available on a single node, set --gres=gpu:n to the number of GPUs to use on each node and request multiple nodes. For example, if you wish to use eight savio2_gpu GPUs (there are four GPUs per node), your job script should include options to the effect of --gres=gpu:4, --nodes=2, --ntasks=8, and --cpus-per-task=2.
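
Expressed as sbatch directives, that eight-GPU request would look like the following (restating the options just described; other directives such as the account and wall clock limit are omitted here):

#SBATCH --partition=savio2_gpu
#SBATCH --nodes=2
#SBATCH --gres=gpu:4
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=2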

8. Long-running jobs (up to 10 days and 4 cores per job)

#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Account:
#SBATCH --account=account_name
#
# QoS: must be savio_long for jobs > 3 days
#SBATCH --qos=savio_long
#
# Partition:
#SBATCH --partition=savio2_htc
#
# Number of tasks needed for use case (example):
#SBATCH --ntasks=2
#
# Processors per task:
#SBATCH --cpus-per-task=1
#
# Wall clock limit (7 days in this case):
#SBATCH --time=7-00:00:00
#
## Command(s) to run (example):
./a.out

A given job in the long queue can use no more than 4 cores and run for no more than 10 days. Collectively, across the entire Savio cluster, at most 24 cores are available for long-running jobs, so your job may sit in the queue for a while before it starts.
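
To get a sense of how busy the long queue is before submitting (an optional check), you can list the jobs currently running under the savio_long QoS:

squeue --qos=savio_long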

In the savio2_htc pool you are only charged for the actual number of cores used, so the notion of making best use of resources by saturating a node is not relevant.

9. Low-priority jobs

Low-priority jobs can only be run using condo accounts. If no QoS is specified, jobs run in a condo account will use the default QoS (generally savio_normal). To use the low-priority queue, you need to specify the low-priority QoS, as follows.

#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Account:
#SBATCH --account=account_name
#
# Partition:
#SBATCH --partition=partition_name
#
# Quality of Service:
#SBATCH --qos=savio_lowprio
#
# Wall clock limit:
#SBATCH --time=00:00:30
#
## Command(s) to run:
echo "hello world"

You may wish to add #SBATCH --requeue as well so that low-priority jobs that are preempted are automatically resubmitted.
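
For example, the QoS and requeue options together would look like this (a minimal sketch):

#SBATCH --qos=savio_lowprio
#SBATCH --requeue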