Submitting Jobs
Summary
Here we give details on job submission for various kinds of jobs in both batch (i.e., non-interactive or background) mode and interactive mode.
In addition to the key options of account, partition, and time limit (see below), your job script files can also contain options to request various numbers of cores, nodes, and/or computational tasks. There are also a variety of additional options you can specify in your job script files, if desired, such as email notification when a job has completed. These are all described further below.
Key Options for Job Submissions
When submitting a job, the three key options required are the account you are submitting under, the partition you are submitting to, and a maximum time limit for your job. Under some circumstances, a Quality of Service (QoS) is also needed.
Here are more details on the key options:
Each job must be submitted as part of an account, which determines which resources you have access to and how your use will be charged. Note that this account name, which you will use in your SLURM job script files, is different from your Linux account name on the cluster. For instance, for a hypothetical example user lee who has access to both the physics condo and to a Faculty Computing Allowance, their available accounts for running jobs on the cluster might be named co_physics and fc_lee, respectively. (See below for a command that you can run to find out what account name(s) you can use in your own job script files.) Users with access to only a single account (often an FCA) do not need to specify an account.
Each job must be submitted to a particular partition, which is a collection of similar or identical machines that your job will run on. The different partitions on Savio include older or newer generations of standard compute nodes, "big memory" nodes, nodes with Graphics Processing Units (GPUs), etc. (See below for a command that you can run to find out what partitions you can use in your own job script files.) For most users, the combination of the account and partition the user chooses will determine the constraints set on their job, such as job size limit, time limit, etc. Jobs submitted within a partition will be allocated to that partition's set of compute nodes based on the scheduling policy, until all resources within that partition are exhausted.
A maximum time limit for the job is required under all conditions. When running your job under a QoS that does not have a time limit (such as jobs submitted by the users of most of the cluster's Condos under their priority access QoS), you can specify a sufficiently long time limit value, but this parameter should not be omitted. Jobs submitted without providing a time limit will be rejected by the scheduler.
A QoS is a classification that determines what kind of resources your job can use. For most users, your use of a given account and partition implies a particular QoS, and therefore most users do not need to specify a QoS for standard computational jobs. However there are circumstances where a user would specify the QoS. For instance, there is a QoS option that you can select for running test jobs when you're debugging your code, which further constrains the resources available to your job and thus may reduce its cost. As well, Condo users can select a "lowprio" QoS which can make use of unused resources on the cluster, in exchange for these jobs being subject to termination when needed, in order to free resources for higher priority jobs. (See below for a command that you can run to find out what QoS options you can use in your own job script files.)
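The lookup command mentioned above is not shown in this section. As a hedged sketch, assuming SLURM's accounting tool `sacctmgr` is available on the login node, the following lists the account, partition, and QoS combinations associated with your user. The function name `list_my_associations` is a hypothetical helper introduced only for this illustration.

```shell
#!/bin/bash
# Sketch: list the account/partition/QoS combinations for the current user.
# Assumes sacctmgr is installed; falls back to a hint when it is not.
list_my_associations() {
    user_name="${USER:-$(id -un)}"
    if command -v sacctmgr >/dev/null 2>&1; then
        # -p prints pipe-delimited output that is easy to scan or parse.
        sacctmgr -p show associations user="$user_name" format=account,partition,qos
    else
        echo "sacctmgr not found: run this on a cluster login node" >&2
    fi
}

list_my_associations
```

Each output row pairs an account with a partition and the QoS values that may be used with that pair.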
Batch Jobs
When you want to run one of your jobs in batch (i.e. non-interactive or background) mode, you'll enter an sbatch command. As part of that command, you will also specify the name of, or filesystem path to, a SLURM job script file; e.g., sbatch myjob.sh
A job script specifies where and how you want to run your job on the cluster, and ends with the actual command(s) needed to run your job. The job script file looks much like a standard shell script (.sh) file, but also includes one or more lines that specify options for the SLURM scheduler; e.g.
#SBATCH --some-option-here
Although these lines start with hash signs (#), and thus are regarded as comments by the shell, they are nonetheless read and interpreted by the SLURM scheduler.
Here is a minimal example of a job script file that includes the required account, partition, and time options. It will run unattended for up to 30 seconds on one of the compute nodes in the partition_name partition, and will simply print out the words, "hello world":
#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Account:
#SBATCH --account=account_name
#
# Partition:
#SBATCH --partition=partition_name
#
# Wall clock limit:
#SBATCH --time=00:00:30
#
## Command(s) to run:
echo "hello world"
In this and other examples, account_name and partition_name are placeholders for actual values you will need to provide. See Key Options for Job Submissions, above, for information on what values to use in your own job script files.
See Job Submission With Specific Resource Requirements, below, for a set of example job script files, each illustrating how to run a specific type of job.
See Finding Output to learn where output from running your batch jobs can be found. If errors occur when running your batch job, this is the first place to look for them.
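On success, sbatch prints a confirmation line of the form "Submitted batch job" followed by the job ID. The sketch below shows one way to capture that ID so you can locate the matching log file. The helper name `job_id_from_sbatch_output` is hypothetical, and the fallback sample line (reusing the job ID from the log file example on this page) exists only so the script can be tried away from the cluster.

```shell
#!/bin/bash
# Sketch: pull the job ID out of sbatch's confirmation line.
job_id_from_sbatch_output() {
    # The job ID is the last whitespace-separated field of the line.
    printf '%s\n' "$1" | awk '{ print $NF }'
}

if command -v sbatch >/dev/null 2>&1 && [ -f myjob.sh ]; then
    submit_output=$(sbatch myjob.sh)
else
    # Not on a cluster (or no myjob.sh here): use a sample confirmation line.
    submit_output="Submitted batch job 478012"
fi

job_id=$(job_id_from_sbatch_output "$submit_output")
echo "By default, output will be written to slurm-${job_id}.out"
```

sbatch also accepts a `--parsable` flag that prints the job ID without the surrounding text, which avoids the parsing step.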
Interactive Jobs
In some instances, you may need to use software that requires user interaction, rather than running programs or scripts in batch mode. To do so, you must first start an instance of an interactive shell on a Savio compute node, within which you can then run your software on that node. To run such an interactive job on a compute node, you'll use srun. Here is a basic example that launches an interactive 'bash' shell on that node, and includes the required account, partition, and time options:
[user@ln001 ~]$ srun --pty -A account_name -p partition_name -t 00:00:30 bash -i
Once your job starts, the Linux prompt will change and indicate you are on a compute node rather than a login node:
srun: job 669120 queued and waiting for resources
srun: job 669120 has been allocated resources
[user@n0047 ~]
CGRL Jobs
- The settings for a job in Vector (Note: you don't need to set the "account"):
--partition=vector --qos=vector_batch
- The settings for a job in Rosalind (Savio1):
--partition=savio --account=co_rosalind --qos=rosalind_savio_normal
- The settings for a job in Rosalind (Savio2 HTC):
--partition=savio2_htc --account=co_rosalind --qos=rosalind_htc2_normal
Job Submission With Specific Resource Requirements
This section details options for specifying the resource requirements for your jobs. We also provide a variety of example job scripts for setting up parallelization, low-priority jobs, jobs using fewer cores than available on a node, and long-running FCA jobs.
Remember that nodes are assigned for exclusive access by your job, except in the "savio2_htc" and "savio2_gpu" partitions. So, if possible, you generally want to set SLURM options and write your code to use all the available resources on the nodes assigned to your job (e.g., 20 cores and 64 GB memory per node in the "savio" partition).
Memory Available
Also note that in all partitions except for GPU and HTC partitions, by default the full memory on the node(s) will be available to your job. On the GPU and HTC partitions you get an amount of memory proportional to the number of cores your job requests relative to the number of cores on the node. For example, if the node has 64 GB and 8 cores, and you request 2 cores, you'll have access to 16 GB memory. If you need more memory than that, you should request additional cores.
Tip
Please do not request memory using the memory-related flags available for sbatch and srun.
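The proportional memory rule for the GPU and HTC partitions can be sketched as simple arithmetic. The two function names below are hypothetical helpers, and the node sizes are the illustrative figures from the example above (64 GB, 8 cores), not a statement about any particular Savio node.

```shell
#!/bin/bash
# Sketch: memory you get on GPU/HTC partitions scales with the share of cores requested.
# Usage: available_memory_gb NODE_MEMORY_GB NODE_CORES REQUESTED_CORES
available_memory_gb() {
    echo $(( $1 * $3 / $2 ))
}

# Smallest core count whose proportional memory share covers the memory you need.
# Usage: cores_for_memory_gb NODE_MEMORY_GB NODE_CORES NEEDED_MEMORY_GB
cores_for_memory_gb() {
    # Integer ceiling of (needed * node_cores / node_memory).
    echo $(( ($3 * $2 + $1 - 1) / $1 ))
}

available_memory_gb 64 8 2     # the example from the text: 2 of 8 cores on a 64 GB node
cores_for_memory_gb 64 8 40    # cores to request if the job needs 40 GB
```

The second helper reflects the advice above: to get more memory, request more cores rather than using the memory-related flags.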
Parallelization
When submitting parallel code, you usually need to specify the number of tasks, nodes, and CPUs to be used by your job in various ways. For any given use case, there are generally multiple ways to set the options to achieve the same effect; these examples try to illustrate what we consider to be best practices.
The key options for parallelization are:
- --nodes (or -N): indicates the number of nodes to use
- --ntasks-per-node: indicates the number of tasks (i.e., processes) one wants to run on each node
- --cpus-per-task (or -c): indicates the number of CPUs to be used for each task
In addition, in some cases it can make sense to use the --ntasks (or -n) option to indicate the total number of tasks and let the scheduler determine how many nodes and tasks per node are needed. In general, --cpus-per-task will be 1 except when running threaded code.
Note that if the various options are not set, SLURM will in some cases infer what the value of the option needs to be given other options that are set and in other cases will treat the value as being 1. So some of the options set in the example below are not strictly necessary, but we give them explicitly to be clear.
Here's an example script that requests an entire Savio node and specifies 20 cores per task.
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=account_name
#SBATCH --partition=savio
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --time=00:00:30
## Command(s) to run:
echo "hello world"
Only the partition, time, and account flags are required.
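As a further sketch of how the three parallelization options combine, the script below requests two nodes with four tasks per node and five CPUs per task, which fills the 20 cores of each node in the "savio" partition. It reads the values back from environment variables that SLURM sets inside a job when the corresponding options are given. The fallback defaults are an assumption added so the script can be dry-run outside SLURM, and exporting OMP_NUM_THREADS assumes your threaded code uses OpenMP.

```shell
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=account_name
#SBATCH --partition=savio
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=5
#SBATCH --time=00:00:30

# Inside a job, SLURM sets these variables; the defaults mirror the #SBATCH lines above.
nodes="${SLURM_JOB_NUM_NODES:-2}"
tasks_per_node="${SLURM_NTASKS_PER_NODE:-4}"
cpus_per_task="${SLURM_CPUS_PER_TASK:-5}"

# Total cores = nodes x tasks per node x CPUs per task.
total_cores=$(( nodes * tasks_per_node * cpus_per_task ))
echo "Using ${total_cores} cores: ${nodes} node(s), ${tasks_per_node} task(s) per node, ${cpus_per_task} CPU(s) per task"

# For OpenMP-style threaded code, match the thread count to the CPUs per task.
export OMP_NUM_THREADS="$cpus_per_task"
```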
GPU Jobs
Please see example job script for such jobs.
The key things to remember are:
- Submit to a partition with nodes with GPUs (e.g., savio3_gpu).
- Include the --gres flag.
- Request at least two CPUs for each GPU requested, using --cpus-per-task (except when using TITAN RTX, A40, and V100 nodes in savio3_gpu, in which case you should generally use the ratios in the table below).
- You can request multiple GPUs with syntax like this (in this case for two GPUs): --gres=gpu:2.
- You can request a particular type of GPU with syntax like this (in this case requesting two TITAN RTX GPUs): --gres=gpu:TITAN:2.
  - Note that this is only relevant for savio3_gpu, for which the available types are GTX2080TI, TITAN, V100, and A40.
  - This is required for most condos in savio3_gpu (when running regular condo jobs and not low priority jobs) as detailed here.
  - This is also required for regular FCA jobs in savio3_gpu (when not using the low priority queue) as discussed here.
Submitting jobs to savio3_gpu is a bit complicated because the savio3_gpu partition contains a variety of GPU types, as documented here:
GPU type | Number of nodes | GPUs per node | CPU:GPU ratio | FCA QoS available |
---|---|---|---|---|
GTX2080TI | 9 | 4 | 2:1 | gtx2080_gpu3_normal (default), savio_lowprio |
TITAN | 6 | 8 | 4:1 | savio_lowprio |
V100 | 2 | 2 | 4:1 | v100_gpu3_normal, savio_lowprio |
A40 | 16 | 2 | 8:1 | a40_gpu3_normal, savio_lowprio |
Note that only a number of GPUs equivalent to a subset of the savio3_gpu nodes are available for regular priority FCA use:
- 7 GTX2080TI nodes
- 1 V100 node
- 8 A40 nodes
Tip
If you've submitted a job to savio3_gpu under an FCA and squeue indicates it is pending with the REASON of QOSGrpCpuLimit, that indicates that all of the GPUs of the type you requested that are available for FCA use are already being used.
Additional savio3_gpu GPUs (including TITAN GPUs) can be accessed by FCA users through the low priority queue.
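The CPU:GPU ratios in the table above can be turned into a small helper that prints matching sbatch options for savio3_gpu. This is a sketch only: the function name `cpus_per_gpu` is hypothetical, and multiplying the ratio by the GPU count assumes a job with a single task, so that --cpus-per-task carries the whole CPU request.

```shell
#!/bin/bash
# Sketch: CPUs to request per GPU in savio3_gpu, taken from the table above.
cpus_per_gpu() {
    case "$1" in
        GTX2080TI)  echo 2 ;;
        TITAN|V100) echo 4 ;;
        A40)        echo 8 ;;
        *) echo "unknown GPU type: $1" >&2; return 1 ;;
    esac
}

gpu_type="A40"
num_gpus=2

# Assumes one task per job, so all CPUs are requested through --cpus-per-task.
total_cpus=$(( $(cpus_per_gpu "$gpu_type") * num_gpus ))

echo "#SBATCH --partition=savio3_gpu"
echo "#SBATCH --gres=gpu:${gpu_type}:${num_gpus}"
echo "#SBATCH --cpus-per-task=${total_cpus}"
```

The printed lines can be pasted into a job script alongside the account, time, and (where needed) QoS options.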
Low Priority Jobs (Condos only)
As a condo contributor, you are entitled to use the extra resources that are available on the Savio cluster (across all partitions). This is done through a low priority QoS, "savio_lowprio", and your account is automatically subscribed to this QoS during the account creation stage. You do not need to request it explicitly. By using this QoS you are no longer limited by your condo size: you have access to the broader compute resource, limited only by the size of the partitions. The per-job limit is 24 nodes and 3 days runtime. Additionally, these jobs can be run on all types and generations of hardware resources in the Savio cluster, not just those matching the condo contribution. At this time, there is no limit or allowance applied to this QoS, so condo contributors are free to use as much of the resource as they need.
However this QoS does not get a priority as high as the general QoSs, such as "savio_normal" and "savio_debug", or all the condo QoSs, and it is subject to preemption when all the other QoSs become busy. Thus it has two implications:
- When the system is busy, any job that is submitted with this QoS will be pending and yield to other jobs with higher priorities.
- When the system is busy and there are higher priority jobs pending, the scheduler will preempt jobs that are running with this lower priority QoS. At submission time, the user can choose whether a preempted job should simply be killed or should be automatically requeued after it's killed. Please note that, since preemption could happen at any time, it is very beneficial if your job is capable of checkpointing/restarting by itself when you choose to requeue the job. Otherwise, you may need to verify data integrity manually before you run the job again. You can request that a preempted job be automatically resubmitted (placed back in the queue) by including the --requeue flag when submitting your job.
We provide an example job script for such jobs.
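As a minimal sketch (not the linked example script), a low priority job that asks to be requeued after preemption might look like the following. The account and partition are placeholders. SLURM_RESTART_COUNT is an environment variable SLURM sets when a job has been requeued; the default of 0 is an assumption added so the script also runs outside SLURM.

```shell
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=account_name
#SBATCH --partition=partition_name
#SBATCH --qos=savio_lowprio
#SBATCH --requeue
#SBATCH --time=00:00:30

# SLURM sets SLURM_RESTART_COUNT once a job has been requeued; default to 0 otherwise.
restart_count="${SLURM_RESTART_COUNT:-0}"

if [ "$restart_count" -gt 0 ]; then
    echo "Requeued run number ${restart_count}: resume from saved state here"
else
    echo "First run: start from the beginning"
fi
```

Branching on the restart count is one way to decide whether to load a checkpoint before continuing.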
Long Running Jobs
Most jobs running under an FCA have a maximum time limit of 72 hours (three days). However, users can run jobs using a small number of cores in the long queue, using the savio2_htc partition and the savio_long QoS.
A given job in the long queue can use no more than 4 cores and a maximum of 10 days. Collectively across the entire Savio cluster, at most 24 cores are available for long-running jobs, so you may find that your job may sit in the queue for a while before it starts.
We provide an example job script for such jobs.
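As a minimal sketch (not the linked example script), a job in the long queue that uses the maximum of 4 cores and 10 days could be written as follows. The account is a placeholder, and the time limit uses SLURM's days-hours:minutes:seconds format.

```shell
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=account_name
#SBATCH --partition=savio2_htc
#SBATCH --qos=savio_long
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=10-00:00:00

# Default mirrors the #SBATCH request so the script can be dry-run outside SLURM.
cores="${SLURM_CPUS_PER_TASK:-4}"
echo "Long-running job using ${cores} core(s)"
```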
Tip
Most condos do not have a time limit. Whether there is a limit is up to the PI who owns the condo.
Additional Job Submission Options
Here are some additional options that you can incorporate as needed into your own scripts. For the full set of available options, please see the SLURM documentation on the sbatch command.
Finding Output
Output from running a SLURM batch job is, by default, placed in a log file named slurm-%j.out, where the job's ID is substituted for %j; e.g. slurm-478012.out. This file will be created in your current directory; i.e. the directory from within which you entered the sbatch command. Also by default, both command output and error output (to stdout and stderr, respectively) are combined in this file.
To specify alternate files for command and error output use:
- --output: destination file for stdout
- --error: destination file for stderr
Email Options
If you specify your email address, the SLURM scheduler can email you when the status of your job changes. Valid notification types are BEGIN, END, FAIL, REQUEUE, and ALL, and multiple types can be separated by commas.
The required options for email notifications are:
- --mail-type: when you want to be notified
- --mail-user: your email address
Email Example
Here's a full example showing output and email options.
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=account_name
#SBATCH --partition=savio
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --time=00:00:30
#SBATCH --output=test_job_%j.out
#SBATCH --error=test_job_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=jessie.doe@berkeley.edu
## Command(s) to run:
echo "hello world"
QoS Options
While your job will use a default QoS, generally savio_normal, you can specify a different QoS, such as the debug QoS for short jobs, using the --qos flag, e.g.,
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=account_name
#SBATCH --partition=savio
#SBATCH --qos=savio_debug
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --time=00:00:30
## Command(s) to run:
echo "hello world"
Running Many Tasks in Parallel in One Job
Many users have multiple jobs that each use only a single core or a small number of cores and therefore cannot take advantage of all the cores on a Savio node. There are two tools that allow one to automate the parallelization of such jobs, in particular allowing one to group jobs into a single SLURM submission to take advantage of the multiple cores on a given Savio node.
For this purpose, we recommend the use of the community-supported GNU parallel tool. One can instead use Savio's HT Helper tool, but for users not already familiar with either tool, we recommend GNU parallel.
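As a rough sketch of the idea, the snippet below runs eight small independent tasks with as many running at once as there are cores. It assumes GNU parallel is installed and falls back to `xargs -P` otherwise, purely so the sketch still runs elsewhere. SLURM_CPUS_ON_NODE is set by SLURM inside a job; the default of 2 and the `echo` stand-in for real work are illustrative assumptions.

```shell
#!/bin/bash
# Number of tasks to run at once: all cores on the node inside a job, else a small default.
ncores="${SLURM_CPUS_ON_NODE:-2}"

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # GNU parallel replaces {} with each input line and keeps $ncores tasks running.
    results=$(seq 1 8 | parallel -j "$ncores" echo "finished task {}")
else
    # Fallback for machines without GNU parallel.
    results=$(seq 1 8 | xargs -P "$ncores" -I{} echo "finished task {}")
fi

printf '%s\n' "$results"
```

In a real job, the `echo` command would be replaced by the single-core program you want to run once per input.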
Array Jobs
Job arrays allow many jobs to be submitted simultaneously with the same requirements. Within each job the environment variable $SLURM_ARRAY_TASK_ID is set and can be used to alter the execution of each job in the array. Note that, as is true for regular jobs, each job in the array will be allocated one or more entire nodes (except for the HTC or GPU partitions), so job arrays are not a way to bundle multiple tasks to run on a single node.
By default, output from job arrays is placed in a series of log files named slurm-%A_%a.out, where %A is the overall job ID and %a is the task ID.
For example, the following script would write "I am task 0" to array_job_XXXXXX_task_0.out, "I am task 1" to array_job_XXXXXX_task_1.out, etc., where XXXXXX is the job ID.
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=account_name
#SBATCH --partition=savio
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --time=00:00:30
#SBATCH --array=0-31
#SBATCH --output=array_job_%A_task_%a.out
#SBATCH --error=array_job_%A_task_%a.err
## Command(s) to run:
echo "I am task $SLURM_ARRAY_TASK_ID"
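A common way to use the task ID is to pick a different parameter or input for each array task. The sketch below assumes a job submitted with --array=0-3; the parameter values and the helper name `param_for_task` are hypothetical, and the default of 0 only lets the script run outside SLURM.

```shell
#!/bin/bash
# Sketch: map an array task ID (0-based) to one entry of a parameter list.
param_for_task() {
    # Line N+1 of the list corresponds to task ID N.
    printf '%s\n' 0.1 0.5 1.0 2.0 | sed -n "$(( $1 + 1 ))p"
}

# Inside an array job SLURM sets SLURM_ARRAY_TASK_ID; default to 0 elsewhere.
task_id="${SLURM_ARRAY_TASK_ID:-0}"
param=$(param_for_task "$task_id")

echo "Task ${task_id} will use parameter ${param}"
```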
Checkpointing/Restarting your Jobs
Checkpointing/restarting is a technique in which an application's state is saved to the filesystem so that computation can later be restarted from that saved state. This minimizes lost computation time when, for example, the job reaches its allowed wall clock limit, the job is preempted, or a software or hardware fault occurs. By checkpointing or saving intermediate results frequently, you won't lose as much work if your jobs are preempted or otherwise terminated prematurely.
If you wish to do checkpointing, your first step should always be to check if your application already has such capabilities built-in, as that is the most stable and safe way of doing it. Applications that are known to have some sort of native checkpointing include: Abaqus, Amber, Gaussian, GROMACS, LAMMPS, NAMD, NWChem, Quantum Espresso, STAR-CCM+, and VASP.
If your program does not natively support checkpointing, it may be worthwhile to consider a generic, application-agnostic checkpoint/restart solution. One example of such a solution is DMTCP: Distributed MultiThreaded CheckPointing. DMTCP is a checkpoint-restart solution that works outside the flow of user applications, enabling their state to be saved without application alterations. You can find its reference quick-start documentation here.
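To illustrate the general idea of saving intermediate state, here is a small application-level sketch in plain shell. It is not DMTCP and not any particular application's built-in mechanism. The variable CHECKPOINT_FILE, the step count, and the placeholder for real work are all assumptions made for the illustration.

```shell
#!/bin/bash
# Sketch: resume a multi-step computation from the last completed step.
# CHECKPOINT_FILE is a hypothetical variable; a temporary file is used if it is unset.
checkpoint_file="${CHECKPOINT_FILE:-$(mktemp)}"
total_steps=5

# Resume from the saved step if a non-empty checkpoint exists; otherwise start at 0.
start_step=0
if [ -s "$checkpoint_file" ]; then
    start_step=$(cat "$checkpoint_file")
fi

step="$start_step"
while [ "$step" -lt "$total_steps" ]; do
    # ... one unit of real work would go here ...
    step=$(( step + 1 ))
    # Write to a temporary file and rename it, so an interruption cannot leave a half-written checkpoint.
    printf '%s\n' "$step" > "${checkpoint_file}.tmp" && mv "${checkpoint_file}.tmp" "$checkpoint_file"
done

echo "Completed ${step} of ${total_steps} steps (started at step ${start_step})"
```

If the job is killed partway through and then rerun with the same checkpoint file, the loop picks up at the first step that had not yet been recorded.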
Migrating from Other Schedulers to SLURM
We provide some tips on migrating to SLURM for users familiar with Torque/PBS.
For users coming from other schedulers, such as Platform LSF, SGE/OGE, or LoadLeveler, please use this link to find a quick reference.