Submitting Jobs
Summary
Here we give details on job submission for various kinds of jobs in both batch (i.e., non-interactive or background) mode and interactive mode.
In addition to the key options of account, partition, and time limit (see below), your job script files can also contain options to request various numbers of cores, nodes, and/or computational tasks. And there are a variety of additional options you can specify in your batch files, if desired, such as email notification options when a job has completed. These are all described further below.
Key Options for Job Submissions¶
When submitting a job, the three key options required are the account you are submitting under, the partition you are submitting to, and a maximum time limit for your job. Under some circumstances, a Quality of Service (QoS) is also needed.
Here are more details on the key options:
Each job must be submitted as part of an account, which determines which resources you have access to and how your use will be charged. Note that this account name, which you will use in your SLURM job script files, is different from your Linux account name on the cluster. For instance, for a hypothetical example user lee who has access to both the physics condo and to a Faculty Computing Allowance, their available accounts for running jobs on the cluster might be named co_physics and fc_lee, respectively. (See below for a command that you can run to find out what account name(s) you can use in your own job script files.) Users with access to only a single account (often an FCA) do not need to specify an account.
Each job must be submitted to a particular partition, which is a collection of similar or identical machines that your job will run on. The different partitions on Savio include older or newer generations of standard compute nodes, "big memory" nodes, nodes with Graphics Processing Units (GPUs), etc. (See below for a command that you can run to find out what partitions you can use in your own job script files.) For most users, the combination of the account and partition the user chooses will determine the constraints set on their job, such as job size limit, time limit, etc. Jobs submitted within a partition will be allocated to that partition's set of compute nodes based on the scheduling policy, until all resources within that partition are exhausted.
A maximum time limit for the job is required under all conditions. When running your job under a QoS that does not have a time limit (such as jobs submitted by the users of most of the cluster's Condos under their priority access QoS), you can specify a sufficiently long time limit value, but this parameter should not be omitted. Jobs submitted without providing a time limit will be rejected by the scheduler.
A QoS is a classification that determines what kind of resources your job can use. For most users, your use of a given account and partition implies a particular QoS, and therefore most users do not need to specify a QoS for standard computational jobs. However there are circumstances where a user would specify the QoS. For instance, there is a QoS option that you can select for running test jobs when you're debugging your code, which further constrains the resources available to your job and thus may reduce its cost. As well, Condo users can select a "lowprio" QoS which can make use of unused resources on the cluster, in exchange for these jobs being subject to termination when needed, in order to free resources for higher priority jobs. (See below for a command that you can run to find out what QoS options you can use in your own job script files.)
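The account, partition, and QoS combinations available to you can be looked up with Slurm's accounting tool. The following is a minimal sketch, assuming the `sacctmgr` client is on your path (as it normally is on a login node). The helper name `show_my_associations` and the chosen `format=` field list are illustrative assumptions, not part of the cluster's own tooling.

```shell
#!/bin/bash
# Hypothetical helper: list the account / partition / QoS combinations
# associated with a user (defaults to the current user).
show_my_associations() {
    local user="${1:-${USER:-$(id -un)}}"
    if command -v sacctmgr >/dev/null 2>&1; then
        # -p prints pipe-delimited output that is easy to scan or parse.
        sacctmgr -p show associations user="$user" format=Account,Partition,QOS
    else
        echo "sacctmgr not found; run this on a cluster login node" >&2
        return 0
    fi
}

show_my_associations
```

Each output row pairs an account with a partition and the QoS values you may use with that pair.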
Reasons for Job Submission Failures¶
Two common reasons a job may not be submitted successfully are:
- "Invalid account or account/partition combination specified": This indicates you're trying to use an account or account-partition pair that doesn't exist or you don't have access to.
- "This user/account pair does not have enough service units": This indicates your FCA is out of service units, either because they are all used up for the year or because the FCA was not renewed at the start of the yearly cycle in June.
Batch Jobs¶
When you want to run one of your jobs in batch (i.e. non-interactive or background) mode, you'll enter an sbatch command. As part of that command, you will also specify the name of, or filesystem path to, a SLURM job script file; e.g., sbatch myjob.sh
A job script specifies where and how you want to run your job on the cluster, and ends with the actual command(s) needed to run your job. The job script file looks much like a standard shell script (.sh) file, but also includes one or more lines that specify options for the SLURM scheduler; e.g.
#SBATCH --some-option-here
Although these lines start with hash signs (#), and thus are regarded as comments by the shell, they are nonetheless read and interpreted by the SLURM scheduler.
Here is a minimal example of a job script file that includes the required account, partition, and time options. It will run unattended for up to 30 seconds on one of the compute nodes in the partition_name partition, and will simply print out the words, "hello world":
#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Account:
#SBATCH --account=account_name
#
# Partition:
#SBATCH --partition=partition_name
#
# Wall clock limit:
#SBATCH --time=00:00:30
#
## Command(s) to run:
echo "hello world"
In this and other examples, account_name and partition_name are placeholders for actual values you will need to provide. See Key Options for Job Submissions, above, for information on what values to use in your own job script files.
See Job submission with specific resource requirements, below, for a set of example job script files, each illustrating how to run a specific type of job.
See Finding output to learn where output from running your batch jobs can be found. If errors occur when running your batch job, this is the first place to look for these.
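As a rough illustration of the whole batch workflow, the sketch below writes the minimal job script to a file, then submits it and lists your queued jobs. It assumes you are on a login node with the Slurm client installed; `account_name` and `partition_name` remain placeholders, and the guard around `sbatch` is only there so the snippet does nothing harmful elsewhere.

```shell
#!/bin/bash
# Write the minimal job script to a file (myjob.sh is the example name used above).
cat > myjob.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=account_name
#SBATCH --partition=partition_name
#SBATCH --time=00:00:30
echo "hello world"
EOF

# Submit only if the Slurm client is available (i.e., on the cluster).
if command -v sbatch >/dev/null 2>&1; then
    sbatch myjob.sh      # prints the job ID assigned by the scheduler
    squeue -u "$USER"    # lists your pending and running jobs
else
    echo "sbatch not found; myjob.sh was written but not submitted"
fi
```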
Interactive Jobs¶
In some instances, you may need to use software that requires user interaction, rather than running programs or scripts in batch mode. To do so, you must first start an instance of an interactive shell on a Savio compute node, within which you can then run your software on that node. To run such an interactive job on a compute node, you'll use srun. Here is a basic example that launches an interactive 'bash' shell on that node, and includes the required account, partition, and time limit options:
[user@ln001 ~]$ srun --pty -A account_name -p partition_name -t 00:00:30 bash -i
Once your job starts, the Linux prompt will change and indicate you are on a compute node rather than a login node:
srun: job 669120 queued and waiting for resources
srun: job 669120 has been allocated resources
[user@n0047 ~]
CGRL Jobs¶
- The settings for a job in Vector (Note: you don't need to set the "account"):
--partition=vector --qos=vector_batch
- The settings for a job in Rosalind (Savio2 HTC):
--partition=savio2_htc --account=co_rosalind --qos=rosalind_htc2_normal
Job Submission With Specific Resource Requirements¶
This section details options for specifying the resource requirements for your jobs. We also provide a variety of example job scripts for setting up parallelization, low-priority jobs, jobs using fewer cores than available on a node, and long-running FCA jobs.
Per-core versus Per-node Scheduling¶
In many Savio partitions, nodes are assigned for exclusive access by your job. So, if possible, when running jobs in those partitions you generally want to set SLURM options and write your code to use all the available resources on the nodes assigned to your job (e.g., either 32 or 40 cores and 96 GB memory per node in the "savio3" partition).
The exceptions are the "HTC" and "GPU" partitions: savio2_htc, savio3_htc, savio4_htc, savio3_gpu, and savio4_gpu, where individual cores are assigned to jobs.
Savio is transitioning to per-core scheduling
With the Savio4 generation of hardware, we are moving away from per-node scheduling and towards per-core scheduling. There is no savio4 partition, just savio4_htc.
To request all the memory on a node in the per-core-scheduled partitions, you can use the --exclusive Slurm flag.
Memory Available¶
Do not request memory explicitly
In most cases, please do not request memory using the memory-related flags available for sbatch and srun.
Per-node partitions¶
In partitions in which jobs are allocated entire nodes, by default the full memory on the node(s) will be available to your job. There is no need to pass any memory-related flags when you start your job.
Per-core (HTC and GPU) partitions¶
On the GPU and HTC partitions you get an amount of memory proportional to the number of cores your job requests relative to the number of cores on the node. For example, if the node has 64 GB and 8 cores, and you request 2 cores, you'll have access to 16 GB memory (1/4 of 64 GB).
If you need more memory than the default memory provided per core, you should request additional cores.
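The proportional rule above is simple arithmetic. Here is a small sketch of it; the function name `default_mem_gb` is a hypothetical helper for illustration, not a Slurm command.

```shell
#!/bin/bash
# Default memory (GB) on a per-core partition: node memory scaled by the
# fraction of the node's cores that the job requests (integer arithmetic).
default_mem_gb() {
    local node_mem_gb=$1 node_cores=$2 requested_cores=$3
    echo $(( node_mem_gb * requested_cores / node_cores ))
}

default_mem_gb 64 8 2   # the example from the text: prints 16
```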
The savio4_htc partition has some nodes with 256 GB and some with 512 GB. By default, regardless of which node you end up on, your job will be allocated 4 GB per core. To request more memory, you have a few options:
- You can request more cores, asking for enough cores that 4 GB times the number of cores gives enough memory.
- You can request that your job use the 512 GB nodes by adding -C savio4_m512. Your job will be allocated 8 GB per core. If you need more, you should request more cores.
- Users of savio4_htc condos may request more than the default memory per core by using the --mem-per-cpu flag. However, such users should be aware that this will reduce the resources available for use by other jobs in the condo. For an extreme example, a condo job requesting a few cores and all the memory purchased by the condo will prevent any other jobs from running in the condo, because the memory available to the condo is fully allocated. Thus, there is little difference between using --mem-per-cpu and simply requesting additional cores when running savio4_htc condo jobs. The --mem flag should only be used under special circumstances, as it requests memory per node and users don't generally control the number of cores allocated per node.
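To work out how many cores to request for a given memory need on savio4_htc, you can round up the memory requirement divided by the per-core allocation. The sketch below assumes the 4 GB and 8 GB per-core figures quoted above; `cores_for_mem` is a hypothetical helper name.

```shell
#!/bin/bash
# Smallest number of cores whose default memory covers a requirement,
# using ceiling division. Hypothetical helper, not a Slurm command.
cores_for_mem() {
    local needed_gb=$1 gb_per_core=$2
    echo $(( (needed_gb + gb_per_core - 1) / gb_per_core ))
}

cores_for_mem 30 4   # default allocation, 4 GB per core: prints 8
cores_for_mem 30 8   # with -C savio4_m512, 8 GB per core: prints 4
```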
Parallelization¶
When submitting parallel code, you usually need to specify the number of tasks, nodes, and CPUs to be used by your job in various ways. For any given use case, there are generally multiple ways to set the options to achieve the same effect; these examples try to illustrate what we consider to be best practices.
The key options for parallelization are:
- --nodes (or -N): indicates the number of nodes to use
- --ntasks-per-node: indicates the number of tasks (i.e., processes) one wants to run on each node
- --cpus-per-task (or -c): indicates the number of cpus to be used for each task
In addition, in some cases it can make sense to use the --ntasks (or -n) option to indicate the total number of tasks and let the scheduler determine how many nodes and tasks per node are needed. In general --cpus-per-task will be 1 except when running threaded code.
Note that if the various options are not set, SLURM will in some cases infer what the value of the option needs to be given other options that are set and in other cases will treat the value as being 1. So some of the options set in the example below are not strictly necessary, but we give them explicitly to be clear.
Here's an example script that requests an entire Savio3 node and specifies 32 cores per task.
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=account_name
#SBATCH --partition=savio3
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --time=00:00:30
## Command(s) to run:
echo "hello world"
Only the partition, time, and account flags are required.
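For threaded code, a common pattern is to pass the value of --cpus-per-task on to the program through the SLURM_CPUS_PER_TASK environment variable, which Slurm sets inside the job when that option is given. The following is a sketch under that assumption; `my_threaded_program` is a hypothetical executable, and the use of OMP_NUM_THREADS assumes an OpenMP-based program.

```shell
#!/bin/bash
#SBATCH --job-name=threaded_test
#SBATCH --account=account_name
#SBATCH --partition=savio3
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --time=00:00:30

# Fall back to 1 thread so the script also runs outside a Slurm job.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Running with $OMP_NUM_THREADS thread(s)"
# ./my_threaded_program   # hypothetical executable; replace with your own
```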
GPU Jobs¶
Please see example job script for such jobs.
The key things to remember are:
- Submit to a partition with nodes with GPUs (e.g., savio3_gpu).
- Include the --gres flag.
- Request multiple CPU cores for each GPU requested, using --cpus-per-task, based on the GPU type, using the ratios given in the table below.
- You can request multiple GPUs with syntax like this (in this case for two GPUs): --gres=gpu:2.
- You can request a particular type of GPU with syntax like this (in this case requesting two A5000 RTX GPUs): --gres=gpu:A5000:2.
  - This is required for most condos in savio3_gpu and savio4_gpu (when running regular condo jobs and not low priority jobs) with the GPU types available for specific condos detailed here.
  - This is also required for regular FCA jobs in savio3_gpu and savio4_gpu (when not using the low priority queue) as discussed here.
- Also, if using an FCA to request an A40 or V100 GPU, you also need to explicitly specify the QoS via -q a40_gpu3_normal or -q v100_gpu3_normal, respectively.
This table shows the ratio of CPU cores to GPUs that you should request when submitting GPU jobs, as well as the GPU types available:
partition | GPU type | GPUs per node | CPU:GPU ratio | FCA QoS available |
---|---|---|---|---|
savio2_1080ti | 1080TI | 4 | 2:1 | savio_normal, savio_lowprio |
savio3_gpu | GTX2080TI | 4 | 2:1 | gtx2080_gpu3_normal, savio_lowprio |
savio3_gpu | TITAN | 8 | 4:1 | savio_lowprio |
savio3_gpu | V100 | 2 | 4:1 | v100_gpu3_normal, savio_lowprio |
savio3_gpu | A40 | 2 | 8:1 | a40_gpu3_normal, savio_lowprio |
savio4_gpu | A5000 | 8 | 4:1 | a5k_gpu4_normal, savio_lowprio |
savio4_gpu | L40 | 8 | 8:1 | savio_lowprio |
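Putting the points and the table above together, a GPU job script under an FCA might look like the sketch below. It requests one A40 GPU on savio3_gpu with the 8:1 CPU:GPU ratio and the a40_gpu3_normal QoS from the table; `account_name` is a placeholder assumed to be an FCA account, and the final lines only restate the arithmetic.

```shell
#!/bin/bash
#SBATCH --job-name=gpu_test
#SBATCH --account=account_name
#SBATCH --partition=savio3_gpu
#SBATCH --qos=a40_gpu3_normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:A40:1
#SBATCH --cpus-per-task=8
#SBATCH --time=00:00:30

# 1 GPU x 8 CPU cores per A40 GPU (from the table) = 8 CPU cores.
gpus=1
cpus_per_gpu=8
echo "Requesting $(( gpus * cpus_per_gpu )) CPU cores for $gpus GPU(s)"
```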
Submitting jobs to savio3_gpu is a bit complicated because the savio3_gpu partition contains a variety of GPU types, as indicated in the relevant rows above.
Note that only a number of GPUs equivalent to a subset of the savio3_gpu and savio4_gpu nodes are available for regular priority FCA use:
- 28 GTX2080TI GPUs
- 2 V100 GPUs
- 16 A40 GPUs
- 136 A5000 GPUs (savio4_gpu)
Job not starting because of QOSGrpCpuLimit or QOSGrpGRES
If you've submitted a job to savio3_gpu or savio4_gpu under an FCA and squeue indicates it is pending with the REASON of QOSGrpCpuLimit or QOSGrpGRES, that indicates that all of the GPUs of the type you requested (or the CPUs that are allocated proportional to GPUs) that are available for FCA use are already being used.
Job not starting because of QOSMinGRES
If you've submitted a job to savio3_gpu or savio4_gpu under an FCA and squeue indicates it is pending with the REASON of QOSMinGRES, that indicates that you forgot to provide the GPU type in your --gres flag, or that you've requested a GPU type not available for FCA use apart from low-priority use (currently this should only apply to TITAN GPUs in savio3_gpu), or that you haven't requested a QoS when requesting an A40 or V100 GPU on savio3_gpu.
Additional savio3_gpu GPUs (including TITAN GPUs) and savio4_gpu GPUs can be accessed by FCA users through the low priority queue.
Do not use or modify CUDA_VISIBLE_DEVICES
The environment variable CUDA_VISIBLE_DEVICES will be set to reference the physical GPUs available to your job. You should not modify this variable as it would cause conflicts with GPU usage by other users on the node. Furthermore, in your code, you should refer to the GPU starting with 0 (and then 1 if you request two GPUs, etc.). The library you are using (e.g., PyTorch) should remap from the ID values in CUDA_VISIBLE_DEVICES to values starting at 0.
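To make the remapping concrete, the sketch below lists the logical indices (starting at 0) that correspond to a CUDA_VISIBLE_DEVICES-style list. It only reads the variable and never changes it. The function name `logical_gpu_ids` is a hypothetical helper, and the assumption is that the variable holds a comma-separated list of device IDs.

```shell
#!/bin/bash
# Print the logical GPU indices (0, 1, ...) corresponding to a
# comma-separated device list, without modifying the variable itself.
logical_gpu_ids() {
    local visible="$1"
    local -a physical=()
    [ -n "$visible" ] && IFS=',' read -ra physical <<< "$visible"
    local i
    for i in "${!physical[@]}"; do
        echo "$i"
    done
}

# Read-only use of the variable that is set for the job.
logical_gpu_ids "${CUDA_VISIBLE_DEVICES:-}"
```

For instance, if the job were given physical GPUs 2 and 3, `logical_gpu_ids "2,3"` would print 0 and 1, which are the indices your code should use.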
Low Priority Jobs (Condos only)¶
As a condo contributor, you are entitled to use the extra resources that are available on the Savio cluster (across all partitions). This is done through a low priority QoS "savio_lowprio" and your account is automatically subscribed to this QoS during the account creation stage. You do not need to request it explicitly. By using this QoS you are no longer limited by your condo size: you have access to the broader compute resources, limited only by the size of the partitions. The per-job limit is 24 nodes and 3 days runtime. Additionally, these jobs can be run on all types and generations of hardware resources in the Savio cluster, not just those matching the condo contribution. At this time, there is no limit or allowance applied to this QoS, so condo contributors are free to use as much of the resource as they need.
However this QoS does not get a priority as high as the general QoSs, such as "savio_normal" and "savio_debug", or all the condo QoSs, and it is subject to preemption when all the other QoSs become busy. Thus it has two implications:
- When the system is busy, any job that is submitted with this QoS will be pending and yield to other jobs with higher priorities.
- When the system is busy and there are higher priority jobs pending, the scheduler will preempt jobs that are running with this lower priority QoS. At submission time, the user can choose whether a preempted job should simply be killed or should be automatically requeued after it's killed. Please note that, since preemption could happen at any time, it is very beneficial if your job is capable of checkpointing/restarting by itself, when you choose to requeue the job. Otherwise, you may need to verify data integrity manually before you run the job again. You can request that a preempted job be automatically resubmitted (placed back in the queue) by including the --requeue flag when submitting your job.
We provide an example job script for such jobs
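As a rough sketch of how the pieces fit together, a low priority job that asks to be requeued on preemption could look like the following. `account_name` is assumed to be a condo account and `partition_name` is a placeholder. The script also checks SLURM_RESTART_COUNT, an environment variable Slurm sets for requeued jobs, to decide whether it is resuming; what you do on resume depends on your own checkpointing.

```shell
#!/bin/bash
#SBATCH --job-name=lowprio_test
#SBATCH --account=account_name
#SBATCH --partition=partition_name
#SBATCH --qos=savio_lowprio
#SBATCH --requeue
#SBATCH --time=00:00:30

# Defaults to 0 on the first run (and when run outside Slurm).
restarts="${SLURM_RESTART_COUNT:-0}"
if [ "$restarts" -gt 0 ]; then
    echo "Job was requeued $restarts time(s); resume from your last checkpoint"
else
    echo "First run; starting from scratch"
fi
```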
Long Running Jobs¶
Most jobs running under an FCA have a maximum time limit of 72 hours (three days). However, users can run jobs using a small number of cores in the long queue, using the savio2_htc partition and the savio_long QoS.
A given job in the long queue can use no more than 4 cores and can run for a maximum of 10 days. Collectively across the entire Savio cluster, at most 24 cores are available for long-running jobs, so your job may sit in the queue for a while before it starts.
We provide an example job script for such jobs
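A sketch of a job script at the limits described above (4 cores, 10 days) follows. `account_name` is a placeholder assumed to be an FCA account, and the time value uses Slurm's days-hours:minutes:seconds format.

```shell
#!/bin/bash
#SBATCH --job-name=long_test
#SBATCH --account=account_name
#SBATCH --partition=savio2_htc
#SBATCH --qos=savio_long
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=10-00:00:00

# 10 days expressed in hours, for reference.
limit_days=10
echo "Wall clock limit: $(( limit_days * 24 )) hours"
```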
Tip
Most condos do not have a time limit. Whether there is a limit is up to the PI who owns the condo.
Additional Job Submission Options¶
Here are some additional options that you can incorporate as needed into your own scripts. For the full set of available options, please see the SLURM documentation on the sbatch command.
Finding Output¶
Output from running a SLURM batch job is, by default, placed in a log file named slurm-%j.out, where the job's ID is substituted for %j; e.g. slurm-478012.out
This file will be created in your current directory; i.e. the directory from within which you entered the sbatch command. Also by default, both command output and error output (to stdout and stderr, respectively) are combined in this file.
To specify alternate files for command and error output use:
- --output: destination file for stdout
- --error: destination file for stderr
Email Options¶
If you specify your email address, the SLURM scheduler can email you when the status of your job changes. Valid notification types are BEGIN, END, FAIL, REQUEUE, and ALL, and multiple types can be separated by commas.
The required options for email notifications are:
- --mail-type: when you want to be notified
- --mail-user: your email address
Email Example¶
Here's a full example showing output and email options.
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=account_name
#SBATCH --partition=savio3_htc
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=00:00:30
#SBATCH --output=test_job_%j.out
#SBATCH --error=test_job_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=jessie.doe@berkeley.edu
## Command(s) to run:
echo "hello world"
QoS Options¶
While your job will use a default QoS, generally savio_normal, you can specify a different QoS, such as the debug QoS for short jobs, using the --qos flag, e.g.,
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=account_name
#SBATCH --partition=savio3_htc
#SBATCH --qos=savio_debug
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=00:00:30
## Command(s) to run:
echo "hello world"
Running Many Tasks in Parallel in One Job¶
Many users have multiple jobs that each use only a single core or a small number of cores and therefore cannot take advantage of all the cores on a Savio node. There are two tools that allow one to automate the parallelization of such jobs, in particular allowing one to group jobs into a single SLURM submission to take advantage of the multiple cores on a given Savio node.
For this purpose, we recommend the use of the community-supported GNU parallel tool. One can instead use Savio's HT Helper tool, but for users not already familiar with either tool, we recommend GNU parallel.
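As a minimal sketch of the GNU parallel approach, the script below runs a small per-input function across the cores of one node. It assumes GNU parallel is installed (or loaded as a module) and that the shell is bash, so the exported function is visible to parallel. `run_one`, the input list, and the fallback of 2 concurrent jobs are hypothetical; SLURM_CPUS_ON_NODE is set by Slurm inside a job.

```shell
#!/bin/bash
#SBATCH --job-name=parallel_test
#SBATCH --account=account_name
#SBATCH --partition=partition_name
#SBATCH --nodes=1
#SBATCH --time=00:00:30

# Hypothetical per-input task; replace with your own command.
run_one() { echo "processing input $1"; }
export -f run_one

# Run as many tasks at once as there are cores on the node.
njobs="${SLURM_CPUS_ON_NODE:-2}"

if parallel --version 2>/dev/null | grep -q "GNU parallel"; then
    parallel -j "$njobs" run_one ::: 1 2 3 4
else
    echo "GNU parallel not found; install or load it first" >&2
fi
```

The order in which the tasks print their output is not guaranteed, since they run concurrently.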
Array Jobs¶
Job arrays allow many jobs to be submitted simultaneously with the same requirements. Within each job the environment variable $SLURM_ARRAY_TASK_ID is set and can be used to alter the execution of each job in the array. Note that as is true for regular jobs, each job in the array will be allocated one or more entire nodes (except for the HTC or GPU partitions), so job arrays are not a way to bundle multiple tasks to run on a single node.
By default, output from job arrays is placed in a series of log files named slurm-%A_%a.out, where %A is the overall job ID and %a is the task ID.
For example, the following script would write "I am task 0" to array_job_XXXXXX_task_0.out, "I am task 1" to array_job_XXXXXX_task_1.out, etc., where XXXXXX is the job ID.
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=account_name
#SBATCH --partition=savio3_htc
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:00:30
#SBATCH --array=0-31
#SBATCH --output=array_job_%A_task_%a.out
#SBATCH --error=array_job_%A_task_%a.err
## Command(s) to run:
echo "I am task $SLURM_ARRAY_TASK_ID"
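A common way to use the task ID to alter each job's behavior is to index into a list of inputs. The sketch below assumes a three-element array job; the parameter values are made up for illustration, and the fallback to task 0 only exists so the script can be tried outside Slurm.

```shell
#!/bin/bash
#SBATCH --job-name=array_params
#SBATCH --account=account_name
#SBATCH --partition=savio3_htc
#SBATCH --ntasks=1
#SBATCH --time=00:00:30
#SBATCH --array=0-2

# Hypothetical parameter list, one entry per array task.
params=(0.1 0.5 1.0)

task_id="${SLURM_ARRAY_TASK_ID:-0}"
param="${params[$task_id]}"
echo "Task $task_id uses parameter $param"
```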
Checkpointing/Restarting your Jobs¶
Checkpointing/restarting is a technique in which an application's state is stored in the filesystem, allowing the user to restart computation from this saved state and so minimize lost computation time, for example when the job reaches its allowed wall clock limit, is preempted, or hits a software or hardware fault. By checkpointing or saving intermediate results frequently, you won't lose as much work if your jobs are preempted or otherwise terminated prematurely.
If you wish to do checkpointing, your first step should always be to check if your application already has such capabilities built-in, as that is the most stable and safe way of doing it. Applications that are known to have some sort of native checkpointing include: Abaqus, Amber, Gaussian, GROMACS, LAMMPS, NAMD, NWChem, Quantum Espresso, STAR-CCM+, and VASP.
If your program does not natively support checkpointing, consider a generic, application-agnostic checkpoint/restart solution. One example of such a solution is DMTCP: Distributed MultiThreaded CheckPointing. DMTCP works outside the flow of user applications, enabling their state to be saved without application alterations. You can find its reference quick-start documentation here.
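To illustrate the general idea at the application level (this is not DMTCP, just a hand-rolled sketch), the loop below records its progress in a file after each step and resumes from that file when restarted. The function name, the checkpoint file location, and the step count are all illustrative assumptions.

```shell
#!/bin/bash
# A loop that saves the number of completed steps to a file and resumes
# from that point if the file already exists.
run_with_checkpoint() {
    local ckpt_file=$1 total_steps=$2
    local start=0 step
    # Resume from the saved step if a checkpoint exists.
    [ -f "$ckpt_file" ] && start=$(cat "$ckpt_file")
    for (( step = start; step < total_steps; step++ )); do
        : # ... one unit of real work would go here ...
        echo $(( step + 1 )) > "$ckpt_file"   # record completed steps
    done
    echo "completed $total_steps steps (resumed from step $start)"
}

demo_ckpt="${TMPDIR:-/tmp}/demo_checkpoint.$$"
run_with_checkpoint "$demo_ckpt" 5
rm -f "$demo_ckpt"
```

In a real job the checkpoint would hold whatever state your program needs to continue, and it would be written to a filesystem that survives the job, such as your scratch directory.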
Migrating from Other Schedulers to SLURM¶
We provide some tips on migrating to SLURM for users familiar with Torque/PBS.
For users coming from other schedulers, such as Platform LSF, SGE/OGE, or LoadLeveler, please use this link to find a quick reference.