Submitting Jobs
In addition to the key options of account, partition, and time limit (see below), your job script files can also contain options requesting various numbers of cores, nodes, and/or computational tasks. A variety of additional options can be specified in your batch files if desired, such as email notification when a job has completed. These are all described further below.
Key Options for Job Submissions
When submitting a job, the three required options are the account you are submitting under, the partition you are submitting to, and a maximum time limit for your job. Under some circumstances, a Quality of Service (QoS) is also needed.
Here are more details on the key options:
- Each job must be submitted as part of an account, which determines which resources you have access to and how your use will be charged. Note that this account name, which you will use in your SLURM job script files, is different from your Linux account name on the cluster. For instance, for a hypothetical example user lee who has access to both the physics condo and to a Faculty Computing Allowance, their available accounts for running jobs on the cluster might be named co_physics and fc_lee, respectively. (See below for a command that you can run to find out what account name(s) you can use in your own job script files.) Users with access to only a single account (often an FCA) do not need to specify an account.
- Each job must be submitted to a particular partition, which is a collection of similar or identical machines that your job will run on. The different partitions on Savio include older or newer generations of standard compute nodes, "big memory" nodes, nodes with Graphics Processing Units (GPUs), etc. (See below for a command that you can run to find out what partitions you can use in your own job script files.) For most users, the combination of the account and partition they choose will determine the constraints set on their job, such as job size limit, time limit, etc. Jobs submitted within a partition will be allocated to that partition's set of compute nodes based on the scheduling policy, until all resources within that partition are exhausted.
- A maximum time limit for the job is required under all conditions. When running your job under a QoS that does not have a time limit (such as jobs submitted by the users of most of the cluster's Condos under their priority-access QoS), you can specify a sufficiently long time limit value, but this parameter must not be omitted. Jobs submitted without a time limit will be rejected by the scheduler.
- A QoS is a classification that determines what kind of resources your job can use. For most users, use of a given account and partition implies a particular QoS, and therefore most users do not need to specify a QoS for standard computational jobs. However, there are circumstances in which a user would specify the QoS. For instance, there is a QoS option that you can select for running test jobs when you're debugging your code, which further constrains the resources available to your job and thus may reduce its cost. As well, Condo users can select a "lowprio" QoS that can make use of unused resources on the cluster, in exchange for these jobs being subject to termination when needed, in order to free resources for higher-priority jobs. (See below for a command that you can run to find out what QoS options you can use in your own job script files.)
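For example, one way to see which accounts, partitions, and QoS options are associated with your username is Slurm's sacctmgr command, shown here as a sketch (the exact output columns can vary with the Slurm version and site configuration):
# List the account/partition/QoS combinations available to you
sacctmgr -p show associations user=$USER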
Reasons for Job Submission Failures
Two common reasons a job may not be submitted successfully are:
- "Invalid account or account/partition combination specified": This indicates you're trying to use an account or account-partition pair that doesn't exist or you don't have access to.
- "This user/account pair does not have enough service units": This indicates your FCA is out of service units, either because they are all used up for the year or because the FCA was not renewed at the start of the yearly cycle in June.
Batch Jobs
When you want to run one of your jobs in batch (i.e. non-interactive or background) mode, you'll enter an sbatch command. As part of that command, you will also specify the name of, or filesystem path to, a SLURM job script file; e.g., sbatch myjob.sh
A job script specifies where and how you want to run your job on the cluster, and ends with the actual command(s) needed to run your job. The job script file looks much like a standard shell script (.sh) file, but also includes one or more lines that specify options for the SLURM scheduler; e.g.
#SBATCH --some-option-here
Although these lines start with hash signs (#), and thus are regarded as comments by the shell, they are nonetheless read and interpreted by the SLURM scheduler.
Here is a minimal example of a job script file that includes the required account, partition, and time options. It will run unattended for up to 30 seconds on one of the compute nodes in the partition_name partition, and will simply print out the words, "hello world":
#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Account:
#SBATCH --account=<account_name>
#
# Partition:
#SBATCH --partition=<partition_name>
#
# Wall clock limit:
#SBATCH --time=00:00:30
#
## Command(s) to run:
echo "hello world"
In this and other examples, <account_name> and <partition_name> are placeholders for actual values you will need to provide. See Key Options for Job Submissions, above, for information on what values to use in your own job script files.
See Job submission with specific resource requirements, below, for a set of example job script files, each illustrating how to run a specific type of job.
See Finding Output, below, to learn where output from running your batch jobs can be found. If errors occur when running your batch job, this is the first place to look.
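Putting this together, a typical workflow looks something like the following, where the job ID shown is just illustrative:
[user@ln001 ~]$ sbatch myjob.sh
Submitted batch job 669120
[user@ln001 ~]$ squeue -j 669120        # check the job's status while it is queued or running
[user@ln001 ~]$ cat slurm-669120.out    # view the output once the job has finished
hello world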
Interactive Jobs
In some instances, you may need to use software that requires user interaction, rather than running programs or scripts in batch mode. To do so, you must first start an instance of an interactive shell on a Savio compute node, within which you can then run your software on that node. To run such an interactive job on a compute node, you'll use srun. Here is a basic example that launches an interactive 'bash' shell on that node, and includes the required account and partition options:
[user@ln001 ~]$ srun --pty -A account_name -p partition_name -t 00:00:30 bash -i
Once your job starts, the Linux prompt will change and indicate you are on a compute node rather than a login node:
srun: job 669120 queued and waiting for resources
srun: job 669120 has been allocated resources
[user@n0047 ~]
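Once on the compute node, you can run your software interactively at that prompt just as you would in any shell session; when you are done, typing exit ends the job and returns you to the login node:
[user@n0047 ~]$ # ...run your interactive commands or software here...
[user@n0047 ~]$ exit
[user@ln001 ~]$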
CGRL Jobs
- The settings for a job in Vector (note: you don't need to set the account; a sketch of a full Vector job script follows below):
--partition=vector --qos=vector_batch
- The settings for a job in Rosalind (Savio2 HTC):
--partition=savio2_htc --account=co_rosalind --qos=rosalind_htc2_normal
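For instance, a minimal sketch of a Vector job script using these settings (the job name, time limit, and command are placeholders) might look like:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=vector
#SBATCH --qos=vector_batch
#SBATCH --time=00:30:00
## Command(s) to run:
echo "hello world"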
Array Jobs
Job arrays allow many jobs to be submitted simultaneously with the same requirements. Within each job the environment variable $SLURM_ARRAY_TASK_ID is set and can be used to alter the execution of each job in the array. Note that as is true for regular jobs, each job in the array will be allocated one or more entire nodes (except for the HTC or GPU partitions), so job arrays are not a way to bundle multiple tasks to run on a single node.
By default, output from job arrays is placed in a series of log files named slurm-%A_%a.out, where %A is the overall job ID and %a is the task ID.
For example, the following script would write "I am task 0" to array_job_XXXXXX_task_0.out, "I am task 1" to array_job_XXXXXX_task_1.out, etc, where XXXXXX is the job ID.
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=<account_name>
#SBATCH --partition=savio3_htc
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:00:30
#SBATCH --array=0-31
#SBATCH --output=array_job_%A_task_%a.out
#SBATCH --error=array_job_%A_task_%a.err
## Command(s) to run:
echo "I am task $SLURM_ARRAY_TASK_ID"
When running many similar tasks that each use a small number of resources, e.g., just 2 or 4 cores, an alternative to job arrays is a tool like GNU parallel. Especially large job arrays can have negative impacts on the Slurm scheduler; these can be avoided by submitting fewer, larger jobs that request more resources at once and splitting those resources among the tasks within each job, as sketched below. The two techniques can also be used in conjunction, which can be a useful compromise for adapting and running complex workflows with many tasks.
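As a rough sketch of the GNU parallel approach (the availability of a parallel module and the core counts here are assumptions; adjust them for your partition and workflow), a single job can request several cores and spread many small tasks across them:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=<account_name>
#SBATCH --partition=savio3_htc
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=00:30:00
## Command(s) to run:
module load parallel   # assumes GNU parallel is provided as a module; otherwise use your own installation
# Run tasks 0-31, at most $SLURM_CPUS_PER_TASK at a time, passing each task index to the command
parallel -j "$SLURM_CPUS_PER_TASK" echo "I am task {}" ::: {0..31}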
Low Priority Jobs
On Savio, some jobs can be submitted to a low-priority QoS, savio_lowprio, to help increase active utilization of the entire cluster. This QoS does not get as high a priority as the general QoSs, meaning that the scheduler will take longer to allocate resources to these jobs than to jobs using savio_normal, savio_debug, or the condo QoSs. These jobs are also subject to preemption when the other QoSs become busy. This has two implications:
- When the system is busy, any job submitted with this QoS will remain pending and yield to jobs with higher priorities.
- When the system is busy and higher-priority jobs are pending, the scheduler will preempt jobs running under this lower-priority QoS. At submission time, you can choose whether a preempted job should simply be killed or should be automatically requeued after it is killed. Please note that, since preemption can happen at any time, it is very beneficial if your job is capable of checkpointing/restarting by itself when you choose to requeue it. Otherwise, you may need to verify data integrity manually before running the job again. You can request that a preempted job be automatically resubmitted (placed back in the queue) by including the --requeue flag when submitting your job. We provide an example job script for such jobs (see the sketch below).
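As a rough sketch (not the official example script referenced above), the relevant options for a low-priority, requeue-able condo job might look like the following, with the account, partition, time limit, and command as placeholders:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=<account_name>
#SBATCH --partition=savio3_htc
#SBATCH --qos=savio_lowprio
#SBATCH --requeue
#SBATCH --time=12:00:00
## Command(s) to run (ideally one that can resume from its own checkpoints):
echo "hello world"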
Usage of the savio_lowprio QoS differs between Condo allocations and FCAs. Condo contributors are entitled to use the extra resources that are available on the Savio cluster across all partitions. Condo accounts are automatically subscribed to this QoS when the account is created, so it does not need to be requested explicitly. By using this QoS you are no longer limited by your condo size; instead, you have access to the broader compute resource, limited only by the size of each partition. The per-job limit is 24 nodes and 3 days of runtime. Additionally, these jobs can run on all types and generations of hardware in the Savio cluster, not just those matching the condo contribution. At this time, there is no limit or allowance applied to this QoS, so condo contributors are free to use as much of the resource as they need.
FCA accounts can only submit jobs to the savio_lowprio QoS on the savio3_gpu and savio4_gpu partitions. Similar to the condo job limits, the per-job limit for submission is 24 nodes and 3 days of wall time.
Checkpointing and Restarting your Jobs
Checkpointing is a technique where an application's state is stored in the filesystem, allowing a program to restart computation from this saved state in order to minimize loss of computation time, which in turn makes the job more fault tolerant. This can be useful across a wide variety of cases when submitting expensive jobs via Slurm, as programs can be interrupted when a job reaches its allowed wall clock limit, is preempted, or when software/hardware faults occur. By checkpointing or saving intermediate results frequently, the user won't lose as much work if their jobs terminate prematurely. Resuming a calculation from an intermediate state can also be useful in enabling longer running calculations, e.g., the training of a machine learning model, by extending a computation across a series of jobs via access to a common set of checkpoints.
If you wish to do checkpointing, your first step should always be to check if your application already has such capabilities built-in, as that is the most stable and safe way of doing it. Applications that are known to have some sort of native checkpointing include: Abaqus, Amber, Gaussian, GROMACS, LAMMPS, NAMD, NWChem, Quantum Espresso, STAR-CCM+, and VASP.
If your program does not natively support checkpointing, it may also be worthwhile to consider generic, application-agnostic checkpoint/restart solutions. One example of such a solution is DMTCP: Distributed MultiThreaded CheckPointing. DMTCP is a checkpoint-restart solution that works outside the flow of user applications, enabling their state to be saved without altering the applications themselves. You can find its quick-start documentation here.
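As a minimal sketch of the DMTCP workflow (assuming the dmtcp binaries are available on your path; the program name and checkpoint interval are placeholders), you launch the program under DMTCP so that it periodically writes checkpoint files, and restart from those files if the job is interrupted:
# Launch the application under DMTCP, checkpointing roughly every hour
dmtcp_launch --interval 3600 ./my_program
# After an interruption, restart from the script DMTCP writes alongside its checkpoint files
./dmtcp_restart_script.sh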
Additional Job Submission Options
Here are some additional options that you can incorporate as needed into your own scripts. For the full set of available options, please see the SLURM documentation on the sbatch command.
Finding Output
Output from running a SLURM batch job is, by default, placed in a log file named slurm-%j.out, where the job's ID is substituted for %j; e.g., slurm-478012.out. This file will be created in your current directory, i.e., the directory from within which you entered the sbatch command. Also by default, both command output and error output (to stdout and stderr, respectively) are combined in this file.
To specify alternate files for command and error output use:
- --output: destination file for stdout
- --error: destination file for stderr
Email Options
By specifying your email address, the SLURM scheduler can email you when the status of your job changes. Valid notification types are BEGIN, END, FAIL, REQUEUE, and ALL, and multiple types can be separated by commas.
The required options for email notifications are:
- --mail-type: when you want to be notified
- --mail-user: your email address
Email and output example
Here's a full example showing output and email options.
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=<account_name>
#SBATCH --partition=savio3_htc
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=00:00:30
#SBATCH --output=test_job_%j.out
#SBATCH --error=test_job_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=jessie.doe@berkeley.edu
## Command(s) to run:
echo "hello world"
QoS Options
While your job will use a default QoS, generally savio_normal, you can specify a different QoS, such as the debug QoS for short jobs, using the --qos flag, e.g.,
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=<account_name>
#SBATCH --partition=savio3_htc
#SBATCH --qos=savio_debug
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=00:00:30
## Command(s) to run:
echo "hello world"
Migrating from Other Schedulers to SLURM
We provide some tips on migrating to SLURM for users familiar with Torque/PBS.
For users coming from other schedulers, such as Platform LSF, SGE/OGE, or Load Leveler, please use this link to find a quick reference.
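As a quick, informal comparison (see the references above for authoritative details), some common Torque/PBS commands and their rough Slurm equivalents are:
# Torque/PBS              # Slurm
qsub myjob.sh             sbatch myjob.sh
qsub -I                   srun --pty bash -i
qstat                     squeue
qdel <job_id>             scancel <job_id>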