Overview

To submit and run jobs, cancel jobs, and check the status of jobs on the Savio cluster, you’ll use the Simple Linux Utility for Resource Management (SLURM), an open-source resource manager and job scheduling system. (SLURM manages jobs, job steps, nodes, partitions (groups of nodes), and other entities on the cluster.)

There are several basic SLURM commands you’ll likely use often:

  • sbatch - Submit a job to the batch queue system, e.g., sbatch myjob.sh, where myjob.sh is a SLURM job script. (On this page, you can find both a simple, introductory example of a job script, as well as many other examples of job scripts for specific types of jobs you might run. Adapting and modifying one of these examples is the quickest way to get started with running batch jobs.)
  • srun - Submit an interactive job to the batch queue system
  • scancel - Cancel a job, e.g., scancel 123, where 123 is a job ID
  • squeue - Check the current jobs in the batch queue system, e.g., squeue -u $USER to view your own jobs
  • sinfo - View the status of the cluster's compute nodes, including how many nodes - of what types - are currently available for running jobs.
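
For example, a typical sequence of these commands on a login node might look like the following (the job ID and partition name are placeholders; substitute your own):

sbatch myjob.sh     # submit a batch job; SLURM prints "Submitted batch job <jobid>"
squeue -u $USER     # list your pending and running jobs
scancel 123         # cancel job 123 (use your own job ID)
sinfo -p savio2     # show node status for the savio2 partition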

Charges for running jobs

When running your SLURM batch or interactive jobs on the Savio cluster under a Faculty Computing Allowance account (i.e. a scheduler account whose name begins with fc_), your usage of computational time is tracked (in effect, “charged” for, although no costs are incurred) via abstract measurement units called “Service Units.” (Please see Service Units on Savio for a description of how this usage is calculated.) When all of the Service Units provided under an Allowance have been exhausted, no more jobs can be run under that account. To check your usage or total usage under an FCA, please use our check_usage.sh script.
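
For instance, you can run the script from a login node as follows (fc_yourgroup is a placeholder account name; the -a flag, if available in the installed version of the script, reports usage for an entire account rather than just your own):

check_usage.sh                  # summarize your own usage
check_usage.sh -a fc_yourgroup  # summarize usage for a whole FCA account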

Please also note that, when running jobs on many of Savio’s pools of compute nodes, you are provided with exclusive access to those nodes, and thus are “charged” for using all of that node’s cores. For example, if you run a job for one hour on a standard 24-core compute node on the savio2 partition, your job will always be charged for using 24 core hours, even if it requires just a single core or a few cores. (For more details, including information on ways you can most efficiently use your computational time on the cluster, please see the Scheduling Nodes v. Cores section of Service Units on Savio.)

Usage tracking does not affect jobs run under a Condo account (i.e. a scheduler account whose name begins with co_), which has no Service Unit-based limits.

Key options to set when submitting your jobs

When submitting a job, the three key options required are the account you are submitting under, the partition you are submitting to, and a maximum time limit for your job. Under some circumstances, a Quality of Service (QoS) is also needed.

  • Account: Each job must be submitted as part of an account, which determines which resources you have access to and how your use will be charged. Note that this account name, which you will use in your SLURM job script files, is different from your Linux account name on the cluster. For instance, for a hypothetical example user lee who has access to both the physics condo and to a Faculty Computing Allowance, their available accounts for running jobs on the cluster might be named co_physics and fc_lee, respectively. (See below for a command that you can run to find out what account name(s) you can use in your own job script files.) Users with access to only a single account (often an FCA) do not need to specify an account.
  • Partition: Each job must be submitted to a particular partition, which is a collection of similar or identical machines that your job will run on. The different partitions on Savio include older and newer generations of standard compute nodes, "big memory" nodes, nodes with Graphics Processing Units (GPUs), etc. (See below for a command that you can run to find out what partitions you can use in your own job script files.) For most users, the combination of the account and partition they choose will determine the constraints set on their job, such as job size limit, time limit, etc. Jobs submitted within a partition will be allocated to that partition's set of compute nodes based on the scheduling policy, until all resources within that partition are exhausted.
  • QoS: A QoS is a classification that determines what kind of resources your job can use. For most users, your use of a given account and partition implies a particular QoS, and therefore most users do not need to specify a QoS for standard computational jobs. However there are circumstances where a user would specify the QoS. For instance, there is a QoS option that you can select for running test jobs when you're debugging your code, which further constrains the resources available to your job and thus may reduce its cost. As well, Condo users can select a "lowprio" QoS which can make use of unused resources on the cluster, in exchange for these jobs being subject to termination when needed, in order to free resources for higher priority jobs. (See below for a command that you can run to find out what QoS options you can use in your own job script files.)
  • Time: A maximum time limit for the job is required under all conditions. When running your job under a QoS that does not have a time limit (such as jobs submitted by the users of some of the cluster's Condos under their priority access QoS), you can specify a sufficiently long time limit value, but this parameter should not be omitted. Jobs submitted without providing a time limit will be rejected by the scheduler.

You can view the accounts you have access to, partitions you can use, and the QoS options available to you using the sacctmgr command:

sacctmgr -p show associations user=$USER

This will return output such as the following for a hypothetical example user lee who has access to both the physics condo and to a Faculty Computing Allowance. Each line of this output indicates a specific combination of an account, a partition, and QoSes that you can use in a job script file, when submitting any individual batch job:

Cluster|Account|User|Partition|...|QOS|Def QOS|GrpTRESRunMins|
brc|co_physics|lee|savio2_1080ti|...|savio_lowprio|savio_lowprio||
brc|co_physics|lee|savio2_knl|...|savio_lowprio|savio_lowprio||
brc|co_physics|lee|savio2_bigmem|...|savio_lowprio|savio_lowprio||
brc|co_physics|lee|savio2_gpu|...|savio_lowprio|savio_lowprio||
brc|co_physics|lee|savio2_htc|...|savio_lowprio|savio_lowprio||
brc|co_physics|lee|savio_bigmem|...|savio_lowprio|savio_lowprio||
brc|co_physics|lee|savio2|...|physics_savio2_normal,savio_lowprio|physics_savio2_normal||
brc|co_physics|lee|savio|...|savio_lowprio|savio_lowprio||
brc|fc_lee|lee|savio2_1080ti|...|savio_debug,savio_normal|savio_normal||
brc|fc_lee|lee|savio2_knl|...|savio_debug,savio_normal|savio_normal||
brc|fc_lee|lee|savio2_bigmem|...|savio_debug,savio_normal|savio_normal||
brc|fc_lee|lee|savio2_gpu|...|savio_debug,savio_normal|savio_normal||
brc|fc_lee|lee|savio2_htc|...|savio_debug,savio_normal|savio_normal||
brc|fc_lee|lee|savio_bigmem|...|savio_debug,savio_normal|savio_normal||
brc|fc_lee|lee|savio2|...|savio_debug,savio_normal|savio_normal||
brc|fc_lee|lee|savio|...|savio_debug,savio_normal|savio_normal||

The Account, Partition, and QOS fields indicate which partitions and QoSes you have access to under each of your account(s). The Def QoS field identifies the default QoS that will be used if you do not explicitly identify a QoS when submitting a job. Thus as per the example above, if the user lee submitted a batch job using their fc_lee account, they could submit their job to either the savio2_gpu, savio2_htc, savio2_bigmem, savio2, savio, or savio_bigmem partitions. (And when doing so, they could also choose either the savio_debug or savio_normal QoS, with a default of savio_normal if no QoS was specified.)

If you are running your job in a condo, be sure to note which of the line(s) of output associated with the condo account (those beginning with "co_") have a lowprio Def QoS and which have a normal Def QoS. The lines with a normal QoS (such as the co_physics line for the savio2 partition in the example above, whose Def QoS is physics_savio2_normal) are the ones to which you have priority access, while those with a lowprio QoS are the ones to which you have only low-priority access. Thus, in the above example, the user lee should select the co_physics account and the savio2 partition when they want to run jobs with normal priority, using the resources available via their condo membership.

You can find more details on the hardware specifications for the machines in the various partitions here for the Savio and CGRL (Vector/Rosalind) clusters.

You can find more details on each partition and the QoS available in those partitions here for the Savio and CGRL (Vector/Rosalind) clusters.

A standard fair-share policy with a decay half-life of 14 days (2 weeks) is enforced.

Low priority jobs via a condo account

As a condo contributor, you are entitled to use the extra resources that are available on the Savio cluster (across all partitions). This is done through a low-priority QoS, "savio_lowprio", to which your account is automatically subscribed when the account is created; you do not need to request it explicitly. When using this QoS you are no longer limited by your condo size: you have access to the broader compute resource, limited only by the size of each partition. The per-job limit is 24 nodes and a runtime of 3 days. Additionally, these jobs can run on all types and generations of hardware in the Savio cluster, not just those matching the condo contribution. At this time, there is no limit or allowance applied to this QoS, so condo contributors are free to use as much of this resource as they need.

However, this QoS does not receive as high a priority as the general QoSes (such as "savio_normal" and "savio_debug") or the condo QoSes, and it is subject to preemption when the other QoSes become busy. This has two implications:

  1. When the system is busy, any job submitted with this QoS will remain pending and yield to other jobs with higher priorities.
  2. When the system is busy and higher-priority jobs are pending, the scheduler will preempt jobs running under this lower-priority QoS. At submission time, you can choose whether a preempted job should simply be killed or should be automatically requeued after it is killed. Please note that, since preemption can happen at any time, it is very beneficial if your job can checkpoint/restart by itself when you choose to requeue it. Otherwise, you may need to verify data integrity manually before running the job again.

We provide an example job script for such jobs; a minimal sketch follows.
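
Here is a minimal sketch of such a script (the account and partition names are placeholders; the --requeue line is optional and only worthwhile if your application can checkpoint and restart itself):

#!/bin/bash
# Job name:
#SBATCH --job-name=lowprio_test
#
# Account (a condo account, i.e., one beginning with co_):
#SBATCH --account=co_account_name
#
# Partition:
#SBATCH --partition=partition_name
#
# Quality of Service:
#SBATCH --qos=savio_lowprio
#
# Automatically requeue this job if it is preempted (omit if your code cannot checkpoint/restart):
#SBATCH --requeue
#
# Wall clock limit:
#SBATCH --time=00:30:00
#
## Command(s) to run:
echo "hello world"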

Long-running jobs via an FCA

Most jobs running under an FCA have a maximum time limit of 72 hours (three days). However, users can run jobs using a small number of cores in the long queue, via the savio2_htc partition and the savio_long QoS.

A given job in the long queue can use no more than 4 cores and run for at most 10 days. Collectively, across the entire Savio cluster, at most 24 cores are available for long-running jobs, so your job may sit in the queue for a while before it starts.

We provide an example job script for such jobs; a minimal sketch follows.
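
Here is a minimal sketch of such a script (the account name is a placeholder; the time limit uses SLURM's days-hours:minutes:seconds format to request the 10-day maximum):

#!/bin/bash
# Job name:
#SBATCH --job-name=long_test
#
# Account (an FCA, i.e., one beginning with fc_):
#SBATCH --account=fc_account_name
#
# Partition:
#SBATCH --partition=savio2_htc
#
# Quality of Service:
#SBATCH --qos=savio_long
#
# Number of cores (at most 4 in the long queue):
#SBATCH --cpus-per-task=4
#
# Wall clock limit (up to 10 days):
#SBATCH --time=10-00:00:00
#
## Command(s) to run:
echo "hello world"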

Submitting your job

This section provides an overview of how to run your jobs in batch (i.e., non-interactive or background) mode and in interactive mode.

In addition to the key options of account, partition, and QoS, your job script files can also contain options to request various numbers of cores, nodes, and/or computational tasks. And there are a variety of additional options you can specify in your batch files, if desired, such as email notification options when a job has completed. These are all described further below.

Batch jobs

When you want to run one of your jobs in batch (i.e. non-interactive or background) mode, you’ll enter an sbatch command. As part of that command, you will also specify the name of, or filesystem path to, a SLURM job script file; e.g., sbatch myjob.sh

A job script specifies where and how you want to run your job on the cluster, and ends with the actual command(s) needed to run your job. The job script file looks much like a standard shell script (.sh) file, but also includes one or more lines that specify options for the SLURM scheduler; e.g.

#SBATCH --some-option-here

Although these lines start with hash signs (#), and thus are regarded as comments by the shell, they are nonetheless read and interpreted by the SLURM scheduler.

Here is a minimal example of a job script file that includes the required account, partition, and time options, as well as a QoS specification. It will run unattended for up to 30 seconds on one of the compute nodes in the partition_name partition, and will simply print out the words "hello world":

#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Account:
#SBATCH --account=account_name
#
# Partition:
#SBATCH --partition=partition_name
#
# Quality of Service:
#SBATCH --qos=qos_name
#
# Wall clock limit:
#SBATCH --time=00:00:30
#
## Command(s) to run:
echo "hello world"

In this and other examples, account_name, partition_name, and qos_name are placeholders for actual values you will need to provide. See Key options to set, above, for information on what values to use in your own job script files.

See Job submission with specific resource requirements, below, for a set of example job script files, each illustrating how to run a specific type of job.

See Finding output to learn where output from running your batch jobs can be found. If errors occur when running your batch job, this is the first place to look for these.

Interactive jobs

In some instances, you may need to use software that requires user interaction, rather than running programs or scripts in batch mode. To do so, you must first start an instance of an interactive shell on a Savio compute node, within which you can then run your software on that node. To run such an interactive job on a compute node, you'll use srun. Here is a basic example that launches an interactive bash shell on a compute node and includes the required account, partition, and time options:

[user@ln001 ~]$ srun --pty -A account_name -p partition_name -t 00:00:30 bash -i

Once your job starts, the Linux prompt will change and indicate you are on a compute node rather than a login node:

srun: job 669120 queued and waiting for resources
srun: job 669120 has been allocated resources
[user@n0047 ~]$
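
When you are finished, exit the shell to end the interactive job and release the node:

[user@n0047 ~]$ exit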

CGRL jobs

  • The settings for a job in Vector (note: you don't need to set the "account"; see the sketch after this list): --partition=vector --qos=vector_batch
  • The settings for a job in Rosalind (Savio1): --partition=savio --account=co_rosalind --qos=rosalind_savio_normal
  • The settings for a job in Rosalind (Savio2 HTC): --partition=savio2_htc --account=co_rosalind --qos=rosalind_htc2_normal
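
As an illustration, here is a minimal sketch of a job script using the Vector settings above (the job name, time limit, and command are placeholders):

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=vector
#SBATCH --qos=vector_batch
#SBATCH --time=00:30:00
## Command(s) to run:
echo "hello world"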

Job submission with specific resource requirements

This section details options for specifying the resource requirements for your jobs. We also provide a variety of example job scripts for setting up parallelization, low-priority jobs, jobs using fewer cores than available on a node, and long-running FCA jobs.

Remember that nodes are assigned for exclusive access by your job, except in the “savio2_htc” and “savio2_gpu” partitions. So, if possible, you generally want to set SLURM options and write your code to use all the available resources on the nodes assigned to your job (e.g., 20 cores and 64 GB memory per node in the “savio” partition).

Memory available

Also note that in all partitions except for GPU and HTC partitions, by default the full memory on the node(s) will be available to your job. On the GPU and HTC partitions you get an amount of memory proportional to the number of cores your job requests relative to the number of cores on the node. For example, if the node has 64 GB and 8 cores, and you request 2 cores, you’ll have access to 16 GB memory. If you need more memory than that, you should request additional cores. Please do not request memory using the memory-related flags available for sbatch and srun.
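
For instance, continuing the hypothetical example above (a node with 64 GB and 8 cores), a job needing roughly 32 GB would request 4 of the node's 8 cores rather than using a memory flag:

# Request more cores to get access to more memory on HTC/GPU partitions
# (here, 4 of the hypothetical node's 8 cores -> about half of its 64 GB):
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4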

Parallelization

When submitting parallel code, you usually need to specify the number of tasks, nodes, and CPUs to be used by your job in various ways. For any given use case, there are generally multiple ways to set the options to achieve the same effect; these examples try to illustrate what we consider to be best practices.

The key options for parallelization are:

  • --nodes (or -N): indicates the number of nodes to use
  • --ntasks-per-node: indicates the number of tasks (i.e., processes one wants to run on each node)
  • --cpus-per-task (or -c): indicates the number of cpus to be used for each task

In addition, in some cases it can make sense to use the --ntasks (or -n) option to indicate the total number of tasks and let the scheduler determine how many nodes and tasks per node are needed. In general --cpus-per-task will be 1 except when running threaded code.

Note that if the various options are not set, SLURM will in some cases infer what the value of the option needs to be given other options that are set and in other cases will treat the value as being 1. So some of the options set in the example below are not strictly necessary, but we give them explicitly to be clear.

Here’s an example script that requests an entire Savio node and specifies 20 cores per task.

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=account_name
#SBATCH --partition=savio
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --time=00:00:30
## Command(s) to run:
echo "hello world"

Only the partition, time, and account flags are required. And, strictly speaking, if you do not specify an account, your default account will be used, so in some cases that flag can be omitted.
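
As a further illustration, here is a hedged sketch of a job that specifies only the total number of tasks and lets the scheduler decide how many nodes to use; the module name and the mpirun launcher are assumptions, so substitute whatever your application actually requires:

#!/bin/bash
#SBATCH --job-name=test_mpi
#SBATCH --account=account_name
#SBATCH --partition=savio2
#SBATCH --ntasks=48           # total tasks; the scheduler determines the node count
#SBATCH --cpus-per-task=1
#SBATCH --time=00:30:00
## Command(s) to run:
module load openmpi           # assumed module name; check `module avail`
mpirun ./my_mpi_program       # hypothetical executable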

Additional job submission options

Here are some additional options that you can incorporate as needed into your own scripts. For the full set of available options, please see the SLURM documentation on the sbatch command.

Output options

Output from running a SLURM batch job is, by default, placed in a log file named slurm-%j.out, where the job's ID is substituted for %j; e.g., slurm-478012.out. This file will be created in your current directory, i.e., the directory from which you entered the sbatch command. Also by default, both command output and error output (to stdout and stderr, respectively) are combined in this file.

To specify alternate files for command and error output use:

  • --output: destination file for stdout
  • --error: destination file for stderr

Email Notifications

By specifying your email address, you can have the SLURM scheduler email you when the status of your job changes. Valid values for the notification type are BEGIN, END, FAIL, REQUEUE, and ALL, and multiple values can be separated by commas.

The required options for email notifications are:

  • --mail-type: when you want to be notified
  • --mail-user: your email address

Submission with output and email options

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=account_name
#SBATCH --partition=savio
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --time=00:00:30
#SBATCH --output=test_job_%j.out
#SBATCH --error=test_job_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=jessie.doe@berkeley.edu
## Command(s) to run:
echo "hello world"
QoS options

While your job will use a default QoS, generally savio_normal, you can specify a different QoS, such as the debug QoS for short jobs, using the --qos flag, e.g.,

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=account_name
#SBATCH --partition=savio
#SBATCH --qos=savio_debug
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --time=00:00:30
## Command(s) to run:
echo "hello world"

Bundling tasks into a single job to use all the cores on a node

Many users have multiple jobs that each use only a single core or a small number of cores and therefore cannot take advantage of all the cores on a Savio node. There are two tools that allow one to automate the parallelization of such jobs, in particular allowing one to group jobs into a single SLURM submission to take advantage of the multiple cores on a given Savio node.

For this purpose, we recommend the use of the community-supported GNU parallel tool. One can instead use Savio’s HT Helper tool, but for users not already familiar with either tool, we recommend GNU parallel.
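
As a rough sketch (assuming GNU parallel is available on your path, e.g., via a module, and that task_list.txt is a hypothetical file containing one command per line), a bundled job might look like this:

#!/bin/bash
#SBATCH --job-name=bundle
#SBATCH --account=account_name
#SBATCH --partition=savio2
#SBATCH --nodes=1
#SBATCH --time=01:00:00
## Command(s) to run:
# You may first need to load a module providing GNU parallel; check `module avail`.
# Run one command per line of task_list.txt, keeping as many tasks running
# concurrently as there are cores on the node.
parallel --jobs "$SLURM_CPUS_ON_NODE" < task_list.txt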

Job arrays

Job arrays allow many jobs to be submitted simultaneously with the same requirements. Within each job the environment variable $SLURM_ARRAY_TASK_ID is set and can be used to alter the execution of each job in the array. Note that as is true for regular jobs, each job in the array will be allocated one or more entire nodes (except for the HTC or GPU partitions), so job arrays are not a way to bundle multiple tasks to run on a single node.

By default, output from job arrays is placed in a series of log files named slurm-%A_%a.out, where %A is the overall job ID and %a is the task ID.

For example, the following script would write “I am task 0” to array_job_XXXXXX_task_0.out, “I am task 1” to array_job_XXXXXX_task_1.out, etc, where XXXXXX is the job ID.

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=account_name
#SBATCH --partition=savio
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --time=00:00:30
#SBATCH --array=0-31
#SBATCH --output=array_job_%A_task_%a.out
#SBATCH --error=array_job_%A_task_%a.err
## Command(s) to run:
echo "I am task $SLURM_ARRAY_TASK_ID"

Checkpointing/Restarting your jobs

Checkpointing/restarting is the technique whereby an application's state is stored in the filesystem, allowing the user to restart computation from this saved state and thereby minimize the loss of computation time, for example when the job reaches its allowed wall clock limit, is preempted, or encounters software/hardware faults. By checkpointing or saving intermediate results frequently, you won't lose as much work if your jobs are preempted or otherwise terminated prematurely.

If you wish to do checkpointing, your first step should always be to check if your application already has such capabilities built-in, as that is the most stable and safe way of doing it. Applications that are known to have some sort of native checkpointing include: Abaqus, Amber, Gaussian, GROMACS, LAMMPS, NAMD, NWChem, Quantum Espresso, STAR-CCM+, and VASP.

If your program does not natively support checkpointing, it may also be worthwhile to consider generic, application-agnostic checkpoint/restart solutions. One example is DMTCP (Distributed MultiThreaded CheckPointing), a checkpoint-restart solution that works outside the flow of user applications, enabling their state to be saved without modifying the applications themselves. You can find its reference quick-start documentation here.

Monitoring the status of running batch jobs

To monitor a running job, you need to know the SLURM job ID of that job, which can be obtained by running

squeue -u $USER

In the commands below, substitute the job ID for “$your_job_id”.

If you suspect your job is not running properly, or you simply want to understand how much memory or how much CPU the job is actually using on the compute nodes, Savio provides a script “wwall” to check that.

The following provides a snapshot of the status of the node(s) that the job is running on:

wwall -j $your_job_id

while

wwall -j $your_job_id -t

provides a text-based user interface (TUI) for monitoring the status of those nodes as the job progresses. To exit the TUI, enter "q" to quit out of the interface and return to the command line.

Alternatively, you can login to the node your job is running on as follows:

srun --jobid=$your_job_id --pty /bin/bash

This runs a shell in the context of your existing job. Once on the node, you can run top, htop, ps, or other tools.

You can also see a “top”-like summary for all nodes by running wwtop from a login node. You can use the page up and down keys to scroll through the nodes to find the node(s) your job is using. All CPU percentages are relative to the total number of cores on the node, so 100% usage would mean that all of the cores are being fully used.

Checking finished jobs

There are several commands you can use to better understand how a finished job ran.

First of all, you should look for the SLURM output and error files that may be created in the directory from which you submitted the job. Unless you have specified your own names for these files, they will be named slurm-<jobid>.out (and slurm-<jobid>.err if you directed error output to a separate file).

After a job has completed (or been terminated/cancelled), you can review the maximum memory used via the sacct command.

sacct -j <JOB_ID> --format=JobID,JobName,MaxRSS,Elapsed

MaxRSS will show the maximum amount of memory that the job used in kilobytes.

You can check all the jobs that you ran within a time window as follows:

sacct -u <your_user_name> --starttime=2019-09-27 --endtime=2019-10-04 \
   --format JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode,Start,End,NodeList

Please see man sacct for a list of the output columns you can request, as well as the SLURM documentation for the sacct command here.

Migrating from other schedulers to SLURM

We provide some tips on migrating to SLURM for users familiar with Torque/PBS.
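
For quick reference, a few common Torque/PBS commands and their SLURM equivalents:

qsub job.sh      ->  sbatch job.sh
qstat -u $USER   ->  squeue -u $USER
qdel <jobid>     ->  scancel <jobid>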

For users coming from other schedulers, such as Platform LSF, SGE/OGE, or LoadLeveler, please use this link to find a quick reference.
