
When Will My Job Start?

Overview

Do you want to know why your job is not running, when it might start, or what you might do to get it to start more quickly? You'll find some tips and tricks here. If you're looking for more general information on how Slurm works, or about how job prioritization on Savio works, you can find that in our Slurm Scheduler Overview.

Why is my job not running?

There are a variety of reasons your job may not be running. Some of them may prevent your job from ever running. Others are simply reasons why your job is queued and waiting for resources on the system to become available.

We have developed a helper tool called sq that will try to provide user-friendly information on why a job is not (yet) running:

module load sq
sq

By default, sq will show your pending jobs (or recent jobs if there are no pending jobs) with warning/error messages about potentially problematic situations. If there is a potential problem with a job, it will also suggest a solution.

If you want to see both current and past jobs at the same time, you can use the -a flag. The -q flag silences any error messages so you only see the list of jobs. Other command-line options for controlling which jobs to display are described by sq --help.
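
For example, once the module is loaded as above:

sq -a        # show both current and past jobs
sq -q        # silence warning/error messages and just list the jobs
sq --help    # list the remaining command-line options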

If you have encountered a case where sq does not adequately explain the problem, please consider opening an issue so that we can improve it!


Some of the reasons sq may report for your job not having started include the following.

Reasons your job might not ever start:

  • You submitted to a partition or QoS that you don't have access to, or requested a number of nodes that you don't have access to.
  • QOSMaxWallDurationPerJobLimit: You submitted with a time limit that is longer than the maximum time possible for a job in a given QoS.
  • QOSMaxNodePerJobLimit: You requested more nodes than allowed for jobs in a given QoS.
  • QOSMinCpuNotSatisfied: You did not request a sufficient number of CPUs for the GPU(s) requested.
  • QOSMinGRES: You requested an invalid GPU type or did not request a specific GPU type, as required in savio3_gpu and savio4_gpu, or you didn't specify the QoS, as required for FCA usage of A40 and V100 GPUs in savio3_gpu.
  • AssocGrpCPUMinutesLimit: Your FCA does not have enough service units, possibly because the FCA was not renewed at the start of the yearly cycle in June.
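
If you suspect one of these hard limits, Slurm's sacctmgr command can show which partitions and QoSes your account can use, and the limits attached to a particular QoS. This is only a sketch: <qos_name> is a placeholder, and the exact format fields available may vary with the Slurm version.

sacctmgr show association user=$USER format=Account,Partition,QOS%40
sacctmgr show qos <qos_name> format=Name,MaxWall,MaxTRES%30,MaxTRESPU%30,MinTRES%30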

Reasons your job might not have started yet:

  • ReqNodeNotAvail, Reserved for Maintenance: There may be an upcoming downtime on Savio that overlaps with the time it would take to complete your job.
  • QOSGrpCpuLimit or QOSGrpNodeLimit: For condo users, other users in the group may be using the entire allotment of nodes in the condo. For FCA users, in the small partitions where total FCA usage is capped at less than the full partition, the total number of cores or nodes used by FCA jobs may be at that limit.
  • Resources or Priority: There may not be free nodes available (or free cores for partitions allocated per core) at the moment.
  • QOSMaxCpuPerUserLimit: A partition might have a limit on the resources any user can use at a single time (across all the user's jobs). Specifically, on savio4_gpu, FCA users are limited to at most 16 CPUs at any given time. Given the 4:1 CPU:GPU ratio for the A5000 GPUs in this partition, that corresponds also to a limit of 4 GPUs.
  • Your job might have started and exited quickly (perhaps because of an error).
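
For two of these cases you can check directly with Slurm: scontrol lists any upcoming maintenance reservations, and sacct shows whether a job in fact started and exited quickly (<jobid> is a placeholder for your job's ID).

scontrol show reservation
sacct -j <jobid> --format=JobID,State,ExitCode,Elapsed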

You can also use Slurm's commands, such as squeue, to try to understand why your job hasn't started (focusing on the "NODELIST(REASON)" column), but in many cases it can be difficult to interpret the output. In addition to the information above, the REASON codes are explained in man squeue.
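
For example, to list your own pending jobs with a wider reason column than squeue shows by default (the format string here is just one possibility):

squeue -u $USER --state=PD -o "%.12i %.12P %.10T %.40R"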

When will my job start?

You can look at the Slurm queue to get a sense for how many other jobs are pending in the relevant partition and where your job is in the queue.

squeue -p <partition_name> --state=PD -l

If you are using a condo, you can check how many other jobs are pending under the condo QoS:

squeue -q <condo_qos> --state=PD

You can see the jobs run by other users in your group by specifying the account name:

squeue -A <account>

You can also ask Slurm to estimate the start time:

squeue -j <jobid> --start

This is only an estimate. Slurm bases this on the time limits provided for all jobs, but in most cases these will not be the actual run times of the jobs.
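
If you want to see how a pending job's priority breaks down into its individual factors (such as fair-share and age), Slurm's sprio command reports them; which factors appear depends on how the cluster's multifactor priority plugin is configured.

sprio -j <jobid>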

What can I do to get my job to start more quickly?

There are a few things you might be able to do to get your job to start faster.

  • Shorten the time limit on your job, if possible. This may allow the scheduler to fit your job into a time window while it is trying to make room for a larger job (using Slurm's backfill functionality); see the example after this list.
  • Request fewer nodes (or fewer cores on partitions scheduled by core), if possible. As with a shorter time limit, this makes it easier for the backfill scheduler to fit your job into an available window.
  • If you are using an FCA, but you have access to a condo, you might submit to the condo, as condos get higher priority access to a pool of nodes equivalent to those nodes purchased by the condo.
  • If you are using a condo and your fellow group members are using the entire pool of condo nodes, you might submit to an FCA instead. The following command may be useful to assess usage within the condo:
    squeue -q <condo_qos>
    
  • Submit to a partition that is less used. You can use the following command to see how many (if any) idle nodes there are. NOTE: idle nodes may not be available for your job if Slurm is reserving them to accommodate another (multi-node) job.
    sinfo -p <partition_name>
    sinfo -p <partition_name> --state=idle
    
  • Wait to submit if you or fellow FCA users in your group have submitted many jobs recently under an FCA. Because past usage is downweighted over time, your priority for any new jobs you submit will increase as the days go by. However, your priority is also affected by other usage in the FCA group, so if other users in your group continue to use the FCA heavily, your priority may not increase.
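
As an example of the first suggestion above, you can request a shorter time limit in your job script, and regular users can usually also reduce (but not increase) the time limit of a job that is already queued. The two-hour value and job ID below are placeholders.

# in your job script:
#SBATCH --time=02:00:00

# for a job that is already in the queue:
scontrol update JobId=<jobid> TimeLimit=02:00:00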