Monitoring Jobs

Monitoring the status of running batch jobs

To monitor a running job, you need its SLURM job ID, which you can obtain by running:

squeue -u $USER

In the commands below, substitute the job ID for "$your_job_id".
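If you would rather capture the job ID in a shell variable than copy it by hand, a minimal sketch might look like the following. The helper name latest_job_id is hypothetical (it is not a Savio or SLURM command), and it assumes that the highest-numbered job ID is the one you submitted most recently.

```shell
# Sketch: pick the highest-numbered job ID from a list of IDs.
# latest_job_id is a hypothetical helper, not part of SLURM or Savio.
# It expects one job ID per line on stdin, as printed by:
#   squeue -u "$USER" -h -o %i
# (-h drops the header line, -o %i prints only the job ID column).
latest_job_id() {
    # Array tasks look like 1234_5; sort numerically on the part before "_".
    sort -t_ -k1,1n | tail -n 1
}

# On a login node you might then run:
#   your_job_id=$(squeue -u "$USER" -h -o %i | latest_job_id)
#   echo "$your_job_id"
```

If you have several jobs in the queue, check the full squeue listing first to confirm that this is the job you mean.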

If you suspect your job is not running properly, or you simply want to see how much memory or CPU it is actually using on the compute nodes, Savio provides a script, "wwall", to check this.

The following provides a snapshot of the status of the node(s) the job is running on:

wwall -j $your_job_id

while

wwall -j $your_job_id -t

provides a text-based user interface (TUI) for monitoring node status as the job progresses. To exit the TUI, press "q" to return to the command line.

Alternatively, you can log in to the node your job is running on as follows:

srun --jobid=$your_job_id --pty /bin/bash

This runs a shell in the context of your existing job. Once on the node, you can run top, htop, ps, or other tools.
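As a rough illustration of what you might do once you are on the node, the sketch below totals the resident memory of your own processes. The helper name sum_rss_mib is hypothetical, and the total is only approximate, since memory shared between processes is counted once per process.

```shell
# Sketch: approximate total resident memory of your processes on this node.
# sum_rss_mib is a hypothetical helper. It reads RSS values in KiB, one per
# line, as printed by:
#   ps -u "$USER" -o rss=
# (the trailing "=" suppresses the column header).
sum_rss_mib() {
    awk '{ total += $1 } END { printf "%.1f\n", total / 1024 }'
}

# On the compute node you might run:
#   ps -u "$USER" -o rss= | sum_rss_mib
```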

You can also see a "top"-like summary for all nodes by running wwtop from a login node. You can use the page up and down keys to scroll through the nodes to find the node(s) your job is using. All CPU percentages are relative to the total number of cores on the node, so 100% usage would mean that all of the cores are being fully used.
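To make the percentage convention concrete, here is a small arithmetic sketch. The function name node_cpu_percent and the core counts are made up for illustration; substitute the core count of the node type you are using.

```shell
# Sketch: the CPU percentage you would expect to see for a node, given how
# many of its cores are fully busy. node_cpu_percent is a hypothetical helper.
# Usage: node_cpu_percent BUSY_CORES TOTAL_CORES
node_cpu_percent() {
    awk -v busy="$1" -v total="$2" 'BEGIN { printf "%.0f\n", 100 * busy / total }'
}

# A job keeping 4 cores fully busy on a (hypothetical) 20-core node:
node_cpu_percent 4 20   # prints 20
```

So a reading well below 100% does not necessarily indicate a problem; it may simply mean your job uses only some of the cores on the node.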

Checking finished jobs

There are several commands you can use to better understand how a finished job ran.

First of all, you should look for the SLURM output and error files that may have been created in the directory from which you submitted the job. Unless you have specified your own names for these files, they will be named slurm-<jobid>.out and slurm-<jobid>.err.
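A quick way to look at the end of those files might be something like the sketch below. The helper name show_job_output is hypothetical; it assumes the default file names described above and silently skips any file that does not exist.

```shell
# Sketch: print the last lines of a job's default output and error files.
# show_job_output is a hypothetical helper.
# Usage: show_job_output JOBID [DIRECTORY]   (DIRECTORY defaults to ".")
show_job_output() (
    jobid=$1
    dir=${2:-.}
    for f in "$dir/slurm-$jobid.out" "$dir/slurm-$jobid.err"; do
        if [ -f "$f" ]; then
            echo "== $f =="
            tail -n 20 "$f"
        fi
    done
)

# From the directory where you submitted the job, you might run:
#   show_job_output "$your_job_id"
```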

After a job has completed (or been terminated/cancelled), you can review the maximum memory used via the sacct command:

sacct -j <JOB_ID> --format=JobID,JobName,MaxRSS,Elapsed

MaxRSS shows the maximum amount of memory used, in kilobytes. It is reported separately for each job step, so look at the largest value across the steps listed for your job.
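If you want a single number rather than reading the per-step values by eye, a sketch along these lines could help. The helper name max_rss_mib is hypothetical. It assumes pipe-separated sacct output with JobID and MaxRSS as the two columns, and that the K, M, and G suffixes are 1024-based; values with any other suffix are ignored.

```shell
# Sketch: largest MaxRSS across all steps of a job, converted to MiB.
# max_rss_mib is a hypothetical helper. It reads lines of the form
#   JobID|MaxRSS
# as printed by:
#   sacct -j <JOB_ID> --noheader --parsable2 --format=JobID,MaxRSS
max_rss_mib() {
    awk -F'|' '
        $2 != "" {
            value = $2 + 0                      # numeric part, e.g. 2048 from "2048K"
            unit = substr($2, length($2))       # last character, e.g. "K"
            if (unit == "K") mib = value / 1024
            else if (unit == "M") mib = value
            else if (unit == "G") mib = value * 1024
            else next                           # unknown suffix: skip this line
            if (mib > max) max = mib
        }
        END { printf "%.1f\n", max }
    '
}

# For a finished job you might run:
#   sacct -j <JOB_ID> --noheader --parsable2 --format=JobID,MaxRSS | max_rss_mib
```

Comparing this figure with the memory available on the node gives a sense of whether the job was close to running out of memory.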

You can check all the jobs that you ran within a time window as follows:

sacct -u <your_user_name> --starttime=2019-09-27 --endtime=2019-10-04 \
   --format JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode,Start,End,NodeList
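When the time window contains many jobs, it can be easier to summarize them than to read the full table. The sketch below counts jobs by final state; the helper name count_by_state is hypothetical, and it assumes pipe-separated sacct output with JobID and State as the two columns.

```shell
# Sketch: count jobs by final state.
# count_by_state is a hypothetical helper. It reads lines of the form
#   JobID|State
# as printed by (dates here are the same example window as above):
#   sacct -u <your_user_name> --starttime=2019-09-27 --endtime=2019-10-04 \
#      --allocations --noheader --parsable2 --format=JobID,State
# (--allocations lists one line per job rather than one per job step).
count_by_state() {
    awk -F'|' '
        {
            # States such as "CANCELLED by 1001" are reduced to their first word.
            split($2, words, " ")
            counts[words[1]]++
        }
        END { for (s in counts) print s, counts[s] }
    ' | sort
}
```

A large number of FAILED or TIMEOUT entries is usually a good reason to inspect the individual jobs more closely.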

Please see man sacct for a list of the output columns you can request, or consult the SLURM documentation for the sacct command.