Monitoring the status of running batch jobs
To monitor a running job, you need to know the SLURM job ID of that job, which can be obtained by running
squeue -u $USER
In the commands below, substitute your job's ID wherever "$your_job_id" appears.
Monitoring the job from a login node
If you suspect your job is not running properly, or you simply want to understand how much memory or CPU the job is actually using on the compute nodes, Savio provides a script, wwall, to check this.
The following provides a snapshot of the status of the node(s) that the job is running on:
wwall -j $your_job_id
wwall -j $your_job_id -t
provides a text-based user interface (TUI) for monitoring node status as the job progresses. To exit the TUI, press "q"; you will be returned to the command line.
You can also see a "top"-like summary for all nodes by running wwtop from a login node. Use the page up and page down keys to scroll through the nodes to find the node(s) your job is using. All CPU percentages are relative to the total number of cores on the node, so 100% usage would mean that all of the cores are being fully used.
Monitoring the job by logging into the compute node
Alternatively, you can log in to the node your job is running on as follows:
srun --jobid=$your_job_id --pty /bin/bash
This runs a shell in the context of your existing job. Once on the node, you can run ps or other tools.
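For example, once you have a shell on the node, the following (standard Linux ps options, nothing Savio-specific) lists your own processes with their CPU and memory usage:

```shell
# List this user's processes with CPU and memory usage (RSS in kilobytes),
# sorted so the largest memory consumers appear first.
ps -u "$USER" -o pid,pcpu,pmem,rss,comm --sort=-rss
```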
If you're running a multi-node job, the commands above will get you onto the first node, from which you can ssh to the other nodes if desired. You can determine the other nodes based on the SLURM_NODELIST environment variable.
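For example, inside the job's shell, scontrol can expand the compressed node list into one hostname per line. A sketch (the example node names and the `hostname` command are purely illustrative):

```shell
# Expand SLURM_NODELIST (e.g. "n0[000-002].savio3") into individual hostnames,
# then run a command on each node; replace `hostname` with whatever check you need.
for node in $(scontrol show hostnames "$SLURM_NODELIST"); do
    ssh "$node" hostname
done
```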
Checking finished jobs
There are several commands you can use to better understand how a finished job ran.
First of all, you should look for the SLURM output and error files that may be created in the directory from which you submitted the job. Unless you have specified your own names for these files, they will be named based on the job ID (by default, SLURM writes both output and error to a file named slurm-<jobid>.out).
After a job has completed (or been terminated/cancelled), you can review the maximum memory used via the sacct command.
sacct -j <JOB_ID> --format=JobID,JobName,MaxRSS,Elapsed
MaxRSS will show the maximum amount of memory that the job used in kilobytes. Unfortunately that only counts the memory used by the processes started from your job script, not by subprocesses. So, for example, if you are doing parallelization in Python or R that starts worker processes, the memory use of those workers will not be included.
You can check all the jobs that you ran within a given time window as follows:
sacct -u <your_user_name> --starttime=2019-09-27 --endtime=2019-10-04 \
  --format=JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode,Start,End,NodeList
See man sacct for a list of the output columns you can request, as well as the SLURM documentation for the sacct command.
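If you want to post-process accounting data in a script, sacct's --parsable2 option prints pipe-delimited records with a header row and no trailing separator. A minimal parsing sketch (the sample text below is illustrative, not real accounting output):

```python
import csv
import io

# Illustrative sample of `sacct --parsable2 --format=JobID,JobName,MaxRSS,Elapsed`
# output; in practice you would capture this from running sacct on the cluster.
sample = """JobID|JobName|MaxRSS|Elapsed
12345|myjob||01:02:03
12345.batch|batch|1234567K|01:02:03
"""

# --parsable2 output is effectively a pipe-delimited CSV file,
# so csv.DictReader can turn each record into a dict keyed by column name.
rows = list(csv.DictReader(io.StringIO(sample), delimiter="|"))

for row in rows:
    # The parent job row may leave MaxRSS empty; the .batch step carries it.
    print(row["JobID"], row["MaxRSS"] or "-")
```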
You can also check finished jobs and monitor the status of running jobs by visiting the MyBRC User Portal in your web browser. On the Home ("Welcome") page, click on "Jobs" at the top of the page. This will take you to the "Job List" page, where you can fill out a search form to view jobs belonging to you and to projects in which you are a PI or manager. To view all jobs, select "Show All Jobs" in the search form. Note that the information visible to you may depend on whether you are a regular user or a PI/project manager, as different users have different visibility permissions. Users can view SLURM job info such as the SLURM ID of the job, its status, the partition it is running on, its submission date, and the number of Service Units it has consumed. To view additional details (such as the start and end dates of the job, the nodes it has run on, Quality of Service, number of CPUs, CPU time, etc.), click on the SLURM ID of a given job in the list.