Slurm job states track the lifecycle of a specific workload initiated by a user in the cluster. Slurm job states are distinct from Slurm node states, which describe the availability and health of the hardware used to support the Slurm cluster. CoreWeave Grafana includes a variety of dashboards for tracking metrics related to job performance and resource usage within your SUNK cluster. Slurm also provides commands for tracking job states and steps, including:
| Command | Data type | Purpose |
|---|---|---|
| squeue | Real-time | Overview of jobs in queue or currently running |
| scontrol | Real-time | Detailed, live metadata about a specific job |
| sacct | Historic | Accounting and history; resource usage of completed jobs |
View a summary of current jobs with squeue
Use the squeue command to view real-time status information about Slurm jobs and job steps. squeue provides an overview of jobs currently running or in queue. For more detailed real-time information about a particular job, use scontrol.
Run the squeue command by itself to display the status of every active job in the system:
To display details about a specific job, use the -j flag:
To filter jobs by state, use squeue with the -t flag. Enter Slurm job state codes as a comma-separated list.
To display jobs that belong to a specific user, use the -u flag:
squeue may omit job steps to provide a cleaner output. To display job steps with squeue, use the -s flag:
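As a rough illustration of how the flags above combine, the following Python sketch assembles a squeue argument list. The helper name `build_squeue_args` and its parameters are hypothetical and exist only for this example; only the `-j`, `-t`, `-u`, and `-s` flags come from the text above.

```python
from typing import List, Optional, Sequence


def build_squeue_args(
    job_id: Optional[str] = None,
    states: Sequence[str] = (),
    user: Optional[str] = None,
    show_steps: bool = False,
) -> List[str]:
    """Return an argv list for squeue (hypothetical helper, not part of Slurm)."""
    args = ["squeue"]
    if job_id is not None:
        args += ["-j", str(job_id)]        # limit output to one job
    if states:
        args += ["-t", ",".join(states)]   # comma-separated state codes
    if user is not None:
        args += ["-u", user]               # limit output to one user's jobs
    if show_steps:
        args.append("-s")                  # include job steps
    return args


# Pending and running jobs for a made-up user named "alice".
pending_or_running = build_squeue_args(states=["PD", "R"], user="alice")
print(" ".join(pending_or_running))  # squeue -t PD,R -u alice
```

On a machine with Slurm installed, the resulting list could be passed to `subprocess.run(args, capture_output=True, text=True)` to execute the command.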
Determine the reason a job is PENDING
Running a Slurm job is a request for resource allocation, not an instantaneous execution. Until the requested resources are available, the job will remain in a PENDING state. This is expected behavior and is part of the Slurm job lifecycle.
The squeue command includes a REASON column that provides insight into why a job is in a given state.
In a healthy Slurm queue, these reasons are common:
| Listed reason | Meaning |
|---|---|
| Resources | The cluster does not currently have the resources available to execute the job. |
| Priority | Other jobs with a higher priority are queued ahead of this job. When the higher-priority jobs are complete, this job will run. |
| Dependency | Occurs when the --dependency flag is in use, and the job being waited on has not yet reached the required state. |
| JobArrayTaskLimit | You have reached the maximum number of simultaneously running tasks allowed for a single job array. |
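To summarize why jobs are waiting, one option is to request only the job ID and reason from squeue and tally the results. The sketch below assumes the invocation `squeue -h -t PD -o "%i|%r"` (no header, pending jobs only, job ID and reason separated by a pipe). The function name and the sample text are made up for illustration; the sample is not real cluster output.

```python
from collections import Counter
from typing import Dict

# Assumed invocation: no header (-h), pending jobs only (-t PD),
# and a custom format (-o) that prints "<job id>|<reason>".
SQUEUE_PENDING_ARGS = ["squeue", "-h", "-t", "PD", "-o", "%i|%r"]


def parse_pending_reasons(output: str) -> Dict[str, str]:
    """Map each job ID to its pending reason (hypothetical helper)."""
    reasons = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id, _, reason = line.partition("|")
        reasons[job_id] = reason
    return reasons


# Hand-written sample text standing in for real squeue output.
sample_output = "1001|Resources\n1002|Priority\n1003|Priority\n1004|Dependency\n"
reasons = parse_pending_reasons(sample_output)
summary = Counter(reasons.values())
print(summary)
```

Collecting every pending reason in a single call keeps the number of requests to the controller low compared with querying each job separately.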
Running squeue sends a remote procedure call to slurmctld. To avoid overloading the Slurm Controller and impacting performance, limit squeue calls to the minimum necessary.
View detailed information about a specific Slurm job with scontrol
The scontrol command provides a more detailed view of a particular job than the summary offered by squeue. Like squeue, scontrol displays live, real-time data and can be configured to display job steps.
To view detailed information about a specific job, use scontrol show job with the relevant [JOB-ID]:
To view detailed information about a specific job step, use scontrol show step with the relevant [JOB-ID] and [STEP-ID]:
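The output of scontrol show job is a series of `Key=Value` pairs, which makes it straightforward to read in a script. The following sketch splits that output into a dictionary. The function name is hypothetical, and the sample text is a hand-written, heavily abbreviated stand-in for real output.

```python
from typing import Dict


def parse_scontrol_fields(output: str) -> Dict[str, str]:
    """Split whitespace-separated Key=Value tokens into a dict.

    Simplification: values that contain spaces are not handled.
    """
    fields = {}
    for token in output.split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not Key=Value pairs
            fields[key] = value
    return fields


# Hand-written, abbreviated sample; real output has many more fields.
sample = """JobId=1001 JobName=train
   UserId=alice(1000) GroupId=alice(1000)
   JobState=PENDING Reason=Resources Dependency=(null)
   NumNodes=2 NumCPUs=64
"""
job = parse_scontrol_fields(sample)
print(job["JobState"], job["Reason"])  # PENDING Resources
```

Because the parser splits on whitespace, it is only a starting point: fields whose values contain spaces would need more careful handling.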
View information about completed Slurm jobs with sacct
To view information about completed jobs, including job steps and resource usage, use the sacct command. Unlike squeue and scontrol, sacct displays data about jobs that are not currently running or in queue. The output of sacct displays job steps by default.
For a comprehensive view of job steps for a given job, use the sacct command with the -j flag:
Add the --showsteps flag to the above command to explicitly list individual steps in the output:
Use the --format flag with sacct to format the output in a more readable manner:
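For scripted use, sacct output is easier to handle when it is pipe-delimited. The sketch below combines `-j` and `--format` with two further sacct options, `-P` (parsable, pipe-delimited output) and `-n` (no header), then separates job steps from the parent job. The helper names, the chosen field list, and the sample rows are assumptions made for illustration, not real accounting data.

```python
from typing import Dict, List

# Fields chosen for this example; sacct supports many others.
FIELDS = ["JobID", "JobName", "State", "ExitCode", "Elapsed"]


def build_sacct_args(job_id: str) -> List[str]:
    """Return an argv list for sacct (hypothetical helper)."""
    return [
        "sacct",
        "-j", str(job_id),
        "--format=" + ",".join(FIELDS),
        "-P",  # pipe-delimited output
        "-n",  # omit the header row
    ]


def parse_sacct_rows(output: str) -> List[Dict[str, str]]:
    """Turn pipe-delimited sacct lines into one dict per row."""
    rows = []
    for line in output.splitlines():
        if line.strip():
            rows.append(dict(zip(FIELDS, line.split("|"))))
    return rows


# Hand-written sample rows: a parent job followed by two job steps.
sample = (
    "1001|train|COMPLETED|0:0|01:12:03\n"
    "1001.batch|batch|COMPLETED|0:0|01:12:03\n"
    "1001.0|python|COMPLETED|0:0|01:11:58\n"
)
rows = parse_sacct_rows(sample)
# Job steps carry a ".<step>" suffix on the job ID.
steps = [row for row in rows if "." in row["JobID"]]
print([row["JobID"] for row in steps])  # ['1001.batch', '1001.0']
```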
Slurm job state codes
Slurm job states are designated by codes, viewable in the STATE or ST columns of the squeue or sacct command output.
| Job state | Shorthand | Meaning |
|---|---|---|
| PENDING | PD | The job is waiting for resource allocation. Use squeue to view the reason the job is in this state. |
| RUNNING | R | The job has an allocation of nodes and is currently executing job steps. |
| CONFIGURING | CF | Resources have been allocated, but the nodes are still being configured. |
| COMPLETING | CG | The job is finishing. Cleanup scripts are executing and nodes are being released back to the nodepool. |
| COMPLETED | CD | The job finished successfully with an exit code of 0. |
| FAILED | F | The job terminated with a failure condition, or non-zero exit code. |
| TIMEOUT | TO | The job terminated after reaching its requested time limit. |
| NODE_FAIL | NF | The job terminated after one or more of the nodes running it crashed or became unresponsive. |
| PREEMPTED | PR | The job was removed from its allocated nodes to make room for a higher-priority job. |
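When scripting against these codes, a small lookup table can translate between the shorthand and the full state name. The sketch below covers only the states in the table above. The helper names are hypothetical, and the grouping of states into "finished" is an assumption for illustration; for example, a preempted job may be requeued depending on cluster configuration.

```python
from typing import Dict

# Shorthand-to-name mapping, limited to the states listed in the table.
JOB_STATES: Dict[str, str] = {
    "PD": "PENDING",
    "R": "RUNNING",
    "CF": "CONFIGURING",
    "CG": "COMPLETING",
    "CD": "COMPLETED",
    "F": "FAILED",
    "TO": "TIMEOUT",
    "NF": "NODE_FAIL",
    "PR": "PREEMPTED",
}

# Assumption for this example: states treated as no longer active.
FINISHED_STATES = {"COMPLETED", "FAILED", "TIMEOUT", "NODE_FAIL", "PREEMPTED"}


def normalize_state(value: str) -> str:
    """Accept a shorthand code or a full name and return the full name."""
    value = value.strip().upper()
    if value in JOB_STATES:
        return JOB_STATES[value]
    if value in JOB_STATES.values():
        return value
    raise ValueError(f"Unknown job state: {value!r}")


def is_finished(value: str) -> bool:
    """Return True if the state is in the assumed finished group."""
    return normalize_state(value) in FINISHED_STATES


print(normalize_state("pd"))  # PENDING
print(is_finished("TO"))      # True
print(is_finished("CG"))      # False
```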