Slurm job states track the lifecycle of a specific workload initiated by a user in the cluster. Slurm job states are distinct from Slurm node states, which describe the availability and health of the hardware used to support the Slurm cluster. CoreWeave Grafana includes a variety of dashboards for tracking metrics related to job performance and resource usage within your SUNK cluster. Slurm also provides commands for tracking job states and steps, including:
| Command | Data type | Purpose |
| --- | --- | --- |
| squeue | Real-time | Overview of jobs in queue or currently running |
| scontrol | Real-time | Detailed, live metadata about a specific job |
| sacct | Historical | Accounting and history; resource usage of completed jobs |
Run all Slurm commands, including squeue, scontrol, and sacct, from within the Slurm login pod shell.

View a summary of current jobs with squeue

Use the squeue command to view real-time status information about Slurm jobs and job steps. squeue provides an overview of jobs currently running or in queue. For more detailed real-time information about a particular job, use scontrol. Run the squeue command by itself to display the status of every active job in the system:
squeue
To filter for specific information, add flags to the squeue command. To display details about a specific job, use the -j flag:
squeue -j [JOB-ID]
To view jobs in one or more specific states, use squeue with the -t flag. Enter the Slurm job state names or codes as a comma-separated list, such as PENDING,RUNNING or PD,R:
squeue -t [STATE]
To view jobs started by a specific user, use the -u flag:
squeue -u [USERNAME]
By default, squeue lists jobs but not their job steps. To display job steps, use the -s flag:
squeue -s
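The filters above can be combined in a single squeue call and summarized locally. The following is a minimal sketch, not an official tool: count_job_states is a hypothetical helper name, and the usage line assumes the standard squeue options -h (suppress the header) and -o with the %T field (full state name).

```shell
# count_job_states is a hypothetical helper, not part of Slurm.
# It reads one state name per line on stdin and prints "STATE COUNT" pairs.
count_job_states() {
  sort | uniq -c | awk '{ print $2, $1 }'
}

# Intended use from the Slurm login pod (a single squeue call):
#   squeue -h -u "$USER" -t PENDING,RUNNING -o "%T" | count_job_states
```

Because the helper only reads standard input, it can be tried on saved squeue output without contacting the controller.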

Determine the reason a job is PENDING

Submitting a Slurm job is a request for resource allocation, not an instantaneous execution. Until the requested resources are available, the job remains in the PENDING state. This is expected behavior and is part of the Slurm job lifecycle. The default squeue output includes a NODELIST(REASON) column that shows why a pending job has not started yet. In a healthy Slurm queue, these reasons are common:
| Listed reason | Meaning |
| --- | --- |
| Resources | The cluster does not currently have the resources available to execute the job. |
| Priority | Other jobs with a higher priority are ahead of this job in the queue. When the higher-priority jobs have been scheduled, this job will run. |
| Dependency | Occurs when the --dependency flag is in use, and the job being waited on has not yet reached the required state. |
| JobArrayTaskLimit | You have reached the maximum number of simultaneously running tasks allowed for a single job array. |
Running squeue sends a remote procedure call to slurmctld. To avoid overloading the Slurm Controller and impacting performance, limit squeue calls to the minimum necessary.
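One way to follow this advice is to call squeue once and process its output locally. The sketch below is illustrative only: explain_reason and annotate_pending are hypothetical helper names, the explanations paraphrase the table above, and the usage line assumes the standard squeue format fields %i (job ID) and %r (reason).

```shell
# explain_reason and annotate_pending are hypothetical helpers, not part of Slurm.
# Map a pending reason to a short explanation.
explain_reason() {
  case "$1" in
    Resources)  echo "waiting for resources to become available" ;;
    Priority)   echo "waiting behind higher-priority jobs" ;;
    Dependency) echo "waiting on a job named in --dependency" ;;
    *)          echo "not covered here; see the squeue man page" ;;
  esac
}

# Read "JOBID REASON" lines on stdin and append the explanation to each.
annotate_pending() {
  while read -r jobid reason; do
    printf '%s %s: %s\n' "$jobid" "$reason" "$(explain_reason "$reason")"
  done
}

# Intended use from the Slurm login pod (one squeue call, processed locally):
#   squeue -h -t PENDING -o "%i %r" | annotate_pending
```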

View detailed information about a specific Slurm job with scontrol

The scontrol command provides a more detailed view of a particular job than the summary offered by squeue. Like squeue, scontrol displays live, real-time data, and can display job steps. To view detailed information about a specific job, use scontrol show job with the relevant [JOB-ID]:
scontrol show job [JOB-ID]
To view granular details about a specific job step, use scontrol show step with the relevant [JOB-ID] and [STEP-ID]:
scontrol show step [JOB-ID].[STEP-ID]
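The scontrol output is a series of Key=Value pairs, so a single field can be pulled out with standard text tools. The following is a rough sketch under that assumption: job_field is a hypothetical helper name, and it does not handle values that contain spaces (for example, a job name with spaces).

```shell
# job_field is a hypothetical helper, not part of Slurm.
# Usage: job_field KEY   (reads scontrol "Key=Value" output on stdin)
# Prints the value of the first matching key.
job_field() {
  tr -s ' \t' '\n' |
    awk -F= -v key="$1" '$1 == key { sub(/^[^=]*=/, ""); print; exit }'
}

# Intended use from the Slurm login pod:
#   scontrol show job [JOB-ID] | job_field JobState
#   scontrol show job [JOB-ID] | job_field Reason
```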

View information about completed Slurm jobs with sacct

To view information about completed jobs, including job steps and resource usage, use the sacct command. Unlike squeue and scontrol, which report live data, sacct reads from the Slurm accounting records, so it can report on jobs that have already finished. The output of sacct includes job steps by default. To view a given job and all of its steps, use the sacct command with the -j flag:
sacct -j [JOB-ID]
Because sacct lists steps by default, no extra flag is needed to show them. To show only the job allocation and hide the individual steps, add the -X (--allocations) flag:
sacct -j [JOB-ID] -X
Use the --format flag with sacct to format the output in a more readable manner:
sacct --format=JobID,JobName,State,ExitCode
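For scripting, the same fields can be requested in a delimiter-separated form and filtered locally. The sketch below is illustrative: unsuccessful_jobs is a hypothetical helper name, and the usage line assumes the standard sacct options -n (no header), -P (pipe-delimited output), and -X (one line per job, without steps).

```shell
# unsuccessful_jobs is a hypothetical helper, not part of Slurm.
# It reads "JobID|JobName|State|ExitCode" lines on stdin and prints the
# job ID, state, and exit code of jobs that ended in a state other than COMPLETED.
# Jobs that are still PENDING or RUNNING are skipped.
unsuccessful_jobs() {
  awk -F'|' '$3 != "COMPLETED" && $3 != "RUNNING" && $3 != "PENDING" {
    print $1, $3, $4
  }'
}

# Intended use from the Slurm login pod:
#   sacct -X -n -P --format=JobID,JobName,State,ExitCode | unsuccessful_jobs
```

The ExitCode field has the form exit-code:signal, so a value such as 1:0 means the job exited with code 1 and was not ended by a signal.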

Slurm job state codes

Slurm job states are designated by codes, shown in the ST (shorthand) or STATE (full name) column of squeue output and in the State column of sacct output.
| Job state | Shorthand | Meaning |
| --- | --- | --- |
| PENDING | PD | The job is waiting for resource allocation. Use squeue to view the reason the job is in this state. |
| RUNNING | R | The job has an allocation of nodes and is currently executing job steps. |
| CONFIGURING | CF | Resources have been allocated, but the nodes are still being configured. |
| COMPLETING | CG | The job is finishing. Cleanup scripts are executing and nodes are being released back to the nodepool. |
| COMPLETED | CD | The job finished successfully with an exit code of 0. |
| FAILED | F | The job terminated with a failure condition, or non-zero exit code. |
| TIMEOUT | TO | The job terminated after reaching its requested time limit. |
| NODE_FAIL | NF | The job terminated after one or more of the nodes running it crashed or became unresponsive. |
| PREEMPTED | PR | The job was removed from its allocated nodes to make room for a higher-priority job. |
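When a script reads the shorthand codes, it can help to expand them to the full state names. The following is a small sketch that covers only the states in the table above: state_name is a hypothetical helper name, and any code it does not recognize is printed unchanged.

```shell
# state_name is a hypothetical helper, not part of Slurm.
# Expand a shorthand job state code to its full name.
state_name() {
  case "$1" in
    PD) echo "PENDING" ;;
    R)  echo "RUNNING" ;;
    CF) echo "CONFIGURING" ;;
    CG) echo "COMPLETING" ;;
    CD) echo "COMPLETED" ;;
    F)  echo "FAILED" ;;
    TO) echo "TIMEOUT" ;;
    NF) echo "NODE_FAIL" ;;
    PR) echo "PREEMPTED" ;;
    *)  echo "$1" ;;  # unknown code: pass it through unchanged
  esac
}

# Example: state_name PD
```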
Last modified on April 17, 2026