# Monitor Slurm job states

> View real-time and historic information about Slurm jobs

Slurm job states track the lifecycle of a specific workload initiated by a user in the cluster. This page explains how to monitor those job states so that you can confirm a workload is progressing as expected, diagnose why a job is waiting, or review the resource usage of a job that's already finished.

Slurm job states are distinct from [Slurm node states](/products/sunk/manage_sunk/slurm-node-states), which describe the availability and health of the hardware used to support the Slurm cluster.

[CoreWeave Grafana](/observability/managed-grafana#sunk) includes dashboards for tracking metrics related to job performance and resource usage within your SUNK cluster.

Slurm also provides commands for tracking job states and steps. The following table summarizes when to use each one:

| Command    | Data type | Purpose                                                  |
| ---------- | --------- | -------------------------------------------------------- |
| `squeue`   | Real-time | Overview of jobs in queue or currently running           |
| `scontrol` | Real-time | Detailed, live metadata about a specific job             |
| `sacct`    | Historic  | Accounting and history; resource usage of completed jobs |

<Tip>
  Run all Slurm commands, including `squeue`, `scontrol`, and `sacct`, from within the Slurm login pod shell.
</Tip>

## View a summary of current jobs with `squeue`

Use the `squeue` command to view real-time status information about Slurm jobs and job steps. `squeue` provides an overview of jobs currently running or in queue. For more detailed real-time information about a particular job, use `scontrol`.

Run the `squeue` command by itself to display the status of every active job in the system:

```bash theme={"system"}
squeue
```

To filter for specific information, add flags to the `squeue` command. The following examples show common filters.

To display details about a specific job, use the `-j` flag:

```bash theme={"system"}
squeue -j [JOB-ID]
```

To view information about jobs in specific states, use the `-t` flag. Enter one or more [Slurm job state codes](#slurm-job-state-codes), using either the full state name or its shorthand, as a comma-separated list:

```bash theme={"system"}
squeue -t [STATE]
```

To view jobs started by a specific user, use the `-u` flag:

```bash theme={"system"}
squeue -u [USERNAME]
```

By default, `squeue` lists jobs but not job steps. To display job steps, use the `-s` flag:

```bash theme={"system"}
squeue -s
```
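
The filters above can also be combined in one call. The following is a minimal sketch rather than a recommended workflow: `my_jobs` is a hypothetical helper name, and the column widths in the format string are arbitrary choices. The `-o` (`--format`) flag and its `%` field codes are standard `squeue` options.

```bash theme={"system"}
# Hypothetical helper: list one user's jobs in the given states with a single squeue call.
# Usage: my_jobs [USERNAME] [STATES]
my_jobs() {
  local user="${1:-$USER}"
  local states="${2:-PENDING,RUNNING}"
  # %i = job ID, %j = job name, %T = full state name,
  # %M = time used, %R = node list (running) or reason (pending)
  squeue -u "$user" -t "$states" -o "%.12i %.20j %.10T %.10M %R"
}
```

For example, `my_jobs "$USER" PD` lists only your own pending jobs.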

### Determine the reason a job is `PENDING`

Submitting a Slurm job is a request for resource allocation, not an immediate execution. Until the requested resources are available, the job remains in a `PENDING` state. This is expected behavior and is part of the Slurm job lifecycle.

The default `squeue` output includes a `NODELIST(REASON)` column. For a job that isn't running yet, this column shows the reason the job is waiting.

In a healthy Slurm queue, these reasons are common:

| Listed reason       | Meaning                                                                                                               |
| ------------------- | --------------------------------------------------------------------------------------------------------------------- |
| `Resources`         | The cluster doesn't currently have the resources available to execute the job.                                        |
| `Priority`          | Higher-priority jobs are ahead of this job in the queue. This job is scheduled after them.                            |
| `Dependency`        | The job was submitted with the `--dependency` flag, and the job it depends on hasn't yet reached the required state.  |
| `JobArrayTaskLimit` | The job array has reached its limit on the number of simultaneously running tasks.                                    |

<Note>
  Running `squeue` sends a remote procedure call to `slurmctld`. To avoid overloading the Slurm Controller and impacting performance, limit `squeue` calls to the minimum necessary.
</Note>
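
When many jobs are waiting, a per-reason count can be easier to read than the full queue listing. The sketch below, which assumes a hypothetical helper named `count_reasons`, gets every pending reason from a single `squeue` call instead of querying jobs one at a time, in keeping with the preceding note. The `-h` (no header) and `-o "%r"` (reason only) flags are standard `squeue` options.

```bash theme={"system"}
# Hypothetical helper: read one pending reason per line on stdin and
# print "count reason" pairs, most common reason first.
count_reasons() {
  awk '{ n[$0]++ } END { for (r in n) print n[r], r }' | sort -k1,1nr -k2
}

# One squeue call covers every pending job in the queue.
# The guard lets the snippet load cleanly on machines without Slurm installed.
if command -v squeue >/dev/null 2>&1; then
  squeue -h -t PENDING -o "%r" | count_reasons
fi
```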

## View detailed information about a specific Slurm job with `scontrol`

The `scontrol` command provides a more detailed view of a particular job than the summary offered by `squeue`. Like `squeue`, `scontrol` displays live, real-time data, and it can also display individual job steps.

To view detailed information about a specific job, use `scontrol show job` with the relevant `[JOB-ID]`:

```bash theme={"system"}
scontrol show job [JOB-ID]
```

To view granular details about a specific job step, use `scontrol show step` with the relevant `[JOB-ID]` and `[STEP-ID]`:

```bash theme={"system"}
scontrol show step [JOB-ID].[STEP-ID]
```
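
`scontrol show job` prints its metadata as `Key=Value` pairs, which makes individual fields straightforward to extract in a script. The following sketch assumes a hypothetical helper named `job_field` and a hypothetical `JOB_ID` environment variable. It splits on whitespace, so it isn't suitable for values that contain spaces.

```bash theme={"system"}
# Hypothetical helper: print the value of one Key=Value field from
# `scontrol show job` output supplied on stdin.
# Usage: scontrol show job 1234 | job_field JobState
job_field() {
  local key="$1"
  # Put each whitespace-separated Key=Value token on its own line,
  # then print the value of the first token whose key matches.
  tr -s ' \t' '\n\n' |
    awk -F= -v k="$key" '!found && $1 == k { sub(/^[^=]*=/, ""); print; found = 1 }'
}

# Runs only when scontrol is installed and JOB_ID (an assumed variable) is set.
if command -v scontrol >/dev/null 2>&1 && [ -n "${JOB_ID:-}" ]; then
  scontrol show job "$JOB_ID" | job_field JobState
fi
```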

## View information about completed Slurm jobs with `sacct`

To view information about completed jobs, including job steps and resource usage, use the `sacct` command. Unlike `squeue` and `scontrol`, which report live state, `sacct` reads from the Slurm accounting records, so it can report on jobs that are no longer running or in queue. The output of `sacct` displays job steps by default.

For a comprehensive view of job steps for a given job, use the `sacct` command with the `-j` flag:

```bash theme={"system"}
sacct -j [JOB-ID]
```

Because `sacct` lists job steps by default, no extra flag is needed to see them. To show only the job allocation and hide the individual steps, add the `-X` (`--allocations`) flag to the preceding command:

```bash theme={"system"}
sacct -j [JOB-ID] -X
```

Use the `--format` flag with `sacct` to select which fields appear in the output:

```bash theme={"system"}
sacct --format=JobID,JobName,State,ExitCode
```
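
For scripting, pipe-delimited output is easier to parse than the default aligned columns. The sketch below assumes a hypothetical helper named `not_completed` and a hypothetical `JOB_ID` environment variable. It uses the standard `-n` (`--noheader`) and `-P` (`--parsable2`) flags to print the rows for a job and its steps whose state is anything other than `COMPLETED`.

```bash theme={"system"}
# Hypothetical helper: from pipe-delimited sacct rows on stdin
# (JobID|JobName|State|ExitCode), print the job ID, state, and exit code
# of every row whose state is not COMPLETED.
not_completed() {
  awk -F'|' '$3 != "COMPLETED" { print $1, $3, $4 }'
}

# Runs only when sacct is installed and JOB_ID (an assumed variable) is set.
if command -v sacct >/dev/null 2>&1 && [ -n "${JOB_ID:-}" ]; then
  sacct -j "$JOB_ID" -n -P --format=JobID,JobName,State,ExitCode | not_completed
fi
```

Rows for steps that are still running also appear in this output, because the filter only excludes `COMPLETED`.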

## Slurm job state codes

Use this reference to interpret the job states reported by the preceding commands. Each state has a full name and a shorthand code. By default, `squeue` shows the shorthand code in its `ST` column, and `sacct` shows the full name in its `State` column.

| Job state     | Shorthand | Meaning                                                                                                                                         |
| ------------- | --------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| `PENDING`     | `PD`      | The job is waiting for resource allocation. Use `squeue` to [view the reason the job is in this state](#determine-the-reason-a-job-is-pending). |
| `RUNNING`     | `R`       | The job has an allocation of nodes and is currently executing job steps.                                                                        |
| `CONFIGURING` | `CF`      | Resources have been allocated, but the nodes are still being configured.                                                                        |
| `COMPLETING`  | `CG`      | The job is finishing. Cleanup scripts are executing and nodes are returning to the nodepool.                                                    |
| `COMPLETED`   | `CD`      | The job finished successfully with an exit code of `0`.                                                                                         |
| `FAILED`      | `F`       | The job terminated with a failure condition, or non-zero exit code.                                                                             |
| `TIMEOUT`     | `TO`      | The job terminated after reaching its requested time limit.                                                                                     |
| `NODE_FAIL`   | `NF`      | The job terminated after one or more of the nodes running it crashed or became unresponsive.                                                    |
| `PREEMPTED`   | `PR`      | The job was removed from its allocated nodes to make room for a higher-priority job.                                                            |

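Scripts that react to job states often only need to know whether a job is still in progress. The following sketch assumes a hypothetical helper named `job_phase`, and the grouping into `active` and `ended` is this example's own simplification, not a Slurm concept. It covers only the states in the preceding table. Slurm defines other states, such as `CANCELLED`, which this helper reports as `unknown`.

```bash theme={"system"}
# Hypothetical helper: classify a job state (full name or shorthand) from
# the table above as "active" (waiting, running, or cleaning up) or
# "ended" (no longer running). Any other value is reported as "unknown".
job_phase() {
  case "$1" in
    PD|PENDING|CF|CONFIGURING|R|RUNNING|CG|COMPLETING)
      echo "active" ;;
    CD|COMPLETED|F|FAILED|TO|TIMEOUT|NF|NODE_FAIL|PR|PREEMPTED)
      echo "ended" ;;
    *)
      echo "unknown" ;;
  esac
}
```

For example, `job_phase PD` prints `active`, and `job_phase TIMEOUT` prints `ended`.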