> ## Documentation Index
> Fetch the complete documentation index at: https://docs.coreweave.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Slurm Job Metrics

> View detailed metrics about a Slurm job with Grafana

<Info>
  For instructions about accessing CoreWeave Grafana dashboards, see [Access and use CoreWeave Grafana dashboards](/observability/managed-grafana/access).
</Info>

The **Slurm Job Metrics** dashboard displays detailed information about the performance and hardware utilization of a selected Slurm job. Use this page to understand which panels the dashboard provides, how to filter the data it shows, and how to interpret the metrics when investigating job performance.

To open the dashboard, go to the [Slurm Job Metrics dashboard](https://cks-grafana.coreweave.com/d/slurm-job-metrics/slurm-job-metrics).

You can use this dashboard to:

* Monitor the GPU and CPU utilization of a given Slurm job.
* Track the rate of filesystem operations related to the Slurm job.
* View Node conditions and alerts that may impact the performance of the Slurm job.

The following sections describe the filters available at the top of the dashboard and the panels grouped under each section.

## Filters and parameters

Use these filters at the top of the page to choose the data you want to view:

| Field                       | Value                                                                                            |
| --------------------------- | ------------------------------------------------------------------------------------------------ |
| **Data Source**             | The Prometheus data source selector.                                                             |
| **Logs Source**             | The Loki logs source selector.                                                                   |
| **Org**                     | The organization that owns the cluster.                                                          |
| **Cluster**                 | The specific Kubernetes cluster in the organization to view.                                     |
| **Namespace**               | The Kubernetes namespace where the Slurm cluster is located.                                     |
| **Slurm Cluster**           | The Slurm cluster containing the job to view.                                                    |
| **Job Name**                | The name of the Slurm job to view. Select "All" to view all the Slurm jobs in the Slurm Cluster. |
| **Job ID**                  | The Slurm Job ID to view. Select "All" to view all the Slurm jobs in the Slurm Cluster.          |
| **Node Conditions**         | Toggled on by default. Click to disable.                                                         |
| **Pending Node Alerts**     | Toggled on by default. Click to disable.                                                         |
| **Firing Node Alerts**      | Toggled off by default. Click to enable.                                                         |
| **InfiniBand Fabric Flaps** | Toggled off by default. Click to enable.                                                         |

The dashboard also includes buttons with links to the **SLURM / Namespace** dashboard and `slurmctld` logs.

Set the time range and refresh rate parameters at the top-right of the page. The default time range is 5 minutes, and the default refresh rate is 1 minute.

## Panel descriptions

The following sections describe each group of panels on the dashboard, starting with **Job Info** and continuing through **GPU Metrics**, **Filesystem**, and **Node Resources**.

### Job Info

The **Job Info** section displays identifying information about the selected Slurm job, including:

| Panel                  | Displays                                                                                                                                                                                                                                                                 |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Job Id**             | The Slurm job ID.                                                                                                                                                                                                                                                        |
| **Name**               | The name of the Slurm job.                                                                                                                                                                                                                                               |
| **Last State**         | Displays the most recently reported [state of the Slurm job](/observability/managed-grafana).                                                                                                                                                                            |
| **User**               | The user who created the Slurm job.                                                                                                                                                                                                                                      |
| **Account**            | The user account running the Slurm job.                                                                                                                                                                                                                                  |
| **Nodes**              | The number of CPUs allocated for the Slurm job.                                                                                                                                                                                                                          |
| **Partition**          | The partition where the selected job is running.                                                                                                                                                                                                                         |
| **Job State Timeline** | A chart that shows the [Slurm job's states over time](/observability/managed-grafana).                                                                                                                                                                                   |
| **Uptime**             | The Slurm job uptime in seconds.                                                                                                                                                                                                                                         |
| **Active GPUs**        | The number of GPUs allocated to the Slurm job that are currently running.                                                                                                                                                                                                |
| **Job Efficiency**     | Indicates how active the GPUs were while working on the selected job. This value is estimated based on idle time, defined as a node with at least 1 GPU under 50% utilization. The estimate excludes restarts and checkpointing. This is not a Model FLOPS (MFU) metric. |
| **Current FP8 Flops**  | Graphs the Compute rate over the job run. The graph typically displays peaks and valleys in a regular pattern. Points with lower Compute values often correspond with loading data or saving checkpoints. This number assumes FP8 training.                              |
| **Node conditions**    | Lists nodes that have been assigned a condition, along with timestamps indicating when the condition was noticed and addressed.                                                                                                                                          |
| **Alerts**             | Displays all alerts with a severity level higher than **Informational**, over a 10-minute interval. Alerts may have a `PENDING` or `FIRING` state.                                                                                                                       |
| **Nodes (Range)**      | Lists all nodes along with their state, uptime, and a link to their associated `slurmd` logs.                                                                                                                                                                            |

#### Estimate model flop utilization (MFU) from tensor core utilization

Use the **Tensor Core Utilization Running Jobs** panel in the **GPU Metrics** section to get a rough estimate of MFU.

<img src="https://mintcdn.com/coreweave-dbfa0e8d/EysR2Tmy2So1ZVEh/observability/managed-grafana/_media/tensor-core.png?fit=max&auto=format&n=EysR2Tmy2So1ZVEh&q=85&s=3b89cc365cdd0c0a58fa7bf9d55c29f3" alt="Graph showing tensor core utilization." width="2255" height="404" data-path="observability/managed-grafana/_media/tensor-core.png" />

Although tensor core utilization is a hardware-level metric and MFU is an application-level metric, the **Tensor Core Utilization Running Jobs** panel acts as an upper bound for MFU. Your model's actual mathematical efficiency can't exceed the percentage of time the hardware's tensor cores are actively processing instructions. Use tensor core utilization as a hard upper bound and diagnostic resource.

The **Current FP8 FLOPS** panel isn't a good indicator of actual model FLOPS. It provides an idealized estimate based on tensor core utilization across all GPUs in the job. This metric assumes a specific scenario: an H100 GPU running FP8 operations with structured sparsity, where 100% tensor core utilization corresponds to 1979 TFLOPS per GPU.

#### Job Info: Job State Timeline and Last State

The **Job Info** section contains panels that display information about the state of the selected Slurm job. **Last State** displays the most recent reported state, while the **Job State Timeline** displays the job's status over time.

<img src="https://mintcdn.com/coreweave-dbfa0e8d/keZQSQUwILkhGUoe/observability/managed-grafana/_media/slurm-job-state-timeline.jpeg?fit=max&auto=format&n=keZQSQUwILkhGUoe&q=85&s=66a42f366ca1b698b0d15f665a1a041f" alt="Slurm Job State Timeline panel" width="1068" height="608" data-path="observability/managed-grafana/_media/slurm-job-state-timeline.jpeg" />

A Slurm job can be in the following states:

| State        | Meaning                                                                                 |
| ------------ | --------------------------------------------------------------------------------------- |
| `RUNNING`    | Actively executing on allocated resources.                                              |
| `PENDING`    | Queued to run when resources are available.                                             |
| `CANCELLED`  | Canceled by a user.                                                                     |
| `COMPLETING` | Finished or canceled, and currently performing cleanup tasks, such as an epilog script. |
| `PREEMPTED`  | Preempted by another job.                                                               |
| `COMPLETED`  | Successfully completed execution.                                                       |

#### Job Info: Node alerts

Job alerts in the Slurm Job / Metrics dashboard are generated by CoreWeave's Mission Control, an automated system that continuously monitors and manages the underlying compute infrastructure to maintain high cluster reliability and availability. These alerts target hardware and system-level issues, such as GPU errors, networking failures, and endpoint timeouts. These conditions aren't typically observable through application-layer metrics such as training metrics or standard logs.

##### Understanding Node alerts and conditions

The following image shows an interruption caused by a Node alert, namely `GPUContainedECCError`.

* Blue lines: Indicate Node conditions.
* Red lines: Indicate Node alerts.

<img src="https://mintcdn.com/coreweave-dbfa0e8d/bmSvzayNAcGFdaNU/observability/managed-grafana/_media/node-alerts.png?fit=max&auto=format&n=bmSvzayNAcGFdaNU&q=85&s=08f0d72a13b4989d9041e3731b6b8395" alt="Grafana dashboard showing Node alerts and conditions." width="2545" height="891" data-path="observability/managed-grafana/_media/node-alerts.png" />

The Node alert indicated by the red line, `GPUContainedECCError`, appears before the drop in compute, while the Node conditions indicated by the blue lines delineate the window the drop in compute occurred within.

The image shows that the H100 cluster experienced a hard fault around 15:15. Throughput dropped from 500 PFLOPS to 0 because the Slurm scheduler pulled a Node out of the pool due to a `GPUContainedECCError`. Understanding how Node alerts and conditions are overlaid on Node metrics can help you diagnose and troubleshoot problems with a running job.

### GPU Metrics

The **GPU Metrics** section displays detailed information related to hardware utilization. In this section, red lines correspond with higher temperature or utilization of the measured value, while green lines indicate a lower value or idle state. Whether these values suggest "good" or "bad" performance depends on the expected behavior and resource utilization of the job.

| Panel                                     | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| ----------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **GPU Temperatures Running Jobs**         | Displays the temperature of the GPUs over time. Generally, an increase in temperature corresponds with a job run, indicating that the GPUs are busy.                                                                                                                                                                                                                                                                                                                                                 |
| **GPU Core Utilization Running Jobs**     | Displays the utilization of GPU cores over time. Red indicates high utilization.                                                                                                                                                                                                                                                                                                                                                                                                                     |
| **SM Utilization Running Jobs**           | The fraction of time at least one warp was active on a multiprocessor, averaged over all multiprocessors. "Active" does not necessarily mean that a warp is actively computing. For example, warps waiting on memory requests are considered active. The value represents an average over a time interval and is not an instantaneous value. A value of 0.8 or greater is necessary, but not sufficient, for effective use of the GPU. A value less than 0.5 likely indicates ineffective GPU usage. |
| **GPU Mem Copy Utilization Running Jobs** | Displays the utilization of GPU memory.                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| **Tensor Core Utilization Running Jobs**  | Displays the utilization of Tensor cores over time.                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| **VRAM Usage**                            | Plots the amount of VRAM used by the GPU over time.                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| **GPUs Temperature**                      | Displays the temperatures of the GPUs over time.                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| **GPUs Power Usage**                      | Plots the power usage of the GPUs, in watts.                                                                                                                                                                                                                                                                                                                                                                                                                                                         |

#### GPU Metrics: Color coding

| Color         | Meaning                      |
| ------------- | ---------------------------- |
| Red           | High utilization.            |
| Orange-Yellow | Medium-low utilization.      |
| Green         | Low utilization or idle.     |
| Black         | No job running at this time. |

<img src="https://mintcdn.com/coreweave-dbfa0e8d/d2qUH6M6caNZ6NLX/observability/managed-grafana/_media/slurm-gpu-metrics.jpeg?fit=max&auto=format&n=d2qUH6M6caNZ6NLX&q=85&s=261d593786c4b5fff10e360b537e10be" alt="GPU Metrics panels" width="3210" height="744" data-path="observability/managed-grafana/_media/slurm-gpu-metrics.jpeg" />

This example shows a high (red) value in the **GPU Core Utilization** panel, a medium (yellow) value in the **GPU Temperatures** panel, and low (green) values in the **GPU Mem Copy Utilization** panel. This indicates that the tracked Slurm job had high utilization of GPU Cores and low utilization of GPU memory, which may be expected for a small model size.

When you compare the fluctuations in GPU Temperature with the charts in the [**Filesystem**](#filesystem) section, you can see that drops in the GPU temperature may correspond with spikes in NFS `write` operations.

<img src="https://mintcdn.com/coreweave-dbfa0e8d/keZQSQUwILkhGUoe/observability/managed-grafana/_media/slurm-nfs-average-request-response.jpeg?fit=max&auto=format&n=keZQSQUwILkhGUoe&q=85&s=6131b2c9231a0b64c8d6d075eff61685" alt="NFS Average Request and Response panels" width="922" height="930" data-path="observability/managed-grafana/_media/slurm-nfs-average-request-response.jpeg" />

### Filesystem

The **Filesystem** section includes information about `read` and `write` operations on the Network File System (NFS) and local files.

| Panel                                            | Description                                                                                                                                                                                                               |
| ------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Local Max Disk IO Utilization (Min 1m)**       | The green line indicates `write` operations and the yellow line indicates `read` operations.                                                                                                                              |
| **Local Avg Bytes Read / Written Per Node (2m)** | The red line indicates `write` operations and the blue line indicates `read` operations.                                                                                                                                  |
| **Local Total bytes Read / Written (2m)**        | The red line indicates `write` operations and the blue line indicates `read` operations.                                                                                                                                  |
| **Local Total Read / Write Rate (2m)**           | The red line indicates `write` operations and the blue line indicates `read` operations.                                                                                                                                  |
| **NFS Average Request Time by Operation**        | Duration requests took from when a request was enqueued to when it was completely handled for a given operation, in seconds. The green line indicates `write` operations and the yellow line indicates `read` operations. |
| **NFS Avg Bytes Read / Written Per Node (2m)**   | The red line indicates `write` operations and the blue line indicates `read` operations.                                                                                                                                  |
| **NFS Total Bytes Read / Written (2m)**          | The red line indicates `write` operations and the blue line indicates `read` operations.                                                                                                                                  |
| **NFS Total Read / Write Rate**                  | The red line indicates `write` operations and the blue line indicates `read` operations.                                                                                                                                  |
| **NFS Average Response Time by Operation**       | Duration requests took to get a reply back after a request for a given operation was transmitted, in seconds. The green line indicates `write` operations and the yellow line indicates `read` operations.                |
| **NFS Avg Write Rate Per Active Node (2m)**      | The red line indicates `write` operations and the dashed green line displays **active nodes**. Only includes nodes writing over 10 KB/s.                                                                                  |
| **NFS Avg Read Rate Per Active Node (2m)**       | The blue line indicates `read` operations and the dashed green line displays **active nodes**. Only includes nodes reading over 10 KB/s.                                                                                  |
| **NFS Nodes with Retransmissions**               | Retransmissions indicate packet loss on the network, either due to congestion or faulty equipment or cabling.                                                                                                             |

<img src="https://mintcdn.com/coreweave-dbfa0e8d/keZQSQUwILkhGUoe/observability/managed-grafana/_media/slurm-nfs-average-read-write.jpeg?fit=max&auto=format&n=keZQSQUwILkhGUoe&q=85&s=19166a90c7d5393ef89dc54ad3a154d7" alt="NFS Read and Write panels" width="1846" height="926" data-path="observability/managed-grafana/_media/slurm-nfs-average-read-write.jpeg" />

#### Filesystem: NFS Average Response and Request

The **NFS Average Response and Request** graphs describe the performance of the filesystem. A slowdown or spike could indicate that the storage is too slow, and that the job might perform better with faster or a different type of storage, such as object storage.

#### Filesystem: NFS Total Read / Write

The **NFS Total Read / Write** graphs typically display a large red spike when a job starts, as the model and data are read in. While the job runs, the graph shows smaller write spikes at regular intervals, which occur as the checkpoints are written out. Compare these graphs with the panels in the [**GPU Metrics**](#gpu-metrics) section to help confirm that running jobs are behaving as expected.

### Node Resources

The **Node Resources** section includes the **CPU Allocation** panel, which displays the total number of CPU cores utilized over the job runtime.
