# Kubernetes Training Jobs

> Grafana dashboard for monitoring hardware resources during training jobs, including GPU and InfiniBand

To view the dashboard, go to the [Training Jobs dashboard](https://cks-grafana.coreweave.com/d/k8s-training-jobs/kubernetes-training-jobs).

<Info>
  For instructions about accessing CoreWeave Grafana dashboards, see [Access and use CoreWeave Grafana dashboards](/observability/managed-grafana/access).
</Info>

The Kubernetes Training Jobs dashboard monitors the hardware resources that training jobs use. It shows metrics for GPU utilization, network bandwidth (such as InfiniBand), and storage I/O (both local disk and NFS), which help you diagnose performance bottlenecks and confirm that compute-intensive workloads are running efficiently.

| Panel Title                                 | Description                                                                                                                        |
| ------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| **Kind**                                    | Shows the Kubernetes object kind.                                                                                                  |
| **Name**                                    | Shows the name of the resource.                                                                                                    |
| **Nodes**                                   | Shows the total number of Nodes.                                                                                                   |
| **Pods**                                    | Shows the total number of Pods.                                                                                                    |
| **Uptime**                                  | Shows the overall uptime.                                                                                                          |
| **Pod Readiness Timeline**                  | Shows the readiness state of the Pods.                                                                                             |
| **Active GPUs**                             | Shows the number of active GPUs.                                                                                                   |
| **Job Efficiency**                          | Shows the efficiency of running jobs.                                                                                              |
| **Current FP8 FLOPS**                       | Displays the current floating-point operations per second in FP8 precision.                                                        |
| **Node conditions**                         | Displays the current conditions of the Nodes.                                                                                      |
| **Alerts**                                  | Displays active alerts related to this resource.                                                                                   |
| **Nodes (Range)**                           | Shows the individual Pods, their running status, and uptime on each Node.                                                          |
| **GPU Temperatures Running Jobs**           | Shows the GPU temperatures for running jobs.                                                                                       |
| **GPU Core Utilization**                    | Shows GPU core usage.                                                                                                              |
| **SM Utilization**                          | Shows the utilization of streaming multiprocessors on the GPU.                                                                     |
| **GPU Mem Copy**                            | Shows GPU memory copy operations.                                                                                                  |
| **Tensor Core Util**                        | Shows the utilization of Tensor Cores.                                                                                             |
| **Current FP8**                             | Shows the current FP8 performance.                                                                                                 |
| **VRAM Usage**                              | Displays the video RAM usage.                                                                                                      |
| **GPUs Temperature**                        | Displays the temperature of the GPUs.                                                                                              |
| **InfiniBand Aggregate Bandwidth**          | Shows the total network bandwidth over the InfiniBand interconnect.                                                                |
| **GPUs Power Usage**                        | Displays the power consumption of the GPUs.                                                                                        |
| **Local Max Disk I/O Utilization (1m)**     | Shows the maximum disk I/O utilization on the local disk over 1 minute.                                                            |
| **Local Avg Bytes Read / Written Per Node** | Shows the average bytes read/written per Node on the local disk.                                                                   |
| **Local Total Bytes Read / Written (2m)**   | Shows the total bytes read/written on the local disk over 2 minutes.                                                               |
| **Local Total Read / Write Rate (2m)**      | Shows the total read/write rate on the local disk over 2 minutes.                                                                  |
| **NFS Average Request Time by Operation**   | Shows the average time, in seconds, from when a request is enqueued until it is fully handled, for a given operation.              |
| **NFS Avg Bytes Read / Written Per Node**   | Shows the average bytes read/written per Node on the NFS.                                                                          |
| **NFS Total Bytes Read / Written (2m)**     | Shows the total bytes read/written on the NFS over 2 minutes.                                                                      |
| **NFS Total Read / Write Rate (2m)**        | Shows the total read/write rate on the NFS over 2 minutes.                                                                         |
| **NFS Average Response Time by Operation**  | Shows the average time, in seconds, from when a request is transmitted until its reply is received, for a given operation.         |
| **NFS Avg Write Rate Per Active Node (2m)** | Shows the average NFS write rate per active Node. Only includes Nodes reading/writing over 10 KB/s.                                |
| **NFS Avg Read Rate Per Active Node (2m)**  | Shows the average NFS read rate per active Node. Only includes Nodes reading/writing over 10 KB/s.                                 |
| **NFS Nodes with Retransmissions**          | Shows the count of NFS Nodes experiencing network retransmissions.                                                                 |
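
The "per active Node" panels average only over Nodes whose NFS throughput exceeds the 10 KB/s threshold, so idle Nodes do not pull the average down. The following Python sketch illustrates that arithmetic only. It is not the dashboard's actual query: the function name, the sample Node names and rates, and the reading of 10 KB/s as 10,000 bytes per second are all assumptions made for illustration.

```python
# Illustrative only: mimics the "average per active Node" idea with plain Python.
# Assumption: 10 KB/s is treated as 10,000 bytes per second.
ACTIVE_THRESHOLD_BYTES_PER_SEC = 10 * 1000


def avg_rate_per_active_node(rates_by_node, threshold=ACTIVE_THRESHOLD_BYTES_PER_SEC):
    """Average the rates of Nodes whose rate exceeds the threshold.

    rates_by_node maps a Node name to its rate in bytes per second.
    Returns 0.0 when no Node is above the threshold.
    """
    active = [rate for rate in rates_by_node.values() if rate > threshold]
    if not active:
        return 0.0
    return sum(active) / len(active)


# Hypothetical sample: node-b is below the threshold and is excluded.
rates = {"node-a": 50_000.0, "node-b": 2_000.0, "node-c": 30_000.0}
print(avg_rate_per_active_node(rates))  # 40000.0
```

In this sample, averaging across all three Nodes would give roughly 27,333 bytes per second, while the active-only average is 40,000 bytes per second. This is why the active-Node panels can read higher than the corresponding "Per Node" averages when some Nodes are idle.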
