Kubernetes Training Jobs
Monitor Kubernetes training jobs
Info
For instructions on accessing CoreWeave Grafana dashboards, see Access CoreWeave Grafana Dashboards.
The Kubernetes Training Job dashboard helps you monitor the hardware resources used by training jobs. It shows metrics for GPU utilization, network bandwidth (such as InfiniBand), and storage I/O (both local and NFS), which help you diagnose performance bottlenecks and confirm that compute-intensive tasks are running efficiently.
Panel Title | Description |
---|---|
Kind | Shows the Kubernetes object kind. |
Name | Shows the name of the resource. |
Nodes | Shows the total number of Nodes. |
Pods | Shows the total number of Pods. |
Uptime | Shows the overall uptime. |
Pod Readiness Timeline | Shows the readiness state of the Pods. |
Active GPUs | Shows the number of active GPUs. |
Job Efficiency | Shows the efficiency of running jobs. |
Current FP8 FLOPS | Displays the current floating-point operations per second in FP8 precision. |
Node conditions | Displays the current conditions of the Nodes. |
Alerts | Displays active alerts related to this resource. |
Nodes (Range) | Shows the individual Pods, their running status, and uptime on each Node. |
GPU Temperatures Running Jobs | Shows the GPU temperatures for running jobs. |
GPU Core Utilization | Shows GPU core usage. |
SM Utilization | Shows the utilization of streaming multiprocessors on the GPU. |
GPU Mem Copy | Shows GPU memory copy operations. |
Tensor Core Util | Shows the utilization of Tensor Cores. |
Current FP8 | Shows the current FP8 performance. |
VRAM Usage | Displays the video RAM usage. |
GPUs Temperature | Displays the temperature of the GPUs. |
InfiniBand Aggregate Bandwidth | Shows the total network bandwidth over the InfiniBand interconnect. |
GPUs Power Usage | Displays the power consumption of the GPUs. |
Local Max Disk I/O Utilization (1m) | Shows the maximum disk I/O utilization on the local disk over 1 minute. |
Local Avg Bytes Read / Written Per Node | Shows the average bytes read/written per Node on the local disk. |
Local Total Bytes Read / Written (2m) | Shows the total bytes read/written on the local disk over 2 minutes. |
Local Total Read / Write Rate (2m) | Shows the total read/write rate on the local disk over 2 minutes. |
NFS Average Request Time by Operation | Shows the average time, in seconds, from when a request for a given operation was enqueued to when it was completely handled. |
NFS Avg Bytes Read / Written Per Node | Shows the average bytes read/written per Node on the NFS. |
NFS Total Bytes Read / Written (2m) | Shows the total bytes read/written on the NFS over 2 minutes. |
NFS Total Read / Write Rate (2m) | Shows the total read/write rate on the NFS over 2 minutes. |
NFS Average Response Time by Operation | Shows the average time, in seconds, from when a request for a given operation was transmitted to when its reply was received. |
NFS Avg Write Rate Per Active Node (2m) | Shows the average NFS write rate per active Node over 2 minutes. Includes only Nodes reading or writing at more than 10 KB/s. |
NFS Avg Read Rate Per Active Node (2m) | Shows the average NFS read rate per active Node over 2 minutes. Includes only Nodes reading or writing at more than 10 KB/s. |
NFS Nodes with Retransmissions | Shows the count of NFS Nodes experiencing network retransmissions. |
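
The "per active Node" panels average only over Nodes above an activity threshold, so idle Nodes do not pull the average down. The following is a minimal sketch of that arithmetic, not the dashboard's actual query. The function name, the sample Node names and rates, and the reading of 10 KB/s as 10,000 bytes per second are assumptions made for illustration, as is applying the threshold to the same rate that is being averaged.

```python
from typing import Mapping, Optional

# Assumption: "10 KB/s" is read as 10,000 bytes per second.
ACTIVE_THRESHOLD_BYTES_PER_SEC = 10 * 1000


def avg_rate_per_active_node(
    rates_by_node: Mapping[str, float],
    threshold: float = ACTIVE_THRESHOLD_BYTES_PER_SEC,
) -> Optional[float]:
    """Average a per-Node rate (bytes/s) over Nodes above the threshold.

    Returns None when no Node is active, so "no activity" is not
    confused with an average of zero.
    """
    active = [rate for rate in rates_by_node.values() if rate > threshold]
    if not active:
        return None
    return sum(active) / len(active)


# Hypothetical per-Node NFS write rates in bytes per second.
write_rates = {"node-a": 2_000_000.0, "node-b": 4_000_000.0, "node-c": 500.0}

# node-c is below the threshold, so only node-a and node-b are averaged.
print(avg_rate_per_active_node(write_rates))
```

With these sample values, a plain average over all three Nodes would be lower than the per-active-Node figure, which is why the two kinds of panel can disagree when some Nodes in a job are idle.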