Kubernetes Training Jobs

Monitor Kubernetes training jobs

To view the dashboard, go to the Training Jobs dashboard.

Info

For accessing CoreWeave Grafana Dashboards instructions, see Access CoreWeave Grafana Dashboards.

The Kubernetes Training Job dashboard helps monitor hardware resources. It shows metrics for GPU utilization, network bandwidth (like InfiniBand), and storage I/O (both local and NFS), helping to diagnose performance bottlenecks and ensure compute-intensive tasks are running efficiently.

Panel Title	Description
Kind	Shows the Kubernetes object kind.
Name	Shows the name of the resource.
Nodes	Shows the total number of Nodes.
Pods	Shows the total number of Pods.
Uptime	Shows the overall uptime.
Pod Readiness Timeline	Shows the readiness state of the Pods.
Active GPUs	Shows the number of active GPUs.
Job Efficiency	Shows the efficiency of running jobs.
Current FP8 FLOPS	Displays the current floating-point operations per second in FP8 precision.
Node conditions	Displays the current conditions of the Nodes.
Alerts	Displays active alerts related to this resource.
Nodes (Range)	Shows the individual Pods, their running status, and uptime on each Node.
GPU Temperatures Running Jobs	Shows the GPU temperatures for running jobs.
GPU Core Utilization	Shows GPU core usage.
SM Utilization	Shows the utilization of streaming multiprocessors on the GPU.
GPU Mem Copy	Shows GPU memory copy operations.
Tensor Core Util	Shows the utilization of Tensor Cores.
Current FP8	Shows the current FP8 performance.
VRAM Usage	Displays the video RAM usage.
GPUs Temperature	Displays the temperature of the GPUs.
InfiniBand Aggregate Bandwidth	Shows the total network bandwidth over the InfiniBand interconnect.
GPUs Power Usage	Displays the power consumption of the GPUs.
Local Max Disk I/O Utilization (1m)	Shows the maximum disk I/O utilization on the local disk over 1 minute.
Local Avg Bytes Read / Written Per Nod	Shows the average bytes read/written per Node on the local disk.
Local Total Bytes Read / Written (2m)	Shows the total bytes read/written on the local disk over 2 minutes.
Local Total Read / Write Rate (2m)	Shows the total read/write rate on the local disk over 2 minutes.
NFS Average Request Time by Operation	Shows duration requests took from when a request was enqueued to when it was completely handled for a given operation, in seconds.
NFS Avg Bytes Read / Written Per Node	Shows the average bytes read/written per Node on the NFS.
NFS Total Bytes Read / Written (2m)	Shows the total bytes read/written on the NFS over 2 minutes.
NFS Total Read / Write Rate (2m)	Shows the total read/write rate on the NFS over 2 minutes.
NFS Average Response Time by Operation	Shows duration requests took to get a reply back after a request for a given operation was transmitted, in seconds.
NFS Avg Write Rate Per Active Node (2m)	Shows the average NFS write rate per active Node. Only includes Nodes reading/writing over 10KB/s.
NFS Avg Read Rate Per Active Node (2m)	Shows the average NFS read rate per active Node. Only includes Nodes reading/writing over 10KB/s.
NFS Nodes with Retransmissions	Shows the count of NFS Nodes experiencing network retransmissions.