Kubernetes Training Jobs

Monitor Kubernetes training jobs

Info

For instructions on accessing CoreWeave Grafana Dashboards, see Access CoreWeave Grafana Dashboards.

The Kubernetes Training Job dashboard helps you monitor the hardware resources used by training jobs. It shows metrics for GPU utilization, network bandwidth (such as InfiniBand), and storage I/O (both local and NFS), which help you diagnose performance bottlenecks and confirm that compute-intensive tasks are running efficiently.

| Panel Title | Description |
| --- | --- |
| Kind | Shows the Kubernetes object kind. |
| Name | Shows the name of the resource. |
| Nodes | Shows the total number of Nodes. |
| Pods | Shows the total number of Pods. |
| Uptime | Shows the overall uptime. |
| Pod Readiness Timeline | Shows the readiness state of the Pods over time. |
| Active GPUs | Shows the number of active GPUs. |
| Job Efficiency | Shows the efficiency of running jobs. |
| Current FP8 FLOPS | Displays the current floating-point operations per second in FP8 precision. |
| Node conditions | Displays the current conditions of the Nodes. |
| Alerts | Displays active alerts related to this resource. |
| Nodes (Range) | Shows the individual Pods, their running status, and their uptime on each Node. |
| GPU Temperatures Running Jobs | Shows the GPU temperatures for running jobs. |
| GPU Core Utilization | Shows GPU core usage. |
| SM Utilization | Shows the utilization of the streaming multiprocessors (SMs) on the GPU. |
| GPU Mem Copy | Shows GPU memory copy operations. |
| Tensor Core Util | Shows the utilization of the Tensor Cores. |
| Current FP8 | Shows the current FP8 performance. |
| VRAM Usage | Displays the video RAM (VRAM) usage. |
| GPUs Temperature | Displays the temperature of the GPUs. |
| InfiniBand Aggregate Bandwidth | Shows the total network bandwidth over the InfiniBand interconnect. |
| GPUs Power Usage | Displays the power consumption of the GPUs. |
| Local Max Disk I/O Utilization (1m) | Shows the maximum disk I/O utilization on the local disk over 1 minute. |
| Local Avg Bytes Read / Written Per Node | Shows the average bytes read/written per Node on the local disk. |
| Local Total Bytes Read / Written (2m) | Shows the total bytes read/written on the local disk over 2 minutes. |
| Local Total Read / Write Rate (2m) | Shows the total read/write rate on the local disk over 2 minutes. |
| NFS Average Request Time by Operation | Shows the time, in seconds, from when a request for a given operation was enqueued to when it was completely handled. |
| NFS Avg Bytes Read / Written Per Node | Shows the average bytes read/written per Node on NFS. |
| NFS Total Bytes Read / Written (2m) | Shows the total bytes read/written on NFS over 2 minutes. |
| NFS Total Read / Write Rate (2m) | Shows the total read/write rate on NFS over 2 minutes. |
| NFS Average Response Time by Operation | Shows the time, in seconds, from when a request for a given operation was transmitted to when its reply was received. |
| NFS Avg Write Rate Per Active Node (2m) | Shows the average NFS write rate per active Node. Only includes Nodes reading/writing over 10 KB/s. |
| NFS Avg Read Rate Per Active Node (2m) | Shows the average NFS read rate per active Node. Only includes Nodes reading/writing over 10 KB/s. |
| NFS Nodes with Retransmissions | Shows the count of NFS Nodes experiencing network retransmissions. |
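
To make the "(2m)" rate panels and the "Per Active Node" panels more concrete, the following is a minimal Python sketch of the arithmetic they describe. It is not the dashboard's actual query. The function names, the sample Node names, the reading of 10 KB/s as 10,000 bytes per second, and the choice to apply the threshold to each Node's rate in the direction being averaged are all assumptions made for illustration.

```python
from typing import Mapping

# Assumption: "10 KB/s" is read as 10,000 bytes per second (decimal kilobytes).
ACTIVE_THRESHOLD_BYTES_PER_SEC = 10_000.0


def rate_over_window(bytes_start: float, bytes_end: float,
                     window_seconds: float = 120.0) -> float:
    """Average bytes per second between two cumulative byte counter samples.

    The default window of 120 seconds mirrors the "(2m)" suffix in the panel titles.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (bytes_end - bytes_start) / window_seconds


def avg_rate_per_active_node(
    rates_by_node: Mapping[str, float],
    threshold: float = ACTIVE_THRESHOLD_BYTES_PER_SEC,
) -> float:
    """Average rate across Nodes whose rate is strictly over the threshold.

    Idle Nodes are excluded so they do not pull the average down.
    Returns 0.0 when no Node is active.
    """
    active = [rate for rate in rates_by_node.values() if rate > threshold]
    if not active:
        return 0.0
    return sum(active) / len(active)


# Hypothetical per-Node NFS write rates in bytes per second.
sample_rates = {"node-a": 50_000.0, "node-b": 2_000.0, "node-c": 30_000.0}
active_average = avg_rate_per_active_node(sample_rates)
```

In this sketch, `node-b` falls below the threshold and is left out, so the average is taken over `node-a` and `node-c` only. Averaging over all three Nodes would instead report a lower figure that understates how fast the busy Nodes are writing.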