Kubernetes Training Jobs
Monitor Kubernetes training jobs
To view the dashboard, open the Training Jobs dashboard in Grafana.
Info
For instructions on accessing CoreWeave Grafana dashboards, see Access CoreWeave Grafana Dashboards.
The Kubernetes Training Job dashboard helps monitor hardware resources. It shows metrics for GPU utilization, network bandwidth (like InfiniBand), and storage I/O (both local and NFS), helping to diagnose performance bottlenecks and ensure compute-intensive tasks are running efficiently.
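Panels like these are typically backed by Prometheus queries. As a minimal sketch of pulling the same data programmatically, assuming the cluster's metrics are reachable at a Prometheus endpoint and that GPU utilization is exported by the NVIDIA DCGM exporter as `DCGM_FI_DEV_GPU_UTIL` (the endpoint URL and the `kubernetes_node` label below are illustrative assumptions, not CoreWeave specifics):

```python
"""Minimal sketch: query per-node GPU utilization from Prometheus.

Assumptions (not from this page): metrics are scraped by Prometheus at
PROM_URL, and the NVIDIA DCGM exporter publishes GPU utilization as
DCGM_FI_DEV_GPU_UTIL with a kubernetes_node label.
"""
import requests

PROM_URL = "http://prometheus.example.internal:9090"  # hypothetical endpoint


def gpu_utilization_by_node() -> dict[str, float]:
    # Average per-GPU utilization (%) grouped by node.
    query = "avg by (kubernetes_node) (DCGM_FI_DEV_GPU_UTIL)"
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # Each entry's "value" is a [timestamp, value-as-string] pair.
    return {
        r["metric"].get("kubernetes_node", "unknown"): float(r["value"][1])
        for r in result
    }


if __name__ == "__main__":
    for node, util in sorted(gpu_utilization_by_node().items()):
        print(f"{node}: {util:.1f}% GPU utilization")
```

The table below describes each panel on the dashboard.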
| Panel Title | Description |
|---|---|
| Kind | Shows the Kubernetes object kind. |
| Name | Shows the name of the resource. |
| Nodes | Shows the total number of Nodes. |
| Pods | Shows the total number of Pods. |
| Uptime | Shows the overall uptime. |
| Pod Readiness Timeline | Shows the readiness state of the Pods. |
| Active GPUs | Shows the number of active GPUs. |
| Job Efficiency | Shows the efficiency of running jobs. |
| Current FP8 FLOPS | Displays the current floating-point operations per second in FP8 precision. |
| Node conditions | Displays the current conditions of the Nodes. |
| Alerts | Displays active alerts related to this resource. |
| Nodes (Range) | Shows the individual Pods, their running status, and uptime on each Node. |
| GPU Temperatures Running Jobs | Shows the GPU temperatures for running jobs. |
| GPU Core Utilization | Shows GPU core usage. |
| SM Utilization | Shows the utilization of streaming multiprocessors on the GPU. |
| GPU Mem Copy | Shows GPU memory copy operations. |
| Tensor Core Util | Shows the utilization of Tensor Cores. |
| Current FP8 | Shows the current FP8 performance. |
| VRAM Usage | Displays the video RAM usage. |
| GPUs Temperature | Displays the temperature of the GPUs. |
| InfiniBand Aggregate Bandwidth | Shows the total network bandwidth over the InfiniBand interconnect. |
| GPUs Power Usage | Displays the power consumption of the GPUs. |
| Local Max Disk I/O Utilization (1m) | Shows the maximum disk I/O utilization on the local disk over 1 minute. |
| Local Avg Bytes Read / Written Per Node | Shows the average bytes read/written per Node on the local disk. |
| Local Total Bytes Read / Written (2m) | Shows the total bytes read/written on the local disk over 2 minutes. |
| Local Total Read / Write Rate (2m) | Shows the total read/write rate on the local disk over 2 minutes. |
| NFS Average Request Time by Operation | Shows how long requests for a given operation took from being enqueued to being completely handled, in seconds. |
| NFS Avg Bytes Read / Written Per Node | Shows the average bytes read/written per Node on the NFS. |
| NFS Total Bytes Read / Written (2m) | Shows the total bytes read/written on the NFS over 2 minutes. |
| NFS Total Read / Write Rate (2m) | Shows the total read/write rate on the NFS over 2 minutes. |
| NFS Average Response Time by Operation | Shows how long requests for a given operation took to receive a reply after being transmitted, in seconds. |
| NFS Avg Write Rate Per Active Node (2m) | Shows the average NFS write rate per active Node. Only includes Nodes reading/writing over 10KB/s. |
| NFS Avg Read Rate Per Active Node (2m) | Shows the average NFS read rate per active Node. Only includes Nodes reading/writing over 10KB/s. |
| NFS Nodes with Retransmissions | Shows the count of NFS Nodes experiencing network retransmissions. |
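Panels such as NFS Nodes with Retransmissions can also be reproduced as ad-hoc queries. Below is a minimal sketch, assuming node_exporter's NFS collector is enabled and exposes the counter `node_nfs_rpc_retransmissions_total` (the metric name and endpoint describe a typical node_exporter setup and are assumptions, not confirmed CoreWeave specifics):

```python
"""Sketch: find nodes with recent NFS RPC retransmissions.

Assumptions (not from this page): node_exporter's nfs collector exposes
node_nfs_rpc_retransmissions_total, and Prometheus is reachable at PROM_URL.
"""
import requests

PROM_URL = "http://prometheus.example.internal:9090"  # hypothetical endpoint


def nodes_with_retransmissions(window: str = "5m") -> list[tuple[str, float]]:
    # Nodes whose NFS retransmission counter increased within the window.
    query = f"increase(node_nfs_rpc_retransmissions_total[{window}]) > 0"
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    return [
        (r["metric"].get("instance", "unknown"), float(r["value"][1]))
        for r in resp.json()["data"]["result"]
    ]


if __name__ == "__main__":
    for instance, count in nodes_with_retransmissions():
        print(f"{instance}: ~{count:.0f} retransmissions in the last 5m")
```

Sustained retransmissions usually point to packet loss between the node and the NFS server rather than to the filesystem itself.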
Node alerts
| Alert Name | Description |
|---|---|
| DCGMSRAMThresholdExceeded | This alert indicates that the SRAM threshold has been exceeded on a GPU. This points to a memory issue and requires investigation by reliability teams. |
| DPUContainerdThreadExhaustion | The DPUContainerdThreadExhaustion alert indicates that the containerd process has run out of threads on the DPU. This requires an update to the dpu-health container to patch. |
| DPUContainerdThreadExhaustionCPX | The DPUContainerdThreadExhaustionCPX alert indicates that the containerd process has run out of threads on the DPU. This requires an update to the dpu-health container to patch. |
| DPULinkFlappingCPX | The DPULinkFlappingCPX alert indicates that a DPU (Data Processing Unit) link has become unstable. It triggers when a link on a DPU flaps (goes up and down) multiple times within a monitoring period. |
| DPUNetworkFrameErrs | The DPUNetworkFrameErrs alert indicates frame errors occurring on DPU (Data Processing Unit) network interfaces. These errors typically indicate a problem with the physical network link. |
| DPURouteCountMismatch | The DPURouteCountMismatch alert indicates an inconsistency between the routes the DPU has learned and the routes it has installed. A software component on the DPU will need to be restarted. |
| DPURouteLoop | The DPURouteLoop alert indicates that a route loop has been detected on the DPU. This can be caused by a miscabling issue in the data center. |
| DPURouteLoopCPX | The DPURouteLoopCPX alert indicates that a route loop has been detected on the DPU. This can be caused by a miscabling issue in the data center. |
| DPUUnexpectedPuntedRoutes | The DPUUnexpectedPuntedRoutes alert indicates a failure in offloading, which can cause connectivity issues for the host. The node will be automatically reset to restore proper connectivity. |
| DPUUnexpectedPuntedRoutesCPX | The DPUUnexpectedPuntedRoutesCPX alert indicates a failure in offloading, which can cause connectivity issues for the host. The issue typically occurs after a power reset (when the host reboots without the DPU rebooting). |
| ECCDoubleVolatileErrors | The ECCDoubleVolatileErrors alert indicates that DCGM double-bit volatile ECC (Error Correction Code) errors are increasing over a 5-minute period on a GPU. |
| GPUContainedECCError | GPU Contained ECC Error (Xid 94) indicates an uncorrectable memory error was encountered and contained. The workload has been impacted, but the node is generally healthy. No action is needed. |
| GPUECCUncorrectableErrorUncontained | GPU Uncorrectable Error Uncontained (Xid 95) indicates an uncorrectable memory error was encountered but not successfully contained. The workload has been impacted and the node will be restarted. |
| GPUFallenOffBus | GPU Fallen Off The Bus (Xid 79) indicates a fatal hardware error where the GPU shuts down and is completely inaccessible from the system. The node will immediately and automatically be taken out of service. |
| GPUFallenOffBusHGX | GPU Fallen Off The Bus (Xid 79) indicates a fatal hardware error where the GPU shuts down and is completely inaccessible from the system. The node will immediately and automatically be taken out of service. |
| GPUNVLinkSWDefinedError | NVLink SW Defined Error (Xid 155) is triggered by link-down events that are flagged as intentional. The node will be reset. |
| GPUPGraphicsEngineError | GPU Graphics Engine Error (Xid 69) has impacted the workload, but the node is generally healthy. No action is needed. |
| GPUPRowRemapFailure | GPU Row Remap Failure (Xid 64) is caused by an uncorrectable error resulting in a failed GPU memory remapping event. The node will immediately and automatically be taken out of service. |
| GPUTimeoutError | GPU Timeout Error (Xid 46) indicates the GPU stopped processing; the node will be restarted. |
| GPUUncorrectableDRAMError | GPU Uncorrectable DRAM Error (Xid 171) provides complementary information to Xid 48. No action is needed. |
| GPUUncorrectableSRAMError | GPU Uncorrectable SRAM Error (Xid 172) provides complementary information to Xid 48. No action is needed. |
| GPUVeryHot | The GPUVeryHot alert triggers when a GPU's temperature exceeds 90°C. |
| KubeNodeNotReady | The KubeNodeNotReady alert indicates that a node's status condition is not Ready in a Kubernetes cluster. This alert can be an indicator of critical system health issues. |
| KubeNodeNotReadyHGX | The KubeNodeNotReadyHGX alert indicates that a node has been unready or offline for more than 15 minutes. |
| ManyUCESingleBankH100 | The ManyUCESingleBankH100 alert triggers when there are two or more DRAM Uncorrectable Errors (UCEs) on the same row remapper bank of an H100 GPU. |
| MetalDevRedfishError | The MetalDevRedfishError alert indicates that an out-of-band action against a BMC failed. |
| NVL72GPUHighFECCKS | The NVL72GPUHighFECCKS alert indicates that a GPU is observing a high rate of forward error correction, which points to signal integrity issues. |
| NVL72SwitchHighFECCKS | The NVL72SwitchHighFECCKS alert indicates that an NVSwitch is observing a high rate of forward error correction, which points to signal integrity issues. |
| NVLinkDomainFullyTriaged | NVLinkDomainFullyTriaged indicates that a rack is entirely triaged. The rack should either be investigated for an unexpected rack-level event or returned to the fleet. |
| NVLinkDomainProductionNodeCountLow | NVLinkDomainProductionNodeCountLow indicates that a rack has fewer nodes in a production state than expected. The rack will need manual intervention to either restore capacity or reclaim it for further triage. |
| NodeBackendLinkFault | The NodeBackendLinkFault alert indicates that backend bandwidth is degraded and the interface may potentially be lost. |
| NodeBackendMisconnected | Node-to-leaf ports are either missing or incorrectly connected. |
| NodeCPUHZThrottleLong | An extended period of CPU frequency throttling has occurred. CPU throttling most often occurs due to power delivery or thermal problems at the node level. The node will immediately and automatically be taken out of service and the job interrupted. |
| NodeGPUNVLinkDown | The node is experiencing NVLink issues and will be automatically triaged. |
| NodeGPUXID149NVSwitch | A GPU has experienced a fatal NVLink error. The node will be restarted to recover the GPU. |
| NodeGPUXID149s4aLinkIssueFordPintoRepeated | A GPU has experienced a fatal NVLink error. This is a frequent offender and automation will remove the node from the cluster. |
| NodeGPUXID149s4aLinkIssueLamboRepeated | A GPU has experienced a fatal NVLink error. This is a frequent offender and automation will remove the node from the cluster. |
| NodeGPUXID149s4aLinkIssueNeedsUpgradeRepeated | A GPU has experienced a fatal NVLink error. This is a frequent offender and automation will remove the node from the cluster. |
| NodeLoadAverageHigh | The NodeLoadAverageHigh alert triggers when a node's load average exceeds 1000 for more than 15 minutes. |
| NodeMemoryError | The NodeMemoryError alert indicates that a node has one or more bad DIMM (memory) modules. |
| NodeNetworkReceiveErrs | The NodeNetworkReceiveErrs alert indicates that a network interface has encountered receive errors exceeding a 1% threshold over a 2-minute window for 1 hour. |
| NodePCIErrorH100GPU | The NodePCIErrorH100GPU alert indicates that a GPU is experiencing PCI bus communication errors. |
| NodePCIErrorH100PLX | The NodePCIErrorH100PLX alert indicates a high rate of PCIe bus errors occurring on the PLX switch that connects H100 GPUs. |
| NodeRepeatUCE | The NodeRepeatUCE alert indicates that a node has experienced frequent GPU Uncorrectable ECC (UCE) errors. |
| NodeVerificationFailureNVFabric | The node is experiencing NVLink issues and will be automatically triaged. |
| NodeVerificationMegatronDeadlock | HPC-Perftest failed the megatron_lm test due to a possible deadlock. The node should be triaged. |
| PendingStateExtendedTime | The PendingStateExtendedTime alert indicates that a node has been in a pending state for an extended period of time. This alert helps identify nodes that need to be removed from their current state but are stuck for an extended time. |
| PendingStateExtendedTimeLowGpuUtil | The PendingStateExtendedTimeLowGpuUtil alert triggers when a node has been in a pending state for more than 10 days and has had less than 1% GPU utilization in the last hour. This alert helps identify nodes that need to be removed from their current state but have been stuck for an extended time. |
| UnknownNVMLErrorOnContainerStart | The UnknownNVMLErrorOnContainerStart alert typically indicates that a GPU has fallen off the bus or is experiencing hardware issues. |
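Many of these alerts are keyed to NVIDIA Xid codes reported by the GPU driver. To cross-check an alert against a node's kernel log, here is a minimal sketch, assuming the common `NVRM: Xid (PCI:...): N` dmesg format (the exact format can vary by driver version, and reading dmesg typically requires root):

```python
"""Sketch: scan the kernel log for the GPU Xid codes referenced above.

Assumption (not from this page): NVIDIA driver Xid events appear in dmesg
as lines like "NVRM: Xid (PCI:0000:1b:00): 79, ..."; formatting may vary
by driver version.
"""
import re
import subprocess

# Xid codes and one-line summaries condensed from the alert table above.
XID_SUMMARIES = {
    46: "GPU timeout: GPU stopped processing",
    64: "Row remap failure after an uncorrectable error",
    69: "Graphics engine error",
    79: "GPU fallen off the bus",
    94: "Contained uncorrectable ECC error",
    95: "Uncontained uncorrectable ECC error",
    149: "Fatal NVLink error",
    155: "NVLink software-defined error",
    171: "Uncorrectable DRAM error",
    172: "Uncorrectable SRAM error",
}

XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)")


def scan_dmesg() -> None:
    # Run dmesg on the affected node and report any Xid events found.
    out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    for match in XID_RE.finditer(out):
        pci, xid = match.group(1), int(match.group(2))
        summary = XID_SUMMARIES.get(xid, "see NVIDIA Xid documentation")
        print(f"{pci}: Xid {xid} ({summary})")


if __name__ == "__main__":
    scan_dmesg()
```

Run it directly on the node named in the alert; matching a logged Xid code against the table above helps confirm whether automated remediation (restart, triage, or removal from service) is expected.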