Slurm / Job Metrics
View detailed metrics about a Slurm job with Grafana
The Slurm / Job Metrics dashboard displays detailed information about the performance and hardware utilization of a selected Slurm job.
Customers can use this dashboard to:
- Monitor the GPU and CPU utilization of a given Slurm job
- Track the rate of filesystem operations related to the Slurm job
- View node conditions and alerts that may impact the performance of the Slurm job
Prerequisites
You must be a member of the `admin`, `metrics`, or `write` groups to access Grafana dashboards.
Open the dashboard
- Log in to the CoreWeave Cloud Console.
- In the left navigation, select Grafana to launch your Managed Grafana instance.
- Click Dashboards.
- Expand SUNK, then choose Slurm / Job Metrics.
- Set the appropriate filters and parameters at the top of the page.
If you are already logged in to CoreWeave Cloud Console, you can open the Slurm / Job Metrics dashboard directly from this link.
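The steps above use the Cloud Console and the Grafana UI. If your Managed Grafana instance permits service-account tokens, a search like the following can also locate the dashboard programmatically. This is a minimal sketch only; the Grafana URL and API token shown are placeholders, not values from this page.

```python
"""Optional sketch: locate the dashboard via Grafana's HTTP search API
instead of clicking through the UI. The Grafana URL and API token are
placeholders, and this assumes your Managed Grafana instance permits
service-account tokens."""
import requests

GRAFANA_URL = "https://grafana.example.internal"  # placeholder URL
API_TOKEN = "glsa_xxxxxxxxxxxx"                   # placeholder service-account token

resp = requests.get(
    f"{GRAFANA_URL}/api/search",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    params={"query": "Slurm / Job Metrics"},
    timeout=10,
)
resp.raise_for_status()

# Each search result includes the dashboard title and its relative URL.
for dash in resp.json():
    print(dash["title"], "->", f"{GRAFANA_URL}{dash['url']}")
```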
Filters and parameters
Use these filters at the top of the page to choose the data you want to view:
Field | Value |
---|---|
Data Source | The Prometheus data source selector. |
Logs Source | The Loki logs source selector. |
Org | The organization that owns the cluster. |
Cluster | The specific Kubernetes cluster in the organization to view. |
Namespace | The Kubernetes namespace where the Slurm cluster is located. |
Slurm Cluster | The Slurm cluster containing the job to view. |
Job Name | The name of the Slurm job to view. Select "All" to view all the Slurm jobs in the Slurm Cluster. |
Job ID | The Slurm Job ID to view. Select "All" to view all the Slurm jobs in the Slurm Cluster. |
Node Conditions | Toggled on by default. Click to disable. |
Pending Node Alerts | Toggled on by default. Click to disable. |
Firing Node Alerts | Toggled off by default. Click to enable. |
InfiniBand Fabric Flaps | Toggled off by default. Click to enable. |
The dashboard also includes buttons with links to the SLURM / Namespace dashboard and `slurmctld` logs.
Set the time range and refresh rate parameters at the top-right of the page. The default time range is 5 minutes, and the default refresh rate is 1 minute.
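The same filtering can be reproduced against the Prometheus data source directly. The sketch below is illustrative only: the Prometheus endpoint, the dcgm-exporter metric name, and the `slurm_job_id` label are assumptions that may not match how metrics are labeled in your cluster.

```python
"""Minimal sketch: query the Prometheus data source the dashboard uses for
average GPU utilization of one Slurm job. The endpoint URL, the metric name
DCGM_FI_DEV_GPU_UTIL, and the `slurm_job_id` label are placeholders --
substitute the values that apply to your cluster."""
import requests

PROMETHEUS_URL = "https://prometheus.example.internal"  # placeholder endpoint
JOB_ID = "12345"                                        # Slurm Job ID to filter on

# PromQL roughly equivalent to filtering the dashboard by Job ID.
query = f'avg(DCGM_FI_DEV_GPU_UTIL{{slurm_job_id="{JOB_ID}"}})'

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": query},
    timeout=10,
)
resp.raise_for_status()

result = resp.json()["data"]["result"]
if result:
    print(f"Average GPU utilization for job {JOB_ID}: {result[0]['value'][1]}%")
else:
    print(f"No GPU metrics found for job {JOB_ID}")
```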
Panel descriptions
Job Info
The Job Info section displays identifying information about the selected Slurm job, including:
Panel | Displays |
---|---|
Job Id | The Slurm job ID. |
Name | The name of the Slurm job. |
Last State | Displays the most recently reported state of the Slurm job. |
User | The user who created the Slurm job. |
Account | The user account running the Slurm job. |
Nodes | The number of nodes allocated to the Slurm job. |
Partition | The partition where the selected job is running. |
Job State Timeline | A chart that shows the Slurm job's states over time. |
Uptime | The Slurm job uptime in seconds. |
Active GPUs | The number of GPUs allocated to the Slurm job that are currently running. |
Job Efficiency | Indicates how active the GPUs were while working on the selected job. This value is estimated based on idle time, defined as a node with at least 1 GPU under 50% utilization. The estimate excludes restarts and checkpointing. This is not a Model FLOPS Utilization (MFU) metric. A rough sketch of this idle-time heuristic appears after this table. |
Current FP8 Flops | Graphs the compute rate over the job run, assuming FP8 training. The graph typically displays peaks and valleys in a regular pattern; points with lower compute values often correspond with loading data or saving checkpoints. |
Node conditions | Lists nodes that have been assigned a condition, along with timestamps indicating when the condition was noticed and addressed. |
Alerts | Displays all alerts with a severity level higher than Informational, over a 10-minute interval. Alerts may have a PENDING or FIRING state. |
Nodes (Range) | Lists all nodes along with their state, uptime, and a link to their associated slurmd logs. |
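As referenced in the Job Efficiency row above, the efficiency value is an idle-time heuristic rather than an MFU measurement. The following is a rough sketch of that heuristic using made-up utilization samples; it is not the dashboard's actual query, and it ignores restarts and checkpointing.

```python
"""Rough sketch of the idle-time heuristic behind the Job Efficiency panel.
A node counts as idle at a sample if at least one of its GPUs is under 50%
utilization. The sample data is made up; this is an approximation of the
definition above, not the dashboard's query."""

# Per-node samples: each inner list holds per-GPU utilization percentages
# at one point in time.
utilization_samples = {
    "node-01": [[97, 95, 96, 98], [40, 92, 95, 97], [96, 94, 99, 95]],
    "node-02": [[93, 91, 95, 90], [94, 96, 92, 95], [35, 30, 33, 31]],
}

IDLE_THRESHOLD = 50  # a GPU below this utilization marks the node as idle

total = 0
busy = 0
for samples in utilization_samples.values():
    for gpu_utils in samples:
        total += 1
        if min(gpu_utils) >= IDLE_THRESHOLD:
            busy += 1

efficiency = 100 * busy / total
print(f"Estimated job efficiency: {efficiency:.1f}%")  # 66.7% for the data above
```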
Job Info: Job State Timeline and Last State
The Job Info section contains panels that display information about the state of the selected Slurm job. Last State displays the most recent reported state, while the Job State Timeline displays the job's status over time.
A Slurm job can be in the following states:
State | Meaning |
---|---|
RUNNING | Actively executing on allocated resources. |
PENDING | Queued to run when resources are available. |
CANCELLED | Cancelled by a user. |
COMPLETING | Finished or cancelled, and currently performing cleanup tasks, such as an epilog script. |
PREEMPTED | Preempted by another job. |
COMPLETED | Successfully completed execution. |
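If you have shell access to the Slurm cluster, the same state history can be pulled from Slurm's accounting database with `sacct`. This is a minimal sketch that assumes the Slurm client tools are configured on the node where it runs; the Job ID is a placeholder.

```python
"""Sketch: fetch a job's recorded states from Slurm accounting with `sacct`.
Assumes Slurm client tools are available; the Job ID is a placeholder."""
import subprocess

JOB_ID = "12345"  # placeholder Slurm Job ID

# --parsable2 prints pipe-delimited fields with no trailing delimiter.
out = subprocess.run(
    ["sacct", "-j", JOB_ID, "--format=JobID,State,Start,End,Elapsed",
     "--parsable2", "--noheader"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    job_id, state, start, end, elapsed = line.split("|")
    print(f"{job_id}: {state} (start {start}, end {end}, elapsed {elapsed})")
```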
Job Info: Alerts
Job alerts in the Slurm / Job Metrics dashboard are generated by CoreWeave's Mission Control, an automated system that continuously monitors and manages the underlying compute infrastructure to maintain high cluster reliability and availability. These alerts target hardware and system-level issues, such as GPU errors, networking failures, and endpoint timeouts: conditions that are not typically observable through application-layer metrics such as training metrics or standard logs.
Alert | Description |
---|---|
IBFlappingMultipleNodeInterfaces | Multiple InfiniBand interfaces have observed a link flap at the same time. This usually indicates a larger problem in the InfiniBand fabric. Link flaps can have a temporary impact on job performance. The InfiniBand fabric will be diagnosed. |
IBFlappingSingleNodeInterface | A single InfiniBand interface has observed a link flap. These can occur due to faulty optics. Individual link flaps can have a temporary impact on job performance. After repeated flaps, the node will be taken out of service and the optic serviced. |
IBHighBEROnNode | IBHighBEROnNode is an alert that triggers when a node's InfiniBand port experiences a Bit Error Rate (BER) that exceeds a threshold of 1e-12. |
IBRxErrorOnNode | The IBRxErrorOnNode alert indicates receive errors on an InfiniBand (IB) port. It is triggered when there are 15 or more receive errors detected within a 10-minute period on a node that has been up for more than 900 seconds. |
NodeIBLeafMisconnected | The NodeIBLeafMisconnected alert indicates that a node's InfiniBand connections to network leaves are either missing or connected to the wrong leaf switches. |
NodeIBLinkFault | The NodeIBLinkFault alert indicates that the InfiniBand bandwidth is degraded and the interface may be potentially lost. |
NodeIBPortMisconnected | The NodeIBPortMisconnected alert indicates that node-to-leaf ports are either missing or incorrectly connected. |
NodeIBSpeedUnexpected | The NodeIBSpeedUnexpected alert indicates that a node's InfiniBand network speed doesn't match the expected speed. When this alert triggers, it can be an early indicator of network issues. |
NodeIBSpeedUnexpectedIsolated | The NodeIBSpeedUnexpectedIsolated alert indicates that a node's InfiniBand network speed doesn't match the expected speed, and fewer than 5 nodes in the same leaf group are reporting this issue. |
GPUFallenOffBusHGX | GPU Fallen Off The Bus (Xid 79) indicates a fatal hardware error where the GPU shuts down and is completely inaccessible from the system. The node will immediately and automatically be taken out of service. |
DCGMThermalViolation | The DCGMThermalViolation alert is triggered when there is a non-zero rate of thermal violations detected by NVIDIA Data Center GPU Manager (DCGM). |
DCGMThrottleHWThermal | The GPU is being throttled for thermal reasons, usually related to some kind of cooling issue or, in rare cases, a broken temperature sensor. This can impact performance if it occurs over an extended period of time. |
ECCDoubleVolatileErrors | ECCDoubleVolatileErrors is an alert that indicates when DCGM double-bit volatile ECC (Error Correction Code) errors are increasing over a 5-minute period on a GPU. |
GPUContainedECCError | The GPUContainedECCError alert indicates that a GPU has experienced an XID Error 94, which is a Contained Uncorrectable ECC (Error Correction Code) error. |
GPUFaultHGX | A generic GPU Fault (Xid) has occurred. This may have an impact on jobs. The node will be reset when the current run ends. |
GPUMemoryTemperatureHigh | The GPUMemoryTemperatureHigh alert triggers when a GPU's memory temperature exceeds 86°C for 2 minutes. This can be caused by a thermal problem, or by firmware and driver bugs. The node will be reset after the current run ends. |
GPUMemoryTemperatureUnreal | The GPUMemoryTemperatureUnreal alert triggers when GPU memory temperature readings are not realistic or valid. This indicates a firmware bug or a broken temperature sensor. Because this can lead to GPU throttling, the node is immediately reset and the job terminated. |
GPUVeryHot | The GPU is significantly overheating. This indicates significant thermal problems or a sensor malfunction. |
NodePCIErrorH100PLX | The NodePCIErrorH100PLX alert indicates a high rate of PCIe bus errors occurring on the PLX switch that connects H100 GPUs. |
NodeRepeatUCE | The NodeRepeatUCE alert indicates that a node has experienced frequent GPU Uncorrectable ECC (UCE) errors. |
NodeVerificationFailureIllegalMemoryAccess | The NodeVerificationFailureIllegalMemoryAccess alert indicates that a GPU has experienced an illegal memory access during CoreWeave's HPC verification testing, specifically during a training test. |
NodeNetworkReceiveErrs | The NodeNetworkReceiveErrs alert indicates that a network interface has encountered receive errors exceeding a 1% threshold over a 2-minute period for 1 hour. |
KubeNodeNotReady | The KubeNodeNotReady alert indicates when a node's status condition is not Ready in a Kubernetes cluster. This alert can be an indicator of critical system health issues. |
KubeNodeNotReadyHGX | The KubeNodeNotReadyHGX alert indicates that a node has been unready or offline for more than 15 minutes. |
NodeCPUHZThrottle | The NodeCPUHZThrottle alert indicates that a node's CPU frequency has been throttled (running at a reduced speed) for at least 1 minute. This alert is specifically related to CPU performance, not GPU performance. |
NodeCPUHZThrottleLong | The NodeCPUHZThrottleLong alert indicates that a node's CPU frequency has been throttled below 201 MHz for at least 30 minutes. |
NodeMemoryError | The NodeMemoryError alert indicates that a node has one or more bad DIMM (memory) modules. |
DPUUnexpectedPuntedRoutes | The DPUUnexpectedPuntedRoutes alert indicates a failure in offloading, which can cause connectivity issues for the host. The node will be automatically reset to restore proper connectivity. |
ManyUCESingleBankH100 | The ManyUCESingleBankH100 alert triggers when there are two or more DRAM Uncorrectable Errors (UCEs) on the same row remapper bank of an H100 GPU. |
NodeLoadAverageHigh | The NodeLoadAverageHigh alert triggers when a node's load average exceeds 1000 for more than 15 minutes. |
NodeNVMEIOTimeout | The NodeNVMEIOTimeout alert indicates that a node is reporting NVMe I/O timeouts, which requires investigation of storage hardware and firmware for potential failures or updates. |
NodePCIErrorH100GPU | The NodePCIErrorH100GPU alert indicates when a GPU is experiencing PCI bus communication errors. |
NodeUncommandedReboot | The NodeUncommandedReboot alert indicates that a node has rebooted without being initiated by CoreWeave's Node Controller (CWNC) automation system. |
NvidiaDevicePluginPodNotReady | The NvidiaDevicePluginPodNotReady alert indicates that an nvidia-device-plugin pod has been unready for 30 minutes, which is considered an urgent scenario because it limits visibility into other issues. |
PendingStateExtendedTime | The PendingStateExtendedTime alert indicates that a node has been in a pending state for an extended period of time. This alert helps identify nodes that need to be removed from their current state but are stuck for an extended time. |
PendingStateExtendedTimeLowGpuUtil | The PendingStateExtendedTimeLowGpuUtil alert triggers when a node has been in a pending state for more than 10 days and has had less than 1% GPU utilization in the last hour. This alert helps indicate if a node needs to be removed from its current state but has been stuck for an extended time. |
UnknownNVMLErrorOnContainerStart | The UnknownNVMLErrorOnContainerStart alert typically indicates that a GPU has fallen off the bus or is experiencing hardware issues. |
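Most of these alerts follow a threshold-for-duration pattern: a condition must hold continuously for a stated window before the alert moves from PENDING to FIRING. The sketch below illustrates that pattern using the documented GPUMemoryTemperatureHigh threshold and made-up samples; the real rules are evaluated by CoreWeave's Mission Control, not by user code.

```python
"""Illustrative sketch of a threshold-for-duration alert such as
GPUMemoryTemperatureHigh: the condition (GPU memory temperature above
86 degrees C) must hold continuously for the full 2-minute window before
the alert fires. The samples and scrape interval are made up."""

THRESHOLD_C = 86        # documented memory-temperature threshold
FOR_DURATION_S = 120    # the alert fires only after 2 minutes above threshold
SAMPLE_INTERVAL_S = 15  # assumed metric scrape interval for this sketch

# (timestamp offset in seconds, GPU memory temperature in degrees C)
samples = [(i * SAMPLE_INTERVAL_S, temp) for i, temp in
           enumerate([84, 85, 87, 88, 89, 90, 91, 92, 93, 92, 91, 90, 89])]

breach_start = None  # time at which the temperature first crossed the threshold
firing_at = None
for ts, temp in samples:
    if temp > THRESHOLD_C:
        if breach_start is None:
            breach_start = ts
        if ts - breach_start >= FOR_DURATION_S:
            firing_at = ts
            break
    else:
        breach_start = None  # condition cleared; the pending timer resets

if firing_at is not None:
    print(f"Alert would be FIRING at t={firing_at}s")
else:
    print("Alert would still be PENDING (condition not held for the full window)")
```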
GPU Metrics
The GPU Metrics section displays detailed information related to hardware utilization. In this section, red lines correspond to higher values of the measured temperature or utilization, while green lines indicate a lower value or an idle state. Whether these values suggest "good" or "bad" performance depends on the expected behavior and resource utilization of the job.
Panel | Description |
---|---|
GPU Temperatures Running Jobs | Displays the temperature of the GPUs over time. Generally, an increase in temperature corresponds with a job run, indicating that the GPUs are busy. |
GPU Core Utilization Running Jobs | Displays the utilization of GPU cores over time. Red indicates high utilization. |
SM Utilization Running Jobs | The fraction of time at least one warp was active on a multiprocessor, averaged over all multiprocessors. Note that "active" does not necessarily mean a warp is actively computing; for instance, warps waiting on memory requests are considered active. The value represents an average over a time interval and is not an instantaneous value. A value of 0.8 or greater is necessary, but not sufficient, for effective use of the GPU. A value less than 0.5 likely indicates ineffective GPU usage. A sketch of this rule of thumb appears after this table. |
GPU Mem Copy Utilization Running Jobs | Displays the utilization of GPU memory. |
Tensor Core Utilization Running Jobs | Displays the utilization of Tensor cores over time. |
VRAM Usage | Plots the amount of VRAM used by the GPU over time. |
GPUs Temperature | Displays the temperatures of the GPUs over time. |
GPUs Power Usage | Plots the power usage of the GPUs, in Watts. |
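As noted in the SM Utilization row above, the 0.8 and 0.5 values are rules of thumb rather than hard limits. The sketch below simply encodes that guidance for interval-averaged SM activity values; the node names and numbers are illustrative only.

```python
"""Sketch of the rule of thumb quoted for the SM Utilization panel:
interval-averaged SM activity of 0.8 or above is necessary (but not
sufficient) for effective GPU use, while below 0.5 likely indicates
ineffective use. Sample values are illustrative."""

def classify_sm_activity(avg_sm_active: float) -> str:
    if avg_sm_active >= 0.8:
        return "meets the 0.8 bar (necessary, but not sufficient, for effective use)"
    if avg_sm_active < 0.5:
        return "likely ineffective GPU usage"
    return "borderline; inspect kernels and data-loading behavior"

for node, value in {"node-01": 0.91, "node-02": 0.62, "node-03": 0.34}.items():
    print(f"{node}: SM activity {value:.2f} -> {classify_sm_activity(value)}")
```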
GPU Metrics: Color coding
Color | Meaning |
---|---|
Red | High utilization. |
Orange-Yellow | Medium-low utilization. |
Green | Low utilization or idle. |
Black | No job running at this time. |
This example shows a high (red) value in the GPU Core Utilization panel, a medium (yellow) value in the GPU Temperatures panel, and low (green) values in the GPU Mem Copy Utilization panel. This indicates that the tracked Slurm job had high utilization of GPU Cores and low utilization of GPU memory, which may be expected for a small model size.
Comparing the fluctuations in GPU Temperature with the charts in the Filesystem section reveals that drops in the GPU temperature may correspond with spikes in NFS `write` operations.
Filesystem
The Filesystem section includes information about `read` and `write` operations on the Network File System (NFS) and local files.
Panel | Description |
---|---|
Local Max Disk IO Utilization (Min 1m) | The green line indicates write operations and the yellow line indicates read operations. |
Local Avg Bytes Read / Written Per Node (2m) | The red line indicates write operations and the blue line indicates read operations. |
Local Total bytes Read / Written (2m) | The red line indicates write operations and the blue line indicates read operations. |
Local Total Read / Write Rate (2m) | The red line indicates write operations and the blue line indicates read operations. |
NFS Average Request Time by Operation | Duration requests took from when a request was enqueued to when it was completely handled for a given operation, in seconds. The green line indicates write operations and the yellow line indicates read operations. |
NFS Avg Bytes Read / Written Per Node (2m) | The red line indicates write operations and the blue line indicates read operations. |
NFS Total Bytes Read / Written (2m) | The red line indicates write operations and the blue line indicates read operations. |
NFS Total Read / Write Rate | The red line indicates write operations and the blue line indicates read operations. |
NFS Average Response Time by Operation | Duration requests took to get a reply back after a request for a given operation was transmitted, in seconds. The green line indicates write operations and the yellow line indicates read operations. |
NFS Avg Write Rate Per Active Node (2m) | The red line indicates write operations and the dashed green line displays active nodes. Only includes nodes writing over 10 KB/s. |
NFS Avg Read Rate Per Active Node (2m) | The blue line indicates read operations and the dashed green line displays active nodes. Only includes nodes reading over 10 KB/s. |
NFS Nodes with Retransmissions | Retransmissions indicate packet loss on the network, either due to congestion or faulty equipment or cabling. |
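The NFS panels above are built from per-node counters. If you have shell access to a compute node, a rough node-level view of the same read/write byte counters can be taken from `/proc/self/mountstats`, as in the sketch below; the dashboard's own data source and exporter may differ.

```python
"""Sketch: read per-mount NFS byte counters, comparable to panels such as
"NFS Total Bytes Read / Written", directly on a Linux compute node via
/proc/self/mountstats. Which mounts appear, and which exporter the
dashboard actually scrapes, depend on your cluster."""

def nfs_bytes_per_mount(path: str = "/proc/self/mountstats") -> dict:
    stats, mount = {}, None
    with open(path) as f:
        for line in f:
            fields = line.split()
            if line.startswith("device "):
                # "device <server:/export> mounted on <mountpoint> with fstype nfs..."
                mount = fields[4] if "fstype nfs" in line else None
            elif mount and fields and fields[0] == "bytes:":
                # fields 5 and 6 after "bytes:" are the server read/write byte counters
                stats[mount] = {"read_bytes": int(fields[5]),
                                "write_bytes": int(fields[6])}
                mount = None
    return stats

for mountpoint, counters in nfs_bytes_per_mount().items():
    print(f"{mountpoint}: read {counters['read_bytes']:,} B, "
          f"write {counters['write_bytes']:,} B")
```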
Filesystem: NFS Average Response and Request
The NFS Average Response and Request graphs describe the performance of the filesystem. A slowdown or spike could indicate that the storage is too slow and that the job might perform better with faster storage or a different type of storage, such as object storage.
Filesystem: NFS Total Read / Write
The NFS Total Read / Write graphs typically display a large red spike when a job starts, as the model and data are read in. While the job runs, the graph shows smaller write spikes at regular intervals, which occur as the checkpoints are written out. Comparing these graphs with the panels in the GPU Metrics section may help to confirm that running jobs are behaving as expected.
Node Resources
The Node Resources section includes the CPU Allocation panel, which displays the total number of CPU cores utilized over the job runtime.