Slurm / Job Metrics

View detailed metrics about a Slurm job with Grafana

The Slurm / Job Metrics dashboard displays detailed information about the performance and hardware utilization of a selected Slurm job.

Customers can use this dashboard to:

  • Monitor the GPU and CPU utilization of a given Slurm job
  • Track the rate of filesystem operations related to the Slurm job
  • View node conditions and alerts that may impact the performance of the Slurm job

Prerequisites

You must be a member of the admin, metrics, or write groups to access Grafana dashboards.

Open the dashboard

  1. Log in to the CoreWeave Cloud Console.
  2. In the left navigation, select Grafana to launch your Managed Grafana instance.
  3. Click Dashboards.
  4. Expand SUNK, then choose Slurm / Job Metrics.
  5. Select the appropriate filters and parameters at the top of the page.

If you are already logged in to CoreWeave Cloud Console, you can open the Slurm / Job Metrics dashboard directly from this link.

Filters and parameters

Use these filters at the top of the page to choose the data you want to view:

  • Data Source: The Prometheus data source selector.
  • Logs Source: The Loki logs source selector.
  • Org: The organization that owns the cluster.
  • Cluster: The specific Kubernetes cluster in the organization to view.
  • Namespace: The Kubernetes namespace where the Slurm cluster is located.
  • Slurm Cluster: The Slurm cluster containing the job to view.
  • Job Name: The name of the Slurm job to view. Select "All" to view all the Slurm jobs in the Slurm cluster.
  • Job ID: The Slurm job ID to view. Select "All" to view all the Slurm jobs in the Slurm cluster.
  • Node Conditions: Toggled on by default. Click to disable.
  • Pending Node Alerts: Toggled on by default. Click to disable.
  • Firing Node Alerts: Toggled off by default. Click to enable.
  • InfiniBand Fabric Flaps: Toggled off by default. Click to enable.

The dashboard also includes buttons with links to the Slurm / Namespace dashboard and slurmctld logs.

Set the time range and refresh rate parameters at the top-right of the page. The default time range is 5 minutes, and the default refresh rate is 1 minute.
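
These filters typically map to label selectors in the dashboard's underlying Prometheus queries. As a rough illustration (not the dashboard's actual query), the sketch below fetches a GPU utilization metric for a single Slurm job directly from the Prometheus HTTP API over the same default 5-minute window. The endpoint URL, the DCGM_FI_DEV_GPU_UTIL metric name, and especially the hypothetical hpc_job label are assumptions that will differ by environment.

```python
import time
import requests

# Assumptions (adjust for your environment): the Prometheus endpoint, the
# DCGM exporter metric name, and the hypothetical "hpc_job" label carrying
# the Slurm job ID are all illustrative placeholders.
PROM_URL = "http://prometheus.example.internal:9090"
QUERY = 'avg(DCGM_FI_DEV_GPU_UTIL{hpc_job="12345"})'

end = int(time.time())
start = end - 5 * 60  # mirror the dashboard's default 5-minute time range

resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={"query": QUERY, "start": start, "end": end, "step": "15s"},
    timeout=10,
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    for ts, value in series["values"]:
        print(ts, value)
```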

Panel descriptions

Job Info

The Job Info section displays identifying information about the selected Slurm job, including:

  • Job Id: The Slurm job ID.
  • Name: The name of the Slurm job.
  • Last State: The most recently reported state of the Slurm job.
  • User: The user who created the Slurm job.
  • Account: The user account running the Slurm job.
  • Nodes: The number of nodes allocated for the Slurm job.
  • Partition: The partition where the selected job is running.
  • Job State Timeline: A chart that shows the Slurm job's states over time.
  • Uptime: The Slurm job uptime in seconds.
  • Active GPUs: The number of GPUs allocated to the Slurm job that are currently running.
  • Job Efficiency: Indicates how active the GPUs were while working on the selected job. This value is estimated based on idle time, defined as a node with at least one GPU under 50% utilization. The estimate excludes restarts and checkpointing. This is not a Model FLOPS Utilization (MFU) metric. See the sketch after this list for an illustration of the idea.
  • Current FP8 Flops: Graphs the compute rate over the job run. The graph typically displays peaks and valleys in a regular pattern; points with lower compute values often correspond with loading data or saving checkpoints. This number assumes FP8 training.
  • Node conditions: Lists nodes that have been assigned a condition, along with timestamps indicating when the condition was noticed and addressed.
  • Alerts: Displays all alerts with a severity level higher than Informational over a 10-minute interval. Alerts may have a PENDING or FIRING state.
  • Nodes (Range): Lists all nodes along with their state, uptime, and a link to their associated slurmd logs.
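
The Job Efficiency estimate described above is based on idle time, where a node counts as idle whenever at least one of its GPUs is under 50% utilization. The following sketch is only a minimal illustration of that idea, not the dashboard's actual calculation; the sample data and the shape of the samples structure are assumptions, and the real estimate also excludes restarts and checkpointing, which this sketch does not attempt.

```python
# Minimal sketch of a Job Efficiency-style estimate (illustrative only).
# Assumption: `samples` maps node name -> per-interval GPU utilization
# readings, where each reading lists the utilization (%) of every GPU.
samples = {
    "node-01": [[98, 97, 99, 96], [12, 95, 97, 94], [97, 98, 96, 95]],
    "node-02": [[95, 96, 94, 97], [96, 93, 95, 94], [97, 95, 96, 98]],
}

IDLE_THRESHOLD = 50  # a node is idle if any of its GPUs is under 50% util

total_intervals = 0
idle_intervals = 0
for node, intervals in samples.items():
    for gpu_utils in intervals:
        total_intervals += 1
        if min(gpu_utils) < IDLE_THRESHOLD:
            idle_intervals += 1

efficiency = 1 - idle_intervals / total_intervals
print(f"Estimated job efficiency: {efficiency:.0%}")  # 83% for this sample
```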

Job Info: Job State Timeline and Last State

The Job Info section contains panels that display information about the state of the selected Slurm job. Last State displays the most recent reported state, while the Job State Timeline displays the job's status over time.

A Slurm job can be in the following states:

  • RUNNING: Actively executing on allocated resources.
  • PENDING: Queued to run when resources are available.
  • CANCELLED: Cancelled by a user.
  • COMPLETING: Finished or cancelled, and currently performing cleanup tasks, such as an epilog script.
  • PREEMPTED: Preempted by another job.
  • COMPLETED: Successfully completed execution.

GPU Metrics

The GPU Metrics section displays detailed information related to hardware utilization. In this section, red lines correspond with higher temperature or utilization of the measured value, while green lines indicate a lower value or idle state. Whether these values suggest "good" or "bad" performance depends on the expected behavior and resource utilization of the job.

  • GPU Temperatures Running Jobs: Displays the temperature of the GPUs over time. Generally, an increase in temperature corresponds with a job run, indicating that the GPUs are busy.
  • GPU Core Utilization Running Jobs: Displays the utilization of GPU cores over time. Red indicates high utilization.
  • SM Utilization Running Jobs: The fraction of time at least one warp was active on a multiprocessor, averaged over all multiprocessors. Note that "active" does not necessarily mean a warp is actively computing; for instance, warps waiting on memory requests are considered active. The value represents an average over a time interval and is not an instantaneous value. A value of 0.8 or greater is necessary, but not sufficient, for effective use of the GPU. A value less than 0.5 likely indicates ineffective GPU usage. See the sketch after this list for how these thresholds can be applied.
  • GPU Mem Copy Utilization Running Jobs: Displays the utilization of GPU memory.
  • Tensor Core Utilization Running Jobs: Displays the utilization of Tensor Cores over time.
  • VRAM Usage: Plots the amount of VRAM used by the GPU over time.
  • GPUs Temperature: Displays the temperatures of the GPUs over time.
  • GPUs Power Usage: Plots the power usage of the GPUs, in watts.
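
The 0.8 and 0.5 thresholds mentioned for SM Utilization can also be applied when reviewing exported data outside of Grafana. The sketch below is illustrative only; it assumes you already have interval-averaged SM activity values (0.0 to 1.0) per GPU, such as those typically exposed by the DCGM exporter's DCGM_FI_PROF_SM_ACTIVE metric, and the sample values are made up.

```python
# Interpret interval-averaged SM activity (fraction of time at least one
# warp was active), using the thresholds described for the panel above.
# Sample values are made up for illustration.
sm_activity = {
    "gpu0": 0.91,
    "gpu1": 0.84,
    "gpu2": 0.47,
    "gpu3": 0.62,
}

for gpu, value in sm_activity.items():
    if value >= 0.8:
        verdict = "meets the necessary level for effective use (not sufficient on its own)"
    elif value < 0.5:
        verdict = "likely ineffective GPU usage"
    else:
        verdict = "borderline; check the other GPU Metrics panels"
    print(f"{gpu}: SM active {value:.2f} -> {verdict}")
```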

GPU Metrics: Color coding

  • Red: High utilization.
  • Orange-Yellow: Medium-low utilization.
  • Green: Low utilization or idle.
  • Black: No job running at this time.

For example, a high (red) value in the GPU Core Utilization panel, a medium (yellow) value in the GPU Temperatures panel, and low (green) values in the GPU Mem Copy Utilization panel indicate that the tracked Slurm job had high utilization of GPU cores and low utilization of GPU memory, which may be expected for a small model size.

Comparing the fluctuations in GPU Temperature with the charts in the Filesystem section reveals that drops in the GPU temperature may correspond with spikes in NFS write operations.
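
One way to check this kind of correspondence outside of Grafana is to correlate the two exported series directly. The sketch below is a minimal example using NumPy; the sample values are made up, and in practice you would first export and timestamp-align the GPU temperature and NFS write-rate series for the job.

```python
import numpy as np

# Made-up, time-aligned samples: GPU temperature (deg C) and NFS write
# rate (MB/s) over the same intervals. A strongly negative correlation
# suggests temperature drops line up with write spikes (e.g. checkpoints).
gpu_temp = np.array([72, 71, 64, 73, 72, 63, 74, 72])
nfs_write_mb_s = np.array([20, 25, 310, 18, 22, 295, 15, 21])

corr = np.corrcoef(gpu_temp, nfs_write_mb_s)[0, 1]
print(f"Correlation between GPU temperature and NFS write rate: {corr:.2f}")
```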

Filesystem

The Filesystem section includes information about read and write operations on the Network File System (NFS) and local files.

  • Local Max Disk IO Utilization (Min 1m): The green line indicates write operations and the yellow line indicates read operations.
  • Local Avg Bytes Read / Written Per Node (2m): The red line indicates write operations and the blue line indicates read operations.
  • Local Total Bytes Read / Written (2m): The red line indicates write operations and the blue line indicates read operations.
  • Local Total Read / Write Rate (2m): The red line indicates write operations and the blue line indicates read operations.
  • NFS Average Request Time by Operation: The time from when a request was enqueued to when it was completely handled, per operation, in seconds. The green line indicates write operations and the yellow line indicates read operations.
  • NFS Avg Bytes Read / Written Per Node (2m): The red line indicates write operations and the blue line indicates read operations.
  • NFS Total Bytes Read / Written (2m): The red line indicates write operations and the blue line indicates read operations.
  • NFS Total Read / Write Rate: The red line indicates write operations and the blue line indicates read operations.
  • NFS Average Response Time by Operation: The time from when a request was transmitted to when a reply was received, per operation, in seconds. The green line indicates write operations and the yellow line indicates read operations.
  • NFS Avg Write Rate Per Active Node (2m): The red line indicates write operations and the dashed green line displays active nodes. Only includes nodes writing over 10 KB/s; see the sketch after this list.
  • NFS Avg Read Rate Per Active Node (2m): The blue line indicates read operations and the dashed green line displays active nodes. Only includes nodes reading over 10 KB/s.
  • NFS Nodes with Retransmissions: Retransmissions indicate packet loss on the network, either due to congestion or faulty equipment or cabling.
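
The "Per Active Node" panels only average over nodes above the 10 KB/s threshold, so idle nodes do not drag the average down. The sketch below illustrates that calculation on made-up per-node write rates; it is not the dashboard's actual query.

```python
# Made-up per-node NFS write rates (bytes/s) over one 2-minute window.
write_rates = {
    "node-01": 48_000_000,
    "node-02": 52_500_000,
    "node-03": 2_000,        # below the threshold: treated as inactive
    "node-04": 51_000_000,
}

ACTIVE_THRESHOLD = 10 * 1024  # 10 KB/s, matching the panel's cutoff

active = {node: rate for node, rate in write_rates.items() if rate > ACTIVE_THRESHOLD}
avg_rate = sum(active.values()) / len(active) if active else 0.0

print(f"Active nodes: {len(active)} of {len(write_rates)}")
print(f"Avg write rate per active node: {avg_rate / 1e6:.1f} MB/s")
```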

Filesystem: NFS Average Response and Request

The NFS Average Response and Request graphs describe the performance of the filesystem. A slowdown or spike could indicate that the storage is too slow and that the job might perform better with faster storage or a different storage type, such as object storage.

Filesystem: NFS Total Read / Write

The NFS Total Read / Write graphs typically display a large red spike when a job starts, as the model and data are read in. While the job runs, the graph shows smaller write spikes at regular intervals, which occur as the checkpoints are written out. Comparing these graphs with the panels in the GPU Metrics section may help to confirm that running jobs are behaving as expected.

Node Resources

The Node Resources section includes the CPU Allocation panel, which displays the total number of CPU cores utilized over the job runtime.