Monitor Slurm jobs with CoreWeave Grafana
The following sections describe how to open Grafana and locate the dashboard used to inspect Slurm job activity. To open CoreWeave Grafana, first log in to the CoreWeave Cloud Console. From the Console, either navigate to Grafana from the left-hand navigation or access it directly at https://cks-grafana.coreweave.com. In Grafana, select Dashboards, then select the Slurm / Job metrics dashboard.
Slurm job metrics dashboard

The Job State Timeline in the lower left-hand corner shows that the job ran for over one hour.

The timeline shows each job state (Pending, Running, Complete, and so on) chronologically. This information can be useful for debugging and performance analysis.
On the right, you can see two key metrics:
- Job Efficiency captures an estimate of GPU usage during the job’s run.
- Current FP8 FLOPS shows the compute rate as the job runs. It is common to see a regular pattern of peaks and valleys as the job runs. Typically, areas with less compute usage appear when the job loads data or saves checkpoints.
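
The peaks and valleys described above can also be located programmatically once the compute-rate series is exported. The following is a minimal sketch, not part of the dashboard: the function name `find_low_compute_windows`, the 50% threshold, and the sample values are all hypothetical and chosen only for illustration. It flags stretches where the compute rate drops below a fraction of the observed peak, which is where data loading or checkpoint saves would typically appear.

```python
from typing import List, Tuple


def find_low_compute_windows(
    samples: List[Tuple[float, float]],
    threshold_fraction: float = 0.5,
) -> List[Tuple[float, float]]:
    """Return (start, end) timestamps of consecutive samples whose rate
    falls below threshold_fraction * peak rate.

    `samples` is a list of (timestamp_seconds, compute_rate) pairs in
    chronological order. The threshold is an illustrative assumption.
    """
    if not samples:
        return []

    peak = max(rate for _, rate in samples)
    cutoff = peak * threshold_fraction

    windows = []
    start = None      # timestamp where the current low window began
    last_low = None   # most recent timestamp seen below the cutoff
    for ts, rate in samples:
        if rate < cutoff:
            if start is None:
                start = ts
            last_low = ts
        elif start is not None:
            # Rate recovered: close the current low window.
            windows.append((start, last_low))
            start = None
    if start is not None:
        # Series ended while still in a low window.
        windows.append((start, last_low))
    return windows


# Synthetic example data (not real dashboard output): one sample per minute.
samples = [
    (0, 90.0), (60, 95.0), (120, 10.0), (180, 12.0),
    (240, 92.0), (300, 94.0), (360, 8.0), (420, 91.0),
]
low_windows = find_low_compute_windows(samples)
print(low_windows)
```

Comparing the flagged windows against the job's known data-loading and checkpoint schedule is one way to check whether the valleys are expected.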

GPU metrics
Further down the dashboard, GPU-specific panels help you confirm that the hardware behaves as expected during the run. At a high level, red indicates that a given aspect of the GPU is working hard, whereas green indicates that it is under lighter load. This example runs a 124M parameter GPT-2 test model on two L40 nodes, without InfiniBand.
- The GPU temperature rises (progressing from a green average into an orange average) while the job runs. This is a clear indication that the GPUs are busy.
- The GPU Core Utilization is high, while GPU Mem usage is lower, which is an expected outcome for a small model.

Filesystem metrics
Storage performance can affect training throughput, so the dashboard also surfaces filesystem activity for the job. Two important metrics, displayed on the bottom left-hand side of the dashboard, are NFS Average Response/Request and NFS Total Read/Write Rate.
- The NFS Average graphs indicate how well the filesystem performs. A spike indicates that the storage is slowing down, and that the job may perform better with faster or different storage, such as CoreWeave AI Object Storage.
- The NFS Total Read / Write Rate shows the total read and write operations on the filesystem. When a job first begins, it is typical to see a large read spike as the job reads in the model and data. While the job runs, a regular write spike is typical. This indicates when the job writes out checkpoints.
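
If the write-rate series is exported from the panel, the regular checkpoint pattern can be checked with a few lines of code. This is a rough sketch under stated assumptions: the helper names `find_spikes` and `spike_intervals`, the factor of 3 over the median, and the sample values are hypothetical and are not part of Grafana or the dashboard.

```python
from statistics import median
from typing import List, Tuple


def find_spikes(
    samples: List[Tuple[float, float]],
    factor: float = 3.0,
) -> List[float]:
    """Return timestamps whose rate exceeds factor * median rate.

    The median serves as a simple baseline because most samples fall
    between checkpoints. The factor is an illustrative assumption.
    A spike that spans several samples is reported once per sample.
    """
    if not samples:
        return []
    baseline = median(rate for _, rate in samples)
    return [ts for ts, rate in samples if rate > factor * baseline]


def spike_intervals(spike_timestamps: List[float]) -> List[float]:
    """Return the gaps, in seconds, between consecutive spikes."""
    return [b - a for a, b in zip(spike_timestamps, spike_timestamps[1:])]


# Synthetic write rates in MB/s, one sample per minute (not real data).
write_samples = [
    (0, 5), (60, 6), (120, 5), (180, 400), (240, 5),
    (300, 6), (360, 5), (420, 6), (480, 410), (540, 5),
    (600, 5), (660, 6), (720, 5), (780, 395), (840, 5),
]
spikes = find_spikes(write_samples)
intervals = spike_intervals(spikes)
print(spikes, intervals)
```

Evenly spaced intervals are consistent with periodic checkpoint writes. Irregular gaps may be worth comparing against the job's configured checkpoint schedule.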