DCGM_FI_DEV_GPU_UTIL metric from NVIDIA Data Center GPU Manager (DCGM).
Prerequisites
Before you create these alert rules, you need the following:
- A self-hosted Grafana instance with the CoreWeave Metrics data source configured
- A contact point to receive alert notifications
About the DCGM GPU utilization metric
DCGM_FI_DEV_GPU_UTIL is a gauge metric from the NVIDIA driver that measures overall GPU utilization as a percentage (0 to 100) per device. CoreWeave exposes this metric through the CoreWeave Metrics data source.
The examples on this page exclude activity in the cw-hpc-verification namespace. CoreWeave runs GPU verification tests in this namespace when GPUs are idle. These tests are preempted by actual workloads and do not represent real utilization.
Alert on average cluster GPU utilization
This alert fires when the average GPU utilization across all GPUs in a cluster falls below a threshold for a sustained period. To create this alert rule:
- In Grafana, navigate to Alerting > Alert rules, then select New alert rule.
- Select Grafana-managed rule as the rule type.
- In the query editor, switch to Code mode and enter a PromQL expression that calculates the average GPU utilization across all GPUs in the cluster over a 24-hour window. In the expression, replace [CLUSTER-NAME] with the name of your cluster.
- Under Expressions, configure the threshold condition. For example, set the condition to fire when the value is less than 50.
- Configure the evaluation interval and pending period to control how frequently the rule is evaluated and how long the condition must persist before the alert fires.
- Select the contact point you created in the prerequisites to receive notifications.
- Select Save rule and exit.
Adjust the threshold (50), the time window (24h), and the cluster name to match your requirements.
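As a minimal sketch of the query step in the procedure above, the expression could look like the following. The label names cluster and namespace are assumptions that this page does not confirm; check which labels your CoreWeave Metrics data source exposes before you rely on it.

```promql
# Sketch only: the `cluster` and `namespace` label names are assumed.
# avg_over_time averages each series' utilization over the past 24 hours;
# the outer avg then averages those values across the cluster.
avg(
  avg_over_time(
    DCGM_FI_DEV_GPU_UTIL{cluster="[CLUSTER-NAME]", namespace!="cw-hpc-verification"}[24h]
  )
)
```

Because the metric is a percentage from 0 to 100, the result is on the same scale, so a threshold of 50 corresponds to 50% average utilization.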
Alert on the proportion of idle GPUs
Instead of tracking overall average utilization, this alert fires when a certain proportion of individual GPUs in a cluster are completely idle. This approach uses two queries and a math expression:
- In Grafana, navigate to Alerting > Alert rules, then select New alert rule.
- Select Grafana-managed rule as the rule type.
- In the query editor, add two queries in Code mode. Replace [CLUSTER-NAME] with the name of your cluster. Query A counts the number of idle GPUs in the cluster, where a GPU is idle if its average utilization over the past 24 hours is zero. The sum by (node, gpu) clause isolates each individual GPU before the count. Query B counts the total number of GPUs in the cluster.
- Add a Math expression that divides Query A by Query B to calculate the proportion of idle GPUs.
- Under Expressions, set the threshold condition. For example, set the condition to fire when the math expression result is greater than 0.5 (meaning more than 50% of GPUs are idle).
- Configure the evaluation interval and pending period.
- Select the contact point you created in the prerequisites to receive notifications.
- Select Save rule and exit.
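The two queries described in the procedure above might be sketched as follows. As before, the cluster and namespace label names are assumptions; the node and gpu labels come from the grouping clause described in the steps.

```promql
# Query A (sketch): count GPUs whose 24-hour average utilization is zero.
# sum by (node, gpu) collapses the series to one value per physical GPU,
# and the == 0 comparison keeps only the idle ones before counting.
count(
  sum by (node, gpu) (
    avg_over_time(
      DCGM_FI_DEV_GPU_UTIL{cluster="[CLUSTER-NAME]", namespace!="cw-hpc-verification"}[24h]
    )
  ) == 0
)

# Query B (sketch): count all GPUs in the cluster, one per (node, gpu) pair.
count(
  sum by (node, gpu) (
    DCGM_FI_DEV_GPU_UTIL{cluster="[CLUSTER-NAME]"}
  )
)
```

With these two queries in place, the Math expression is $A / $B. Note that when no GPU is idle, a query shaped like Query A returns no data instead of 0; appending or vector(0) to it is one common way to handle that case.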
Learn more
For more information, explore the following resources:
- Create Grafana-managed alert rules in the Grafana documentation
- PromQL documentation for building custom metric queries
- CoreWeave Alerts for built-in alert integrations that do not require self-hosted Grafana