Custom GPU utilization alerts

You can create custom alert rules for conditions such as low GPU utilization in a self-hosted Grafana instance with CoreWeave Metrics configured as a data source. CoreWeave Grafana does not support custom alert rules. This page walks through two example alert rules that monitor GPU utilization using the DCGM_FI_DEV_GPU_UTIL metric from NVIDIA Data Center GPU Manager (DCGM).

Prerequisites

Before you create these alert rules, you need the following:

A self-hosted Grafana instance with the CoreWeave Metrics data source configured
A contact point to receive alert notifications

About the DCGM GPU utilization metric

DCGM_FI_DEV_GPU_UTIL is a gauge metric from the NVIDIA driver that measures overall GPU utilization as a percentage (0 to 100) per device. CoreWeave exposes this metric through the CoreWeave Metrics data source.

The examples on this page exclude activity in the cw-hpc-verification namespace. CoreWeave runs GPU verification tests in this namespace when GPUs are idle. These tests are preempted by actual workloads and do not represent real utilization.
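
The queries on this page express this exclusion with the label matcher namespace!~"cw-hpc-verification". As a rough illustration of how such a matcher selects series, the Python sketch below mimics a negative regex matcher. PromQL anchors regex matchers to the whole label value and treats a missing label as an empty string. The sample series and the helper name keep_series are hypothetical, not output from CoreWeave Metrics.

```python
import re

EXCLUDED_NAMESPACE = "cw-hpc-verification"


def keep_series(labels: dict) -> bool:
    """Mimic the matcher namespace!~"cw-hpc-verification"."""
    # A missing label behaves like an empty string in PromQL matchers.
    namespace = labels.get("namespace", "")
    # PromQL regex matchers are fully anchored, so use fullmatch.
    return re.fullmatch(EXCLUDED_NAMESPACE, namespace) is None


# Hypothetical series labels, for illustration only.
series = [
    {"node": "node-a", "gpu": "0", "namespace": "training"},
    {"node": "node-a", "gpu": "1", "namespace": "cw-hpc-verification"},
    {"node": "node-b", "gpu": "0"},  # no namespace label
]

kept = [s for s in series if keep_series(s)]
print([(s["node"], s["gpu"]) for s in kept])
```

In this sketch, the series without a namespace label passes the matcher, so it still counts toward the results.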

Alert on average cluster GPU utilization

This alert fires when the average GPU utilization across all GPUs in a cluster falls below a threshold for a sustained period. To create this alert rule:

In Grafana, navigate to Alerting > Alert rules, then select New alert rule.
Select Grafana-managed rule as the rule type.
In the query editor, switch to Code mode and enter the following PromQL expression. Replace [CLUSTER-NAME] with the name of your cluster.
avg( avg_over_time( DCGM_FI_DEV_GPU_UTIL{cluster="[CLUSTER-NAME]", namespace!~"cw-hpc-verification"}[24h] ) )
This expression first averages each GPU utilization series over a 24-hour window, then averages those values across the cluster.
Under Expressions, configure the threshold condition. For example, set the condition to fire when the value is less than 50.
Configure the evaluation interval and pending period to control how frequently the rule is evaluated and how long the condition must persist before the alert fires.
Select the contact point you created in the prerequisites to receive notifications.
Select Save rule and exit.

Adjust the threshold (50), the time window (24h), and the cluster name to match your requirements.
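
To make the arithmetic behind the expression concrete, the following Python sketch reproduces it on made-up samples. It averages each series over the window (the avg_over_time step), averages those per-series values (the outer avg), and compares the result with the threshold. It assumes one series per GPU. The sample values and the names cluster_average and should_fire are hypothetical; Grafana and the data source perform the real evaluation.

```python
from statistics import mean

THRESHOLD = 50.0  # same example threshold as in the steps above


def cluster_average(samples_by_gpu: dict) -> float:
    """Average of per-series window averages, like avg(avg_over_time(...))."""
    per_gpu = [mean(samples) for samples in samples_by_gpu.values()]
    return mean(per_gpu)


def should_fire(samples_by_gpu: dict, threshold: float = THRESHOLD) -> bool:
    """True when the cluster average is strictly below the threshold."""
    return cluster_average(samples_by_gpu) < threshold


# Hypothetical utilization samples (percent) keyed by (node, gpu).
samples = {
    ("node-a", "0"): [90.0, 80.0, 100.0],  # window average 90
    ("node-a", "1"): [0.0, 0.0, 0.0],      # window average 0
    ("node-b", "0"): [30.0, 30.0],         # window average 30
}

print(cluster_average(samples))  # 40.0
print(should_fire(samples))      # True, because 40 < 50
```

Each series carries equal weight in the outer average regardless of how many samples it has, which matches how the nested PromQL aggregation behaves.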

Alert on the proportion of idle GPUs

This alert does not track overall average utilization. Instead, it fires when more than a set proportion of the individual GPUs in a cluster are completely idle. The rule uses two queries and a Math expression:

In Grafana, navigate to Alerting > Alert rules, then select New alert rule.
Select Grafana-managed rule as the rule type.
In the query editor, add two queries in Code mode. Replace [CLUSTER-NAME] with the name of your cluster.
Query A counts the number of idle GPUs in the cluster, where a GPU is idle if its average utilization over the past 24 hours is zero. The sum by (node, gpu) clause isolates each individual GPU before the count:
count( sum by (node, gpu) ( avg_over_time( DCGM_FI_DEV_GPU_UTIL{cluster="[CLUSTER-NAME]", namespace!~"cw-hpc-verification"}[24h] ) ) == 0 )
Query B counts the total number of GPUs in the cluster:
count( sum by (node, gpu) ( DCGM_FI_DEV_GPU_UTIL{cluster="[CLUSTER-NAME]"} ) )
Add a Math expression that divides Query A by Query B to calculate the proportion of idle GPUs:
$A / $B
Under Expressions, set the threshold condition. For example, set the condition to fire when the math expression result is greater than 0.5 (meaning more than 50% of GPUs are idle).
Configure the evaluation interval and pending period.
Select the contact point you created in the prerequisites to receive notifications.
Select Save rule and exit.
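
As a rough model of how the two queries and the Math expression combine, the Python sketch below counts GPUs whose 24-hour average is zero, divides by the total GPU count, and compares the result with 0.5. The data and the function names are hypothetical. One PromQL behavior worth keeping in mind: count() over an empty result returns no series rather than 0, so Query A yields no data when no GPU is idle. The sketch models that case with None.

```python
from typing import Optional


def count_idle(avg_util_by_gpu: dict) -> Optional[int]:
    """Model of Query A: count GPUs whose window average equals 0."""
    idle = [gpu for gpu, avg in avg_util_by_gpu.items() if avg == 0]
    # PromQL count() over an empty result returns no series, not 0.
    return len(idle) if idle else None


def idle_proportion(avg_util_by_gpu: dict, total_gpus: int) -> Optional[float]:
    """Model of the Math expression $A / $B; None stands for no data."""
    idle = count_idle(avg_util_by_gpu)
    if idle is None or total_gpus == 0:
        return None
    return idle / total_gpus


def should_fire(proportion: Optional[float], threshold: float = 0.5) -> bool:
    """Fire only when a value exists and is strictly above the threshold."""
    return proportion is not None and proportion > threshold


# Hypothetical 24-hour averages keyed by (node, gpu).
averages = {
    ("node-a", "0"): 0.0,
    ("node-a", "1"): 0.0,
    ("node-b", "0"): 0.0,
    ("node-b", "1"): 72.5,
}

proportion = idle_proportion(averages, total_gpus=len(averages))
print(proportion)               # 0.75
print(should_fire(proportion))  # True, because 0.75 > 0.5
```

How Grafana treats the no-data case depends on the rule's no-data handling setting, so review that setting if the rule should stay in a normal state when no GPU is idle.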

Learn more

For more information, explore the following resources:

Create Grafana-managed alert rules in the Grafana documentation
PromQL documentation for building custom metric queries
CoreWeave Alerts for built-in alert integrations that do not require self-hosted Grafana