Alerts and Monitoring

Configure email alerts based on Prometheus metrics

From the Notifications tab on the User Settings page, organization members can configure whether to receive emails for specific critical status alerts.

To receive email alerts for a specific metric, check the box beside the alert name. To stop receiving emails for a certain alert, uncheck the box beside the alert name.

important

At this time, turning on an email alert enables it for all namespaces within the organization; alerts cannot currently be limited to a specific namespace.

The following alerts may be turned on for email notifications:

Name | Metric title | Meaning
Pod Crash Looping | KubePodCrashLooping | A Pod is stuck in the CrashLoopBackOff state.
Job Failed | KubeJobFailed | A scheduled job has failed.
Quota Almost Full | KubeQuotaAlmostFull | A Kubernetes quota is at 97% of its allocated resource. Adjust resource usage, or request a quota increase.
Quota Fully Used | KubeQuotaFullyUsed | A Kubernetes quota is at its limit; 100% of the resource is allocated. Adjust resource usage, or request a quota increase. Pod creation will be rejected.
S3 Object Storage Quota Has Reached 90% | ObjectStorageQuotaLimitReached90Warning | An Object Storage quota is almost at its limit; 90% or more of the quota is allocated.
Persistent Volume Filling Up | KubePersistentVolumeFillingUp | A Persistent Volume (PVC) is at 97% capacity. Adjust usage, or adjust the PVC manifest. (Learn more about Storage.)
Container Out of Memory | KubeContainerOomKiller | A container has reached its resource limit or the node maximum, and has been killed by Kubernetes.
Image Pull Error | KubeImagePullError | A Pod container cannot pull its image, and is stuck in either the ErrImagePull or ImagePullBackOff status.
Spend Threshold Exceeded | SpendThresholdExceeded | Compares the balance of the Chargify subscription associated with the organization against a configured USD ($) threshold; alerts when any single namespace exceeds that threshold. Namespace usage is not aggregated (see the important note below).

The spend threshold can be adjusted by setting the USD value explicitly in the threshold field on the alerts page.

important

Namespace spending is not aggregated; the threshold set here applies to all namespaces within the organization, but it is evaluated for each namespace individually. As an arbitrary example, if the threshold is set to $20 and one namespace reaches $5 while another reaches $15, the alert will not fire. If, instead, one namespace reaches $5 and another reaches $20, the alert fires for the namespace that reached the threshold. The sketch below illustrates this per-namespace evaluation.
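
The following Python sketch is illustrative only; the namespace names, spend figures, and the >= comparison are assumptions chosen to mirror the example above, not CoreWeave's implementation.

THRESHOLD_USD = 20.0  # hypothetical spend threshold configured on the alerts page

# Hypothetical per-namespace spend in USD; values mirror the example above.
namespace_spend = {
    "tenant-team-a": 5.0,
    "tenant-team-b": 20.0,
}

for namespace, spend in namespace_spend.items():
    # Each namespace is checked on its own; spend is never summed across namespaces.
    if spend >= THRESHOLD_USD:
        print(f"SpendThresholdExceeded fires for namespace '{namespace}' (${spend:.2f})")
    else:
        print(f"No alert for namespace '{namespace}' (${spend:.2f})")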

note
  • CoreWeave Cloud organizations created before November 14, 2023 will need to explicitly turn on any desired email notifications.
  • Organizations created after November 14, 2023 have all email notifications enabled by default. If specific alerts are not desired, they will need to be manually turned off from the User Settings page.

How alerts are generated

Alerts are given a unique fingerprint generated from the labels present on the Prometheus alert. In the example below, the fingerprint hash 07262e46c7ff7cfb is generated from the contents of the labels stanza.

Example alert body

{
  "status": "resolved",
  "labels": {
    "alertname": "TestAlert",
    "cluster": "ord1-tenant",
    "cluster_org": "coreweave",
    "container": "kubernetes-ingress-controller",
    "endpoint": "1024",
    "instance": "10.240.77.25:1024",
    "job": "cloud-app-kubernetes-ingress",
    "namespace": "cloud",
    "pod": "cloud-app-kubernetes-ingress",
    "prometheus": "metrics/metrics-prometheus-operato-prometheus",
    "prometheus_replica": "prometheus-metrics-prometheus-operato-prometheus-0",
    "service": "cloud-app-kubernetes-ingress",
    "severity": "critical",
    "team": "customer"
  },
  "annotations": {"message": "Test Message!"},
  "startsAt": "2023-10-03T19:41:11.462Z",
  "endsAt": "2023-10-03T20:28:11.462Z",
  "generatorURL": "/graph?g0.expr=up+%3D%3D+1&g0.tab=1",
  "fingerprint": "07262e46c7ff7cfb"
}

If a new attempt to create a Pod or run a scheduled job also fails, the labels on the resulting alert will differ, so the alert for the new attempt receives a new unique fingerprint and triggers a new email notification. Otherwise, each event sends only one alert to the user.
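
As a rough illustration of this behavior, the Python sketch below derives a 64-bit fingerprint from a label set by hashing the sorted label names and values with an FNV-1a-style hash. This is not CoreWeave's or Alertmanager's exact implementation, and the label values shown are hypothetical; it only demonstrates why an identical label set always maps to the same fingerprint, while any label change produces a new one.

# Illustrative only: the exact byte layout used by Prometheus/Alertmanager differs,
# but the principle is the same -- hash the sorted label names and values.
FNV64_OFFSET = 0xcbf29ce484222325
FNV64_PRIME = 0x100000001b3

def fnv1a_64(data: bytes) -> int:
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h

def fingerprint(labels: dict[str, str]) -> str:
    # Sort by label name so the same label set always hashes identically.
    flat = "\x00".join(f"{name}\x00{labels[name]}" for name in sorted(labels))
    return format(fnv1a_64(flat.encode("utf-8")), "016x")

# Hypothetical label set for a crash-looping Pod.
labels = {
    "alertname": "KubePodCrashLooping",
    "namespace": "tenant-example",
    "pod": "my-app-7d4b9c",
    "severity": "critical",
}
print(fingerprint(labels))   # same labels -> same fingerprint
labels["pod"] = "my-app-8f5c1d"
print(fingerprint(labels))   # a new Pod name yields a different fingerprint, so a new alert is sent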