Alerts and Monitoring
Configure email alerts based on Prometheus metrics
From the notifications tab on the User Settings page, organization members can configure whether to receive emails for specific critical status alerts.
To receive email alerts for a specific metric, check the box beside the alert name. To stop receiving emails for that alert, uncheck the box.
At this time, turning on an email alert enables the selected alert for all namespaces within the organization; alerts cannot currently be limited to individual namespaces.
The following alerts may be turned on for email notifications:
Name | Metric title | Meaning
---|---|---
Pod Crash Looping | KubePodCrashLooping | A Pod is stuck in the `CrashLoopBackOff` state.
Job Failed | KubeJobFailed | A scheduled job has failed.
Quota Almost Full | KubeQuotaAlmostFull | A Kubernetes quota is at 97% of its allocated resource. Adjust resource usage, or request a quota increase.
Quota Fully Used | KubeQuotaFullyUsed | A Kubernetes quota is at its limit; 100% of the resource is allocated, and Pod creation will be rejected. Adjust resource usage, or request a quota increase.
S3 Object Storage Quota Has Reached 90% | ObjectStorageQuotaLimitReached90Warning | An Object Storage quota is almost at its limit; 90% or more of the quota is allocated.
Persistent Volume Filling Up | KubePersistentVolumeFillingUp | A Persistent Volume Claim (PVC) is at 97% capacity. Adjust usage, or adjust the PVC manifest. (Learn more about Storage.)
Container Out of Memory | KubeContainerOomKiller | A container has reached its resource limit or the node's maximum and has been killed by Kubernetes.
Image Pull Error | KubeImagePullError | A Pod container cannot pull its image and is stuck in either the `ErrImagePull` or `ImagePullBackOff` status.
Spend Threshold Exceeded | SpendThresholdExceeded | Compares the balance of the Chargify subscription associated with the client's organization against a configured USD ($) threshold, and alerts when any single namespace exceeds that threshold. See below for how the threshold applies to namespaces.
The spend threshold can be adjusted by entering a USD value in the threshold field on this alerts page.
Namespace spending is not aggregated: the threshold set here applies to every namespace in the organization, but each namespace is evaluated against it individually. As an arbitrary example, if the threshold is set to $20 and one namespace reaches $5 while another reaches $15, the alert will not fire. If, however, one namespace reaches $5 and the other reaches $20, the alert will fire for the namespace that exceeded the threshold.
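To make the per-namespace evaluation concrete, here is a minimal sketch in Go. The namespace names, the $20 threshold, and the decision to treat reaching the threshold as exceeding it are illustrative assumptions taken from the example above, not CoreWeave's actual billing logic:

```go
package main

import "fmt"

func main() {
	// Hypothetical threshold, matching the $20 example above.
	const thresholdUSD = 20.0

	// Hypothetical per-namespace spend; namespace names are made up.
	spendByNamespace := map[string]float64{
		"team-a": 5.0,
		"team-b": 20.0,
	}

	// Spend is never summed across namespaces: each namespace is
	// compared against the same threshold on its own.
	for ns, spend := range spendByNamespace {
		if spend >= thresholdUSD {
			fmt.Printf("SpendThresholdExceeded fires for namespace %q ($%.2f >= $%.2f)\n",
				ns, spend, thresholdUSD)
		}
	}
}
```

With these figures, only `team-b` triggers the alert; the combined $25 of organization-wide spend is never compared against the threshold.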
- CoreWeave Cloud organizations created before November 14, 2023 will need to explicitly turn on any desired email notifications.
- Organizations created after November 14, 2023 have all email notifications enabled by default. If specific alerts are not desired, they will need to be manually turned off from the User Settings page.
How alerts are generated
Alerts are given a unique fingerprint generated from the `labels` present on the Prometheus alert. In the example below, the fingerprint hash `07262e46c7ff7cfb` is generated from the `labels` stanza that precedes it in the alert body.
Example alert body

```json
{
  "status": "resolved",
  "labels": {
    "alertname": "TestAlert",
    "cluster": "ord1-tenant",
    "cluster_org": "coreweave",
    "container": "kubernetes-ingress-controller",
    "endpoint": "1024",
    "instance": "10.240.77.25:1024",
    "job": "cloud-app-kubernetes-ingress",
    "namespace": "cloud",
    "pod": "cloud-app-kubernetes-ingress",
    "prometheus": "metrics/metrics-prometheus-operato-prometheus",
    "prometheus_replica": "prometheus-metrics-prometheus-operato-prometheus-0",
    "service": "cloud-app-kubernetes-ingress",
    "severity": "critical",
    "team": "customer"
  },
  "annotations": {
    "message": "Test Message!"
  },
  "startsAt": "2023-10-03T19:41:11.462Z",
  "endsAt": "2023-10-03T20:28:11.462Z",
  "generatorURL": "/graph?g0.expr=up+%3D%3D+1&g0.tab=1",
  "fingerprint": "07262e46c7ff7cfb"
}
```
If a new attempt to create a Pod or run a scheduled job also fails, the `labels` on that alert will be different, so the new attempt receives a new unique fingerprint and triggers a new alert email. Otherwise, each event sends only one alert to the user.
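As a rough illustration of how a label set maps to a fingerprint, the Go sketch below hashes the sorted label names and values. It is not Alertmanager's exact algorithm (so it will not reproduce `07262e46c7ff7cfb`), and the Pod names are made up, but it shows why two alerts with identical labels share one fingerprint while a retry on a different Pod gets a new one:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// fingerprint returns a deterministic hash of a label set.
// Illustrative stand-in only; Alertmanager's real algorithm differs.
func fingerprint(labels map[string]string) string {
	// Sort label names so the result does not depend on map order.
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names)

	h := fnv.New64a()
	for _, name := range names {
		h.Write([]byte(name))
		h.Write([]byte{0xff}) // separator between names and values
		h.Write([]byte(labels[name]))
		h.Write([]byte{0xff})
	}
	return fmt.Sprintf("%016x", h.Sum64())
}

func main() {
	first := map[string]string{
		"alertname": "KubePodCrashLooping",
		"namespace": "cloud",
		"pod":       "my-app-7d4b9c6f5-abcde", // hypothetical Pod name
	}
	// A retry lands on a new Pod, so the "pod" label differs.
	second := map[string]string{
		"alertname": "KubePodCrashLooping",
		"namespace": "cloud",
		"pod":       "my-app-7d4b9c6f5-fghij", // hypothetical Pod name
	}

	// Different label sets produce different fingerprints,
	// so the retry triggers a separate alert email.
	fmt.Println(fingerprint(first))
	fmt.Println(fingerprint(second))
}
```

Because the fingerprint is derived only from the labels, a repeat of the same event with an identical label set reuses the same fingerprint and does not generate a second email.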