Alerts and Monitoring

Configure email alerts based on Prometheus metrics
From the notifications tab on the User Settings page, organization members can configure whether to receive emails for specific critical status alerts.
To receive email alerts for a specific metric, check the box beside the alert name. To stop receiving emails for an alert, uncheck the box beside its name.
Important
At this time, turning on an email alert enables the selected alert for all namespaces within the organization; alerts cannot currently be limited by namespace.
The following alerts may be turned on for email notifications:
| Name | Metric title | Meaning |
| --- | --- | --- |
| Pod Crash Looping | KubePodCrashLooping | A Pod is stuck in the CrashLoopBackOff state. |
| Job Failed | KubeJobFailed | A scheduled Job has failed. |
| Quota Almost Full | KubeQuotaAlmostFull | A Kubernetes quota is at 97% of its allocated resource. Adjust resource usage, or request a quota increase. |
| Quota Fully Used | KubeQuotaFullyUsed | A Kubernetes quota is at its limit; 100% of the resource is allocated. Adjust resource usage, or request a quota increase. Pod creation will be rejected. |
| S3 Object Storage Quota Has Reached 90% | ObjectStorageQuotaLimitReached90Warning | An Object Storage quota is almost at its limit; 90% or more of the quota is allocated. |
| Persistent Volume Filling Up | KubePersistentVolumeFillingUp | A Persistent Volume Claim (PVC) is at 97% capacity. Adjust usage, or adjust the PVC manifest. (Learn more about Storage.) |
| Container Out of Memory | KubeContainerOomKiller | A container has reached its resource limit or the node maximum, and has been killed by Kubernetes. |
| Image Pull Error | KubeImagePullError | A Pod container cannot pull its image, and is stuck in either the ErrImagePull or ImagePullBackOff status. |
| Spend Threshold Exceeded | SpendThresholdExceeded | Compares the balance of the Chargify subscription associated with the client's organization against a configured USD ($) threshold, and alerts when that threshold is exceeded by any single namespace. Namespace spending is not aggregated; see the Important note below. |
The spend threshold can be adjusted by setting a USD value explicitly in the threshold field on this alerts page.
Important
Namespace spending is not aggregated; the threshold set here applies to every namespace in the organization, but each namespace is evaluated individually. For example, if the threshold is set to 20 and one namespace reaches 5 while another reaches 15, the alert does not fire, because no single namespace has reached the threshold. If one namespace reaches 5 and another reaches 20, the alert fires only for the namespace that reached the threshold.
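To make the per-namespace evaluation concrete, here is a minimal sketch in Python. It is illustrative only, not CoreWeave's implementation: the namespace names, spend figures, and threshold value are hypothetical, and the comparison simply mirrors the behavior described above.

```python
# Illustrative sketch of per-namespace spend evaluation (not CoreWeave's code).
# Each namespace is compared to the threshold individually; spend is never
# summed across namespaces.

THRESHOLD_USD = 20.0  # hypothetical threshold set in the alerts page field

namespace_spend = {  # hypothetical per-namespace spend in USD
    "team-a": 5.0,
    "team-b": 15.0,
}

def namespaces_to_alert(spend: dict[str, float], threshold: float) -> list[str]:
    """Return namespaces whose individual spend has reached the threshold."""
    return [ns for ns, usd in spend.items() if usd >= threshold]

print(namespaces_to_alert(namespace_spend, THRESHOLD_USD))  # [] -- no alert fires
namespace_spend["team-b"] = 20.0
print(namespaces_to_alert(namespace_spend, THRESHOLD_USD))  # ['team-b'] -- alert fires
```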
Note
  • CoreWeave Cloud organizations created before November 14, 2023 will need to explicitly turn on any desired email notifications.
  • Organizations created after November 14, 2023 have all email notifications enabled by default. If specific alerts are not desired, they will need to be manually turned off from the User Settings page.

How alerts are generated

Alerts are given a unique fingerprint generated from the labels present on the Prometheus alert. In the example below, the fingerprint hash 07262e46c7ff7cfb is generated from the key-value pairs in the labels stanza.

Example alert body

```json
{
  "status": "resolved",
  "labels": {
    "alertname": "TestAlert",
    "cluster": "ord1-tenant",
    "cluster_org": "coreweave",
    "container": "kubernetes-ingress-controller",
    "endpoint": "1024",
    "instance": "10.240.77.25:1024",
    "job": "cloud-app-kubernetes-ingress",
    "namespace": "cloud",
    "pod": "cloud-app-kubernetes-ingress",
    "prometheus": "metrics/metrics-prometheus-operato-prometheus",
    "prometheus_replica": "prometheus-metrics-prometheus-operato-prometheus-0",
    "service": "cloud-app-kubernetes-ingress",
    "severity": "critical",
    "team": "customer"
  },
  "annotations": { "message": "Test Message!" },
  "startsAt": "2023-10-03T19:41:11.462Z",
  "endsAt": "2023-10-03T20:28:11.462Z",
  "generatorURL": "/graph?g0.expr=up+%3D%3D+1&g0.tab=1",
  "fingerprint": "07262e46c7ff7cfb"
}
```
If a new attempt to create a Pod or run a scheduled job also fails, the labels on the resulting alert will differ. The alert for the new attempt therefore receives a new unique fingerprint, which in turn triggers a new alert email to the user. Otherwise, each event sends only one alert to the user.
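The sketch below illustrates this behavior by hashing a sorted label set into a stable fingerprint. It uses SHA-256 purely for illustration; it is not the hashing scheme Prometheus or Alertmanager actually uses, and the label values (namespace and Pod names) are hypothetical.

```python
import hashlib

def label_fingerprint(labels: dict[str, str]) -> str:
    """Hash a label set into a stable fingerprint.

    Illustrative only: Prometheus components use their own hashing scheme,
    but the principle is the same -- identical labels produce an identical
    fingerprint, and any changed label produces a new one.
    """
    canonical = "\0".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

labels = {
    "alertname": "KubePodCrashLooping",
    "namespace": "tenant-example",     # hypothetical tenant namespace
    "pod": "my-app-7d4b9c6f5d-abcde",  # hypothetical Pod name
    "severity": "critical",
}

first = label_fingerprint(labels)

# A retried Pod gets a new name, so the label set changes, the fingerprint
# changes, and a new alert email is sent.
retry = dict(labels, pod="my-app-7d4b9c6f5d-zxywv")
second = label_fingerprint(retry)

print(first, second, first != second)  # two different fingerprints
```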