Introduction to CoreWeave Grafana
Learn about the available CoreWeave Grafana Dashboards
CKS provides a managed Grafana instance featuring multiple pre-configured dashboards optimized to display the metrics, events, and logs most relevant to your workloads.
CoreWeave Grafana does not support the creation of new dashboards or the modification of existing ones. If your workload requires a more customized solution, consider implementing a self-hosted Grafana instance in your CKS environment.
Available dashboards
CoreWeave provides a suite of custom dashboards, available by default in CoreWeave Grafana. The following sections give a high-level overview of the dashboard categories and the dashboards in each.
CoreWeave Kubernetes Service (CKS)
CKS dashboards provide insights into Control Plane, Kubernetes, and Pod-level logs and metrics.
Dashboard | Description |
---|---|
Control Plane Logs | Displays logs for Control Plane components, including the API server, etcd, controller manager, and scheduler, so you can monitor their health and status. |
Control Plane Metrics | Displays Control Plane metrics to provide an overall view of your cluster's health. |
Kubernetes Audit Logs | Displays Kubernetes audit logs, showing details of user activities and requests made to the Kubernetes API. |
Pod Inspector | Provides a combined view of common Pod metrics, Kubernetes events, and Pod logs, filtered to a specific Pod or container in any namespace. |
Fleet Management
Fleet Management dashboards provide insight into trends in resource consumption, state, and events for the Nodes and Pods in your clusters.
Dashboard | Description |
---|---|
Cabinet Visualizer | Provides statistics and historical information about each cabinet, its cooling system, and the rack it contains. From here you can monitor the overall health of the cabinet and see important details about each Node in the enclosed rack to identify issues and track historical trends. |
Cabinet Wrangler | Provides an overview of the available GPUs and racks with production-schedulable Nodes. |
Node Details | Provides a detailed breakdown of Node and GPU health, both at the physical hardware level and in Kubernetes. This dashboard contains additional log panels to allow simultaneous viewing of metrics along with the log and event types commonly needed to debug or analyze a Node. |
Node Wrangler | Provides visibility into the overall health and state of all Nodes in your clusters, along with their specs and hardware architectures. |
Knative
The Knative dashboards provide insight into the state of your Knative workloads, including revision metrics, Pod health, and resource consumption.
Dashboard | Description |
---|---|
Envoy Circuit Breaker | Displays specific metrics that monitor upstream connections and overflows in the Envoy data plane. It assists with observing and troubleshooting Knative connections from a Data Plane perspective. |
Global Revision | Offers a high-level overview across all namespaces within a specific cluster. |
Kourier Gateway | Tracks the state of Kourier Gateways for Knative, with panels for traffic metrics. It provides useful insight for debugging connection and networking issues specific to Knative deployments and revisions from an ingress perspective. |
Namespace Revision | Displays statistics per namespace. |
Revision Overview | Shows detailed statistics per Knative configuration. |
Learn more about Knative and Knative Revisions.
Kubernetes Training
The Kubernetes Training dashboards provide detailed information about the components and processes related to Kubernetes training jobs.
Dashboard | Description |
---|---|
Training Jobs | Provides a summary of Kubernetes training jobs, displaying information about Nodes, Pods, job efficiency, and active GPUs for training jobs. |
Network Backbone
The Network Backbone dashboards provide insight into the state of your network, including traffic metrics, connection counts, and latency.
Dashboard | Description |
---|---|
Internet Transit | Provides a near-real-time view into how much traffic flows through your backbone egress/ingress points, how that traffic is distributed across internal services, and how well the network is performing from an end-user perspective. |
Storage
The Storage dashboards provide performance and operational metrics for the CAIOS, distributed file, VAST, and WEKA storage systems.
Dashboard | Description |
---|---|
CAIOS LOTA | Displays metrics for the CAIOS LOTA storage system, focusing on requests, cache hit rates, and range requests. |
CAIOS Usage | Displays specific metrics for CAIOS to assist with monitoring and troubleshooting CAIOS performance. |
Distributed File Storage | Displays information about distributed file storage usage, including storage volume metrics and the status of persistent volume claims. |
VAST Actors | Displays metrics related to VAST storage performance, including NFS bandwidth, requests per second, metadata IOPS response time, and Node response times. |
WEKA | Provides information about the WEKA storage system, including cluster-wide info, protection status, cluster capacities, and performance metrics. |
SUNK
The SUNK dashboards provide insight into the state of your Slurm cluster, including job metrics, Node health, and resource consumption.
Dashboard | Description |
---|---|
Slurm Cluster | Provides an overview of the state of SUNK jobs within your clusters, with less focus on the specifics of a given job. It aggregates metrics by relevant properties, such as job state, resource type, and user, to illustrate trends in jobs. |
Slurm Job Metrics | Displays metrics used for debugging and analyzing the performance of specific jobs within your cluster. It provides detailed information about the Nodes running a given job, including alerts and states. |
To access these dashboards in a self-hosted Grafana instance, install the Helm chart from the CoreWeave Charts repository.