Get Started with Managed Grafana
Accessing CoreWeave's managed Grafana instance
CKS provides a managed Grafana instance featuring multiple pre-configured dashboards optimized to display the metrics, events, and logs most relevant to your workloads.
CoreWeave's Managed Grafana instance does not support the creation of new dashboards or the modification of existing ones. If your workload requires a more customized solution, consider implementing a self-hosted Grafana instance in your CKS environment.
Accessing Managed Grafana
The CoreWeave Managed Grafana instance is hosted at https://cks-grafana.coreweave.com. To access the Managed Grafana instance directly from the Console UI, click the bar graph icon on the far left of the screen.
You must be logged in to Cloud Console in order to access the Managed Grafana instance. Access to the dashboards, and visibility of the Console button, is only available to users in the admin, metrics, or write groups.
Grafana Explore
Grafana Explore is now available on CoreWeave's Managed Grafana instance. Explore lets you run ad hoc PromQL and LogQL queries against metrics and logs, and browse metadata such as which metrics exist and which labels are available for filtering PromQL queries.
While CoreWeave's default dashboards offer curated views of specific logs and metrics, the Explore tab allows users to more deeply investigate and learn about metrics.
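As a rough illustration of the label filtering described above, the following Python sketch assembles PromQL and LogQL query strings that could be pasted into the Explore query editor. The helper names (`build_selector`, `rate_sum_by`, `logql_line_filter`) are hypothetical, and the metric name and the `namespace` and `pod` labels are assumptions chosen for illustration. Use the Explore metadata browser to confirm which metrics and labels actually exist in your environment.

```python
from typing import Mapping, Sequence


def _escape(value: str) -> str:
    # Escape backslashes and double quotes so the value is safe inside a quoted matcher.
    return value.replace("\\", "\\\\").replace('"', '\\"')


def _matchers(labels: Mapping[str, str]) -> str:
    # Build exact-match label matchers, sorted by label name for stable output.
    return ",".join(
        f'{name}="{_escape(value)}"' for name, value in sorted(labels.items())
    )


def build_selector(metric: str, labels: Mapping[str, str]) -> str:
    """Return a PromQL selector for `metric`, filtered by exact label matches."""
    matchers = _matchers(labels)
    return f"{metric}{{{matchers}}}" if matchers else metric


def rate_sum_by(selector: str, window: str, by: Sequence[str]) -> str:
    """Wrap a counter selector in rate() and sum the result by the given labels."""
    return f"sum by ({', '.join(by)}) (rate({selector}[{window}]))"


def logql_line_filter(labels: Mapping[str, str], text: str) -> str:
    """Return a LogQL query: a stream selector plus a 'line contains' filter."""
    return f'{{{_matchers(labels)}}} |= "{_escape(text)}"'


# Assumed metric and label names, for illustration only.
cpu = build_selector(
    "container_cpu_usage_seconds_total",
    {"namespace": "default", "pod": "my-pod"},
)
print(rate_sum_by(cpu, "5m", ["pod"]))
print(logql_line_filter({"namespace": "default", "pod": "my-pod"}, "error"))
```

The sketch only builds query text and does not contact Grafana. Sorting the matchers keeps the generated query stable, which makes it easier to compare or reuse queries between sessions.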
Available dashboards
CoreWeave provides a suite of custom dashboards, available by default in Managed Grafana.
To access these dashboards in a self-hosted Grafana instance, install the Helm Chart in the CoreWeave Charts repository.
CAIOS Usage
The CAIOS Usage dashboard displays specific metrics for CoreWeave AI Object Storage to assist with monitoring and troubleshooting CoreWeave AI Object Storage performance.
Pod Inspector
The Pod Inspector dashboard provides a combined view of common Pod metrics, Kubernetes events, and Pod logs, filtered to a specific Pod or container in any namespace.
SUNK dashboards
The SUNK dashboards provide insight into the state of your Slurm cluster, including job metrics, Node health, and resource consumption.
SLURM / Job Metrics
The SLURM / Job Metrics dashboard displays metrics used for debugging and analyzing the performance of specific jobs within your cluster. It provides detailed information about the Nodes running a given job, including alerts and states.
SLURM / Namespace
The SLURM / Namespace dashboard provides an overview of the state of SUNK jobs within your clusters, with less focus on the specifics of a given job. It aggregates metrics by relevant properties, such as job state, resource type, and user, to illustrate trends in jobs.
Knative dashboards
The Knative dashboards provide insight into the state of your Knative workloads, including revision metrics, Pod health, and resource consumption.
Learn more about Knative and Knative Revisions.
Kourier Gateway
The Kourier Gateway dashboard tracks the state of Kourier Gateways for Knative, with panels for traffic metrics. It provides useful insight for debugging connection and networking issues specific to Knative deployments and revisions from an ingress perspective.
Envoy Circuit Breaker
The Envoy Circuit Breaker dashboard displays specific metrics that monitor upstream connections and overflows in the Envoy data plane. It assists with observing and troubleshooting Knative connections from a data plane perspective.
Revision dashboards
The Knative Revision dashboards all provide similar types of Knative metrics, but at different levels of aggregation:
- The Global Revision Overview dashboard offers a high-level overview across all namespaces within a specific cluster.
- The Namespace Revision Overview dashboard displays statistics per namespace.
- The Revision Overview dashboard shows detailed statistics per Knative configuration.
Fleet Management dashboards
The Fleet Management dashboards provide insight into trends in resource consumption, state, and events for the Nodes and Pods in your clusters.
Node Details
The Node Details dashboard provides a detailed breakdown of Node and GPU health, both at the physical hardware level and in Kubernetes. This dashboard contains additional log panels to allow simultaneous viewing of metrics along with the log and event types commonly needed to debug or analyze a Node.
Node Wrangler
The Node Wrangler dashboard provides visibility into the overall health and state of all Nodes in your clusters, along with their specs and hardware architectures.
Cabinet Visualizer
The Cabinet Visualizer dashboard provides statistics and historical information about each cabinet, its cooling system, and the rack it contains. From here, you can monitor the overall health of the cabinet and view important details about each Node in the enclosed rack to identify issues and track historical trends.