CKS provides a managed Grafana instance featuring multiple pre-configured dashboards optimized to display the metrics, events, and logs most relevant to your workloads.
CoreWeave Grafana does not support the creation of new dashboards or the modification of existing ones. If your workload requires a more customized solution, consider implementing a self-hosted Grafana instance in your CKS environment.

Available dashboards

CoreWeave provides a suite of custom dashboards, available by default in CoreWeave Grafana. The following sections give a high-level overview of the dashboard categories and the dashboards in each.

CoreWeave

The CoreWeave home dashboard provides an overview of your CoreWeave environment, including announcements, environment-wide metrics, and live service status.

| Dashboard | Description | Dashboard link |
| --- | --- | --- |
| Home | Displays platform announcements, top-level GPU, Pod, and storage metrics, links to key dashboards, and CoreWeave service status. To learn more, read the Grafana Home documentation. | Home dashboard in Grafana |

CoreWeave Kubernetes Service (CKS)

CKS dashboards provide insights into Control Plane, Kubernetes, and Pod-level logs and metrics.

| Dashboard | Description | Dashboard link |
| --- | --- | --- |
| Control Plane Logs | Displays logs for Control Plane components, including the API server, etcd, controller manager, and scheduler, so that you can monitor their health and status. To learn more, go to Control Plane Logs. | Control Plane Logs |
| Control Plane Metrics | Provides a robust view of your cluster’s health. To learn more, go to Control Plane Metrics. | Control Plane Metrics |
| Kubernetes Audit Logs | Displays Kubernetes audit logs, showing details of user activities and requests made to the Kubernetes API. To learn more, go to Kubernetes Audit Logs. | Kubernetes Audit Logs |
| Kueue Metrics Dashboard | Displays Kueue metrics for workload health and status. To learn more, go to Kueue Metrics Dashboard. | Kueue Metrics Dashboard |
| Pod Inspector | Provides a combined view of common Pod metrics, Kubernetes events, and Pod logs, filtered to a specific Pod or container in any namespace. To learn more, go to Pod Inspector. | Pod Inspector |

Fleet Management

Fleet Management dashboards provide insight into trends in resource consumption, state, and events for the Nodes and Pods in your clusters.

| Dashboard | Description | Dashboard link |
| --- | --- | --- |
| Cabinet Visualizer | Provides statistics and historical information about each cabinet, its cooling system, and the rack it contains. From here you can monitor the overall health of the cabinet and see important details about each Node in the enclosed rack to identify issues and track historical trends. To learn more, go to Cabinet Visualizer. | Cabinet Visualizer |
| Cabinet Wrangler | Provides an overview of the available GPUs and racks with production-schedulable Nodes. To learn more, go to Cabinet Wrangler. | Cabinet Wrangler |
| Node Details | Provides a detailed breakdown of Node and GPU health, both at the physical hardware level and in Kubernetes. This dashboard contains additional log panels to allow simultaneous viewing of metrics along with the log and event types commonly needed to debug or analyze a Node. To learn more, go to Node Details. | Node Details |
| Node Wrangler | Provides visibility into the overall health and state of all Nodes in your clusters, along with their specs and hardware architectures. To learn more, go to Node Wrangler. | Node Wrangler |

Kubernetes Training

Kubernetes Training dashboards provide detailed information about the components and processes related to Kubernetes training jobs.

| Dashboard | Description | Dashboard link |
| --- | --- | --- |
| Training Jobs | Provides a summary of Kubernetes training jobs, displaying information about Nodes, Pods, job efficiency, and active GPUs for training jobs. To learn more, go to Training Jobs. | Training Jobs |

Network Backbone

The Network Backbone dashboards provide insight into the state of your network, including traffic metrics, connection counts, and latency.

| Dashboard | Description | Dashboard link |
| --- | --- | --- |
| Internet Transit | Provides a near-real-time view into how much traffic flows through your backbone egress/ingress points, how that traffic is distributed across internal services, and how well the network is performing from an end-user perspective. To learn more, go to Internet Transit. | Internet Transit |

Storage

The Storage dashboards provide performance and operational metrics for the CAIOS, distributed file, VAST, and WEKA storage systems.

| Dashboard | Description | Dashboard link |
| --- | --- | --- |
| CAIOS LOTA | Displays metrics for the CAIOS LOTA storage system, focusing on requests, cache hit rates, and range requests. To learn more, go to CAIOS LOTA. | CAIOS LOTA |
| CAIOS Usage | Displays specific metrics for CAIOS to assist with monitoring and troubleshooting CAIOS performance. To learn more, go to CAIOS Usage. | CAIOS Usage |
| Distributed File Storage Usage | Displays information about distributed file storage usage, including storage volume metrics and persistent volume claim status. To learn more, go to Distributed File Storage Usage. | Distributed File Storage Usage |
| VAST Actors | Displays metrics related to VAST storage performance, including NFS bandwidth, requests per second, metadata IOPS response time, and Node response times. To learn more, go to VAST Actors. | VAST Actors |
| WEKA | Provides information about the WEKA storage system, including cluster-wide info, protection status, cluster capacities, and performance metrics. To learn more, go to WEKA. | WEKA |

SUNK

The SUNK dashboards provide insight into the state of your Slurm cluster, including job metrics, Node health, and resource consumption.

| Dashboard | Description | Dashboard link |
| --- | --- | --- |
| Slurm Block Topology | Provides an overview of the availability, allocation, and status of compute Nodes, organized by block topology. To learn more, go to Slurm Block Topology. | Slurm Block Topology |
| Slurm Cluster | Provides an overview of the state of SUNK jobs within your clusters, with less focus on the specifics of a given job. It aggregates metrics by relevant properties, such as job state, resource type, and user, to illustrate trends in jobs. To learn more, go to Slurm Cluster. | Slurm Cluster |
| Slurm Job Metrics | Displays metrics used for debugging and analyzing the performance of specific jobs within your cluster. It provides detailed information about the Nodes running a given job, including alerts and states. To learn more, go to Slurm Job Metrics. | Slurm Job Metrics |
| SUNK Global | Provides a high-level overview of SUNK infrastructure across all regions, cluster organizations, and clusters. To learn more, go to SUNK Global. | SUNK Global |

To access these dashboards in a self-hosted Grafana instance, install the Helm chart in the CoreWeave Charts repository.
Last modified on April 13, 2026