To view the dashboard, go to the SUNK Global dashboard.
The SUNK Global dashboard provides a high-level overview of SUNK infrastructure across all regions, cluster organizations, and clusters. It displays alerts, job distribution, software version information, and control plane health metrics.
Filters and parameters
Use these filters at the top of the page to choose the data you want to view:
| Filter | Description |
|---|---|
| Data Source | The Prometheus data source selector. |
| Region | The region(s) to view. Supports multi-select. |
| Org | The cluster organization(s) to view. Supports multi-select. |
| Cluster | The specific Kubernetes cluster(s) to view. Supports multi-select. |
Set the time range and refresh rate parameters at the top-right of the page. The default time range is 1 hour.
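As a rough illustration of how multi-select filters like these could narrow a query, the sketch below builds a PromQL selector from the selected values. It assumes the filters are applied as Prometheus label matchers on `region`, `cluster_org`, and `cluster`, which this page does not confirm. The helper name `build_selector` and the example values are hypothetical.

```python
import re

# Assumption: filter values are simple identifiers, so they can be placed
# in a regex matcher without escaping. Anything else is rejected.
_SAFE_VALUE = re.compile(r"^[A-Za-z0-9_-]+$")


def build_selector(metric, filters):
    """Build a PromQL selector from multi-select filter values.

    `filters` maps a label name to the list of selected values. An empty
    list means "no restriction", so that label is left out.
    """
    matchers = []
    for label, values in filters.items():
        if not values:
            continue
        for value in values:
            if not _SAFE_VALUE.match(value):
                raise ValueError(f"unsupported filter value: {value!r}")
        # A regex matcher with alternation matches any selected value.
        matchers.append(label + '=~"' + "|".join(values) + '"')
    if not matchers:
        return metric
    return metric + "{" + ",".join(matchers) + "}"
```

For example, `build_selector("ALERTS", {"region": ["us-east", "us-west"], "cluster_org": [], "cluster": ["c1"]})` returns `ALERTS{region=~"us-east|us-west",cluster=~"c1"}`. Rejecting unexpected characters keeps the sketch simple; a real implementation would need proper regex and string escaping.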
Panel descriptions
SUNK Alerts
The SUNK Alerts section displays active alerts across all monitored SUNK infrastructure.
| Panel | Description |
|---|---|
| SUNK Alerts Timeline | A time-series graph showing when alerts fired. Each line represents an alert grouped by alertname, region, cluster_org, and cluster. |
| SUNK Alerts Table (Instant) | A table showing all currently firing SUNK alerts. Includes alertname, severity, team, region, cluster_org, cluster, namespace, node, nodeset, and pod information. Namespace, node, and pod values link to detailed dashboards. |
The SUNK Alerts Table links each alert to the matching Slurm Cluster Dashboard view, so you can investigate the context around a firing alert. For detailed troubleshooting, click a node value to open Node Details or a pod value to open the Pod Inspector.
Slurm Jobs
The Slurm Jobs section displays the distribution of running Slurm jobs.
| Panel | Description |
|---|---|
| Largest Job by Nodes | A donut chart showing the top 10 largest running jobs by node count. Clicking on a slice navigates to the Slurm Job Metrics dashboard. |
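The "top 10 by node count" selection can be sketched as follows. This is only an illustration of the ranking, not the panel's actual query. The function name `largest_jobs` and the input shape (a mapping of job ID to node count) are assumptions.

```python
import heapq


def largest_jobs(node_counts, limit=10):
    """Return the `limit` largest jobs as (job_id, node_count) pairs.

    `node_counts` maps a job ID to the number of nodes the job is running
    on. Results are ordered from largest to smallest.
    """
    return heapq.nlargest(limit, node_counts.items(), key=lambda item: item[1])
```

With more than ten running jobs, only the ten with the highest node counts are returned, which corresponds to the slices shown in the donut chart.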
Software Versions
The Software Versions section displays version distribution across the SUNK infrastructure.
| Panel | Description |
|---|---|
| SUNK Versions Piechart | A donut chart showing SUNK version distribution across syncer containers. |
| SUNK Versions Table | A table listing SUNK versions by region, cluster_org, cluster, and namespace with node counts. |
| SUNK Operator Versions | A donut chart showing SUNK Operator (controller-manager) version distribution. |
| SUNK Operator Versions Table | A table listing SUNK Operator (controller-manager) versions by region, cluster_org, and cluster with node counts. |
| CUDA Versions Piechart | A donut chart showing CUDA version distribution across slurmd containers. |
| CUDA Versions Table | A table listing CUDA versions by region, cluster_org, cluster, and namespace with node counts. |
| OS Versions Piechart | A donut chart showing Ubuntu OS version distribution across SUNK daemon containers. |
| OS Versions Table | A table listing OS versions by region, cluster_org, cluster, and namespace with node counts. |
The SUNK Versions panels show which SUNK version each cluster is running. The donut chart shows the distribution at a glance, while the table shows exactly which clusters are running each version, making it easier to plan and coordinate upgrades.
The CUDA Versions panels display CUDA version distribution across all slurmd containers. Use these panels to track which CUDA versions your clusters are using and plan upgrades across your infrastructure.
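To make "version distribution" concrete, the sketch below counts how many containers report each version and computes each version's share, which is the kind of breakdown the donut charts and tables present. The function name `version_distribution` and the sample version strings are hypothetical, and how the dashboard actually collects versions is not described here.

```python
from collections import Counter


def version_distribution(versions):
    """Map each version to (count, share of total), most common first.

    `versions` is an iterable with one reported version string per
    container. An empty input yields an empty mapping.
    """
    counts = Counter(versions)
    total = sum(counts.values())
    return {
        version: (count, count / total)
        for version, count in counts.most_common()
    }
```

For instance, `version_distribution(["12.4", "12.4", "12.2", "12.4"])` reports three containers (75%) on `12.4` and one (25%) on `12.2`, which makes the remaining upgrade work easy to see.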
Control Plane Health
The Control Plane Health section displays SUNK control plane component status.
| Panel | Description |
|---|---|
| Control Plane Restart Counters | A time-series graph showing restart count increases for control plane containers (scheduler, slurmctld, slurmdbd, slurmrestd, syncer, mysql) over time. |
| Control Plane Readiness Status (Instant) | A table showing the current ready state of control plane pods. Includes Container, Pod, Status (Ready/Not Ready), region, cluster_org, cluster, namespace, and a link to view pod logs in Loki. |
The Control Plane Readiness Status table displays the current ready state of all control plane components. To troubleshoot a component that is not ready, drill down into its pod via the Pod Inspector or open its Loki logs directly from the table.
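The "restart count increases" shown in the Control Plane Restart Counters panel can be understood through the sketch below, which totals the increase of a cumulative restart counter over a series of samples. It assumes the counter is sampled at regular intervals and that a drop in value means the counter was reset. The panel's actual query is not documented here, and the function name `restart_increase` is hypothetical.

```python
def restart_increase(samples):
    """Total increase of a cumulative counter across `samples`.

    A drop between consecutive samples is treated as a counter reset
    (for example, the pod was recreated), so the new value counts in full.
    """
    total = 0
    for previous, current in zip(samples, samples[1:]):
        total += current - previous if current >= previous else current
    return total
```

A flat series yields zero, so a non-zero result for a container such as `slurmctld` indicates that it restarted during the selected time range.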