To view the dashboard, go to the Node Details dashboard.
For instructions on accessing CoreWeave Grafana dashboards, see Access and use CoreWeave Grafana dashboards.
Example view
When the dashboard first loads, the Summary panel group is expanded by default. The Summary section is shown in this example.
Panel groups
The dashboard has several expandable panel groups, each focused on a different aspect of Node telemetry. Click a group heading to collapse or expand that section.

| Section | What you’ll find |
|---|---|
| Summary | High-level “at-a-glance” cards for host identity, CPU/GPU utilization, network traffic, current alerts, and HPC verification status. |
| Logs | A rolling log view that merges Kubernetes events, kernel messages, Node Problem Detector findings, and recent alert annotations. Ideal for pinpointing the root cause of spikes or failures. |
| Network | End-to-end connectivity metrics: ICMP loss from the Ping Exporter, conntrack table usage, interface throughput, packet error rates, and more. |
| Resources | Capacity and utilization for CPU, GPU, memory, NFS mounts, and local disk I/O, plus historical usage charts to identify resource pressure. |
| GPUs | Everything GPU-related: ECC errors, power draw, clock speeds, memory usage, thermal headroom, fan RPM, and NVSwitch/NVLink health. |
| Temperatures | Real-time thermal data for GPU cores, HBM memory, motherboard sensors, and per-Pod thermal impact: helpful for identifying cooling hot spots. |
| InfiniBand | Port state, link speed, retransmit counts, and congestion indicators for Nodes equipped with InfiniBand adapters. |
| Slurm Info | Node state within your Slurm cluster (idle, alloc, drain), running jobs, and allocation timelines: useful for mixed Kubernetes-Slurm environments. |
| Kubernetes | Pod allocation breakdowns, taints, Reservation status, CPU-hours per job, memory by namespace, and Calico network policy statistics. |
| Verification | Results from the latest HPC verification suite, including GPU compute benchmarks, stress tests, and sanity checks. |
| Hardware | Low-level chassis data: IPMI power readings, fan speeds, serial numbers, PCIe link status, and NVLink bandwidth graphs. |
| Unsorted | Additional metrics that do not yet belong to a dedicated group; check here for new or experimental panels. |