Cabinet Visualizer
Monitor the aggregate statistics of full cabinets with Grafana
The Cabinet Visualizer dashboard displays statistics and historical data for each cabinet, including its cooling system and enclosed rack. You can monitor overall cabinet health and view detailed information about each Node to identify issues and track performance trends over time.
This dashboard is especially useful for monitoring the health of GB200 NVL72-powered Node Pools and their individual Nodes, as they are deployed only as full racks in dedicated cabinets.
Prerequisites
You must be a member of the admin, metrics, or write groups to access Grafana dashboards.
Get started
To open the Cabinet Visualizer dashboard:
- Log in to the CoreWeave Cloud Console.
- Click Grafana in the left menu to open the Managed Grafana instance.
- Click Dashboards in the left menu of Grafana.
- Expand the Fleet Management section to reveal the Cabinet Visualizer dashboard.
- Click Cabinet Visualizer to open the dashboard.
- In the top panel, select the Cluster org, Cluster, Zone, and NVLink domain you want to visualize.
Dashboard panel overview
The Cabinet Visualizer dashboard includes sections for aggregate statistics, GPU tray visualization, time-series graphs, and rack details. Each section provides different insights into the cabinet's performance and health.
Aggregate statistics
These panels on the upper left show the most recent high-level metrics for all Nodes in the cabinet.
Metric | Description |
---|---|
Current Average GPU Utilization | Average percentage of GPU resources currently in use. |
Current Average NVLink Utilization | Average percentage of NVLink bandwidth currently in use. |
Total FP8 FLOP/s | Total FP8-format floating-point operations per second. |
Current Average GPU Temperature | Average GPU temperature across all Nodes, in degrees Celsius. |
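
To illustrate how these cabinet-level figures relate to per-Node readings, here is a minimal Python sketch. The `NodeSample` record, its field names, and the `aggregate` helper are hypothetical and exist only for this example; the queries the dashboard actually runs are not described on this page.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class NodeSample:
    """Most recent reading for one Node. All field names are hypothetical."""
    name: str
    gpu_util_pct: float     # GPU utilization on the Node, 0-100
    nvlink_util_pct: float  # NVLink bandwidth utilization on the Node, 0-100
    fp8_flops: float        # FP8 floating-point operations per second
    gpu_temp_c: float       # GPU temperature on the Node, degrees Celsius


def aggregate(samples):
    """Reduce per-Node samples to cabinet-level figures.

    Averages are taken across Nodes; FP8 throughput is summed.
    """
    if not samples:
        raise ValueError("at least one Node sample is required")
    return {
        "avg_gpu_util_pct": mean(s.gpu_util_pct for s in samples),
        "avg_nvlink_util_pct": mean(s.nvlink_util_pct for s in samples),
        "total_fp8_flops": sum(s.fp8_flops for s in samples),
        "avg_gpu_temp_c": mean(s.gpu_temp_c for s in samples),
    }
```

Note that the three averages are means, while FP8 throughput is a sum, so it grows with the number of Nodes reporting. Averaging per-Node values matches a per-GPU average only when every Node has the same number of GPUs.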
GPU tray visualization
This panel on the lower left shows a visual layout of the enclosed rack, with each Node labeled by name. Color coding indicates the Node's NLCC state, Kubernetes state, and GPU temperature. Hover over any indicator for more details, or click a Node to view its full status.
Refer to the legend at the bottom of the panel to interpret the color codes.
Time-series graphs
These panels on the upper right show time-series graphs for aggregate NVLink bandwidth and GPU utilization across the cabinet. Use them to monitor performance trends and detect anomalies. Hover over any graph to view detailed data points. Use the list of Nodes beside each graph to filter the data by individual Node.
Rack details
This panel on the lower right provides detailed information about each Node in the rack.
Column | Description | Values |
---|---|---|
Node | The name of the Node | |
Deviceslot | The Deviceslot where the Node is installed | |
RU | The Rack Unit where the Node is physically located in the cabinet | |
NLCC State | The current Node lifecycle state | Any valid Node lifecycle state |
Reserved | The organization ID for this Node | |
K8s ready | Indicates if the Node is online in Kubernetes | |
Avg GPU Temp | Average temperature of all GPUs on the Node | |
GPU P2P | The peer-to-peer communication status between GPUs on the Node. Peer-to-peer communication is required for any form of NVLink communication, both intra- and inter-tray | |
HPC Verification | Result of the most recent HPC (High Performance Computing) validation checks run on the Node | |
Alerts | Alert status for the Node | |
Active | Whether the Node is currently running a workload Pod that is neither interruptible nor part of a DaemonSet | |
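
The rule behind the Active column can be read as a predicate over the Pods on a Node. The sketch below is one possible reading in Python. The `Pod` record and its fields are hypothetical, and this page does not specify how the dashboard decides that a Pod is interruptible or owned by a DaemonSet.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    """Simplified Pod record. All fields are assumptions for illustration."""
    name: str
    interruptible: bool  # hypothetical flag marking an interruptible workload
    owner_kind: str      # kind of the owning controller, e.g. "DaemonSet" or "Job"
    phase: str           # Kubernetes Pod phase, e.g. "Running" or "Pending"


def is_active(pods):
    """Return True if any running Pod is neither interruptible nor DaemonSet-owned."""
    return any(
        p.phase == "Running"
        and not p.interruptible
        and p.owner_kind != "DaemonSet"
        for p in pods
    )
```

Under this reading, a Node that runs only DaemonSet Pods and interruptible workloads is not Active, even though Pods are scheduled on it.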