Cabinet Visualizer

Monitor the aggregate statistics of full cabinets with Grafana

The Cabinet Visualizer dashboard displays statistics and historical data for each cabinet, including its cooling system and enclosed rack. You can monitor overall cabinet health and view detailed information about each Node to identify issues and track performance trends over time.

Important

This dashboard is especially useful for monitoring the health of GB200 NVL72-powered Node Pools and their individual Nodes, as they are deployed only as full racks in dedicated cabinets.

Prerequisites

You must be a member of the admin, metrics, or write group to access Grafana dashboards.

Get started

To open the Cabinet Visualizer dashboard:

1. Log in to the CoreWeave Cloud Console.
2. Click Grafana in the left menu to open the Managed Grafana instance.
3. Click Dashboards in the left menu of Grafana.
4. Expand the Fleet Management section to reveal the Cabinet Visualizer dashboard.
5. Click Cabinet Visualizer to open the dashboard.
6. In the top panel, select the Cluster org, Cluster, Zone, and NVLink domain you want to visualize.

If you are already logged in to CoreWeave Cloud Console, you can open the Cabinet Visualizer dashboard directly from this link.

Dashboard panel overview

The Cabinet Visualizer dashboard includes sections for aggregate statistics, GPU tray visualization, time-series graphs, and rack details. Each section provides different insights into the cabinet's performance and health.

Aggregate statistics

These panels on the upper left show the most recent high-level metrics for all Nodes in the cabinet.

Metric	Description
Current Average GPU Utilization	Average percentage of GPU resources currently in use.
Current Average NVLink Utilization	Average percentage of NVLink bandwidth currently in use.
Total FP8 FLOP/s	Total FP8-format floating-point operations per second.
Current Average GPU Temperature	Average GPU temperature across all Nodes, in degrees Celsius.
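
As a rough illustration of how cabinet-level figures like these relate to per-Node readings, the sketch below averages utilization and temperature and sums FP8 throughput over a list of samples. The field names (`gpu_util_pct`, `nvlink_util_pct`, `fp8_flops`, `gpu_temp_c`) and the sample values are hypothetical. The dashboard's actual queries and metric names are not described on this page, and it may aggregate differently (for example, per GPU rather than per Node).

```python
from statistics import mean

# Hypothetical per-Node samples; names and values are illustrative only.
samples = [
    {"node": "node-a", "gpu_util_pct": 80.0, "nvlink_util_pct": 40.0,
     "fp8_flops": 2.0e15, "gpu_temp_c": 60.0},
    {"node": "node-b", "gpu_util_pct": 60.0, "nvlink_util_pct": 20.0,
     "fp8_flops": 1.0e15, "gpu_temp_c": 50.0},
]


def summarize_cabinet(samples):
    """Average the percentage and temperature readings; sum the FLOP/s."""
    return {
        "avg_gpu_util_pct": mean(s["gpu_util_pct"] for s in samples),
        "avg_nvlink_util_pct": mean(s["nvlink_util_pct"] for s in samples),
        "total_fp8_flops": sum(s["fp8_flops"] for s in samples),
        "avg_gpu_temp_c": mean(s["gpu_temp_c"] for s in samples),
    }


summary = summarize_cabinet(samples)
print(summary)
```

Note that the three "Average" panels are means, while Total FP8 FLOP/s is a sum, so adding or removing Nodes changes the total even when the averages stay flat.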

GPU tray visualization

This panel on the lower left shows a visual layout of the enclosed rack, with each Node labeled by name. Color coding indicates each Node's NLCC (Node lifecycle) state, Kubernetes state, and GPU temperature. Hover over any indicator for more details, or click a Node to view its full status.

Refer to the legend at the bottom of the panel to interpret the color codes.

Time-series graphs

These panels on the upper right show time-series graphs for aggregate NVLink bandwidth and GPU utilization across the cabinet. Use them to monitor performance trends and detect anomalies. Hover over any graph to view detailed data points. Use the list of Nodes beside each graph to filter the data by individual Node.

Rack details

This panel on the lower right provides detailed information about each Node in the rack.

Column	Description	Values
Node	The name of the Node
Deviceslot	The Deviceslot where the Node is installed
RU	The Rack Unit where the Node is physically located in the cabinet
NLCC State	The current Node lifecycle state	Any valid Node lifecycle state
Reserved	The ID of the organization the Node is reserved for
K8s ready	Indicates if the Node is online in Kubernetes	True: Online. False: Not online.
Avg GPU Temp	Average temperature of all GPUs on the Node
GPU P2P	The peer-to-peer communication status between GPUs on the Node. Peering is required for any form of NVLink communication, both intra- and inter-tray	OK: The GPUs are peering correctly. X: The GPU is peering with itself; this is ignored. NS: Not supported.
HPC Verification	Result of the most recent HPC (High Performance Computing) validation checks run on the Node	Passed: The most recent run completed successfully. Failed: The most recent run failed. Not Run: A check has not run yet. This is common for newly delivered Nodes in a CKS cluster, since the verification data is stored on the Kubernetes host, not the Node itself.
Alerts	Alert status for the Node	None, Pending, or Firing
Active	Whether the Node is currently running a workload Pod that is neither interruptible nor part of a DaemonSet	Yes: The Node is running a workload. No: The Node is not running a workload.
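
The rule behind the Active column can be read as a simple predicate over the Pods on a Node. The sketch below is an illustration under stated assumptions, not the dashboard's implementation: the `Pod` class and its `owner_kind` and `interruptible` fields are hypothetical stand-ins for however the platform records DaemonSet ownership and interruptibility.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    """Hypothetical, simplified view of a Pod scheduled on a Node."""
    name: str
    owner_kind: str      # assumed values such as "DaemonSet", "Job", "ReplicaSet"
    interruptible: bool  # assumed flag marking a Pod as interruptible


def node_is_active(pods):
    """Return True if at least one Pod is neither interruptible
    nor part of a DaemonSet."""
    return any(
        pod.owner_kind != "DaemonSet" and not pod.interruptible
        for pod in pods
    )


# A Node running only a DaemonSet Pod and an interruptible Pod is not active.
idle_node = [
    Pod("metrics-agent", "DaemonSet", False),
    Pod("batch-filler", "Job", True),
]

# Adding one ordinary, non-interruptible workload Pod makes it active.
busy_node = idle_node + [Pod("training-worker", "Job", False)]

print(node_is_active(idle_node), node_is_active(busy_node))
```

Under this reading, a Node can host several Pods and still show No in the Active column, as long as every Pod is either DaemonSet-managed or interruptible.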
