Cabinet Visualizer

Monitor the aggregate statistics of full cabinets with Grafana

The Cabinet Visualizer dashboard displays statistics and historical data for each cabinet, including its cooling system and enclosed rack. You can monitor overall cabinet health and view detailed information about each Node to identify issues and track performance trends over time.

Important

This dashboard is especially useful for monitoring the health of GB200 NVL72-powered Node Pools and their individual Nodes, as they are deployed only as full racks in dedicated cabinets.

Prerequisites

You must be a member of the admin, metrics, or write group to access Grafana dashboards.

Get started

To open the Cabinet Visualizer dashboard:

  1. Log in to the CoreWeave Cloud Console.
  2. Click Grafana in the left menu to open the Managed Grafana instance.
  3. Click Dashboards in the left menu of Grafana.
  4. Expand the Fleet Management section to reveal the Cabinet Visualizer dashboard.
  5. Click Cabinet Visualizer to open the dashboard.
  6. In the top panel, select the Cluster org, Cluster, Zone, and NVLink domain you want to visualize.

Dashboard panel overview

The Cabinet Visualizer dashboard includes sections for aggregate statistics, GPU tray visualization, time-series graphs, and rack details. Each section provides different insights into the cabinet's performance and health.

Aggregate statistics

These panels on the upper left show the most recent high-level metrics for all Nodes in the cabinet.

Metric | Description
Current Average GPU Utilization | Average percentage of GPU resources currently in use.
Current Average NVLink Utilization | Average percentage of NVLink bandwidth currently in use.
Total FP8 FLOP/s | Total FP8-format floating-point operations per second.
Current Average GPU Temperature | Average GPU temperature across all Nodes, in degrees Celsius.
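
To make the difference between the averaged metrics and the summed metric concrete, the following is a minimal Python sketch. It is not how the dashboard itself computes these values: the `NodeSample` class, its field names, and the sample numbers are hypothetical, and it assumes one most-recent reading per Node.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class NodeSample:
    """Hypothetical most-recent reading for one Node (illustrative fields only)."""
    node: str
    gpu_util_pct: float      # GPU utilization, percent
    nvlink_util_pct: float   # NVLink bandwidth utilization, percent
    fp8_flops: float         # FP8 floating-point operations per second
    gpu_temp_c: float        # GPU temperature, degrees Celsius


def aggregate(samples):
    """Combine per-Node readings into cabinet-level statistics.

    Utilization and temperature are averaged across Nodes;
    FP8 FLOP/s is summed, because it is reported as a total.
    """
    if not samples:
        raise ValueError("need at least one Node sample")
    return {
        "avg_gpu_util_pct": mean(s.gpu_util_pct for s in samples),
        "avg_nvlink_util_pct": mean(s.nvlink_util_pct for s in samples),
        "total_fp8_flops": sum(s.fp8_flops for s in samples),
        "avg_gpu_temp_c": mean(s.gpu_temp_c for s in samples),
    }


# Made-up readings for two Nodes, for illustration only.
samples = [
    NodeSample("node-a", 80.0, 40.0, 1.0e15, 60.0),
    NodeSample("node-b", 60.0, 20.0, 2.0e15, 70.0),
]
summary = aggregate(samples)
print(summary)
```

One consequence of this split is that a single hot or idle Node shifts the averages only slightly, while a Node that drops out lowers the FP8 total directly.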

GPU tray visualization

This panel on the lower left shows a visual layout of the enclosed rack, with each Node labeled by name. Color coding indicates the Node's NLCC state, Kubernetes state, and GPU temperature. Hover over any indicator for more details, or click a Node to view its full status.

Refer to the legend at the bottom of the panel to interpret the color codes.

Time-series graphs

These panels on the upper right show time-series graphs for aggregate NVLink bandwidth and GPU utilization across the cabinet. Use them to monitor performance trends and detect anomalies. Hover over any graph to view detailed data points. Use the list of Nodes beside each graph to filter the data by individual Node.

Rack details

This panel on the lower right provides detailed information about each Node in the rack.

Column | Description | Values
Node | The name of the Node |
Deviceslot | The Deviceslot where the Node is installed |
RU | The Rack Unit where the Node is physically located in the cabinet |
NLCC State | The current Node lifecycle state | Any valid Node lifecycle state
Reserved | The organization ID for this Node |
K8s ready | Indicates whether the Node is online in Kubernetes |
  • True: Online
  • False: Not online
Avg GPU Temp | Average temperature of all GPUs on the Node |
GPU P2P | The peer-to-peer communication status between GPUs on the Node. This is required for any form of NVLink communication, both intra- and inter-tray. |
  • OK: The GPUs are peering correctly.
  • X: The GPU is peering with itself; this is ignored.
  • NS: Not Supported
HPC Verification | Result of the most recent HPC (High Performance Computing) validation checks run on the Node |
  • Passed: The most recent run completed successfully.
  • Failed: The most recent run failed.
  • Not Run: A check has not run yet. This is common for newly delivered Nodes in a CKS cluster, since the verification data is stored on the Kubernetes host, not the Node itself.
Alerts | Alert status for the Node |
  • None
  • Pending
  • Firing
Active | Whether the Node is currently running a workload Pod that is neither interruptible nor part of a DaemonSet |
  • Yes: The Node is running a workload.
  • No: The Node is not running a workload.
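
As a rough illustration of how the column values above might be read together, here is a minimal Python sketch. The `RackRow` and `Pod` classes and their fields are hypothetical stand-ins for rows copied from the panel and are not part of any CoreWeave or Grafana API. Which values count as needing attention is an assumption of this sketch: it flags only K8s ready False, HPC Verification Failed, and Alerts Firing, and deliberately does not flag Not Run, which the table notes is common for newly delivered Nodes. The `is_active` function restates the definition of the Active column.

```python
from dataclasses import dataclass


@dataclass
class RackRow:
    """Hypothetical copy of one row from the rack details panel."""
    node: str
    k8s_ready: bool
    hpc_verification: str  # "Passed", "Failed", or "Not Run"
    alerts: str            # "None", "Pending", or "Firing"


def attention_reasons(row):
    """Return the reasons a Node may need a closer look (empty list if none).

    The choice of flagged values is an assumption made for this sketch.
    """
    reasons = []
    if not row.k8s_ready:
        reasons.append("not online in Kubernetes")
    if row.hpc_verification == "Failed":
        reasons.append("HPC verification failed")
    if row.alerts == "Firing":
        reasons.append("alert firing")
    return reasons


@dataclass
class Pod:
    """Hypothetical summary of a Pod scheduled on a Node."""
    name: str
    interruptible: bool
    from_daemonset: bool


def is_active(pods):
    """True if any Pod is neither interruptible nor part of a DaemonSet."""
    return any(not p.interruptible and not p.from_daemonset for p in pods)


# Made-up rows and Pods, for illustration only.
rows = [
    RackRow("node-a", True, "Passed", "None"),
    RackRow("node-b", False, "Failed", "Firing"),
    RackRow("node-c", True, "Not Run", "Pending"),
]
for row in rows:
    print(row.node, attention_reasons(row))

print(is_active([Pod("log-agent", False, True)]))
print(is_active([Pod("log-agent", False, True), Pod("training-job", False, False)]))
```

Under these assumptions, a Node that runs only DaemonSet or interruptible Pods is treated as not Active, which matches the No value in the table.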