
Slurm Block Topology

Monitor Slurm Block Topology

To view the dashboard, go to the Slurm Block Topology dashboard.

Info

For instructions on accessing CoreWeave Grafana dashboards, see Access CoreWeave Grafana Dashboards.

The Slurm Block Topology dashboard shows the availability, allocation, and status of compute Nodes, organized by block topology.

The dashboard is especially helpful for visualizing and understanding metrics for GPUs connected through rack-level NVLink fabrics, such as those in the GB200 and GB300 instance types.

For documentation related to segment size, see the segments section in Topology/Block Scheduling in Slurm.

| Panel | Description |
| --- | --- |
| Idle Nodes by Segment Size | Shows how many idle Nodes are available to jobs at the selected segment size. |
| Idle Nodes by all Segment Sizes | Shows idle Node counts broken down by segment size. |
| Idle Segments by Segment Size | Shows the number of segments of the selected size that can be allocated in each block. Use this panel to find potential scheduling targets: it lists the segments, across all blocks, that have the required number of Nodes. How to use: Select the segment size you want for your job (the recommended maximum segment size is 16), then check this panel to see how your job's segments can be allocated across the blocks in your cluster. |
| Idle Nodes Per Block by Segment Size | Shows the number of Nodes that can be allocated in each block. Helps you decide whether to try a larger segment size for single-rack or multi-rack jobs. How to use: Select a segment size, for example 7, in the selector. The dashboard updates to show all blocks that have at least 7 idle Nodes, along with the total number of Nodes available in each of those blocks. For example, if you need 30 Nodes in total and expect to use 5 blocks, you can quickly check whether there are blocks with 10 or more Nodes available. If there are, you can change the segment size to 10 and use 3 racks instead of 5. Using fewer racks is recommended because communication overhead increases with the number of racks, which reduces performance. |
| Idle Nodes per Block | Shows the count of all idle Nodes in each block, regardless of segment size. |
| Allocated Nodes per Block | Shows the number of Nodes that are currently allocated or in use, aggregated per block. |
| Slurm Nodes Status | Shows the status of all Slurm Nodes. |
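
The arithmetic behind the segment-size panels can be sketched in a few lines of Python. This is only an illustration of the panel descriptions above, not the dashboard's actual queries: the block names and idle counts are made up, the helper functions are hypothetical, and the sketch assumes that a block contributes only whole segments and that a job prefers the blocks with the most free segments.

```python
from math import ceil

# Hypothetical idle-Node counts per block; real values come from the dashboard.
idle_nodes_per_block = {
    "block-a": 10, "block-b": 10, "block-c": 10,
    "block-d": 6, "block-e": 6, "block-f": 3,
}


def idle_segments_per_block(idle, segment_size):
    """Whole segments of the given size that fit in each block's idle Nodes.

    Blocks with fewer idle Nodes than one segment are left out.
    """
    return {b: n // segment_size for b, n in idle.items() if n >= segment_size}


def idle_nodes_at_segment_size(idle, segment_size):
    """Idle Nodes usable when Nodes are handed out only in whole segments."""
    segments = idle_segments_per_block(idle, segment_size)
    return sum(count * segment_size for count in segments.values())


def blocks_needed(idle, segment_size, total_nodes):
    """Fewest blocks that can hold total_nodes in whole segments, or None."""
    segments_required = ceil(total_nodes / segment_size)
    # Take blocks with the most free segments first (an assumed strategy).
    free = sorted(idle_segments_per_block(idle, segment_size).values(), reverse=True)
    placed = 0
    for blocks_used, segments in enumerate(free, start=1):
        placed += segments
        if placed >= segments_required:
            return blocks_used
    return None  # Not enough idle segments at this size.


# A 30-Node job: segment size 6 spreads over 5 blocks, size 10 fits in 3.
print(blocks_needed(idle_nodes_per_block, 6, 30))   # 5
print(blocks_needed(idle_nodes_per_block, 10, 30))  # 3
```

With these sample counts, raising the segment size from 6 to 10 reduces the job from 5 blocks to 3, which mirrors the 30-Node example in the Idle Nodes Per Block by Segment Size description.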
| Panel | Description |
| --- | --- |
| Nodes Allocated by Job per Block | Shows the number of Nodes allocated in each block to each running job in the cluster. |

Image: the Nodes Allocated by Job per Block panel.

| Panel | Description |
| --- | --- |
| Nodes Allocated by Block per Job | Shows the number of Nodes that each running job uses in each block in the cluster. |

Image: the Nodes Allocated by Block per Job panel.
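
The two job panels appear to present the same Node counts grouped in opposite orders. As a rough sketch of that reading (the record layout, job IDs, and Node names below are hypothetical and not taken from the dashboard), the same allocation records can be counted per block and then per job:

```python
from collections import defaultdict

# Hypothetical (job, block, node) allocation records for running jobs.
allocations = [
    ("job-101", "block-a", "node-001"),
    ("job-101", "block-a", "node-002"),
    ("job-101", "block-b", "node-011"),
    ("job-202", "block-b", "node-012"),
    ("job-202", "block-b", "node-013"),
]


def count_allocations(records):
    """Return Node counts grouped as block -> job and as job -> block."""
    by_block = defaultdict(lambda: defaultdict(int))
    by_job = defaultdict(lambda: defaultdict(int))
    for job, block, _node in records:
        by_block[block][job] += 1
        by_job[job][block] += 1
    # Convert to plain dicts so the result prints and compares cleanly.
    return (
        {block: dict(jobs) for block, jobs in by_block.items()},
        {job: dict(blocks) for job, blocks in by_job.items()},
    )


by_block, by_job = count_allocations(allocations)
print(by_block)  # Node counts for each job, grouped under each block
print(by_job)    # Node counts for each block, grouped under each job
```

The per-block grouping makes it easy to see which jobs share a rack, while the per-job grouping shows how many racks a single job spans.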