Topology/Block Scheduling in Slurm
CoreWeave's GB200 NVL72-powered instances harness the groundbreaking architecture of NVIDIA's GB200 Grace Blackwell Superchip and NVLink Switch System, integrated within a single rack. The NVLink fabric enables direct memory access between GPUs, allowing for ultra-fast communication and efficient data sharing.
To fully leverage the NVL72 architecture, CoreWeave recommends using the Topology/Block Plugin with Slurm. This plugin improves job scheduling by placing Nodes from the same job within the same rack whenever possible, maximizing NVLink fabric performance. Optimizing job placement within the NVLink domain increases resource efficiency and enhances GPU communication—critical for large-scale distributed workloads where minimizing processing time is essential.
Concepts
The Slurm Topology/Block Plugin adds new features for managing job placement within NVLink domains. Key concepts include Blocks, Segments, and the --exclusive=topo option.
When SUNK auto-generates the topology file, it defines Blocks based on the ds.coreweave.com/nvlink.domain label on CKS Nodes. For GB200 Nodes, this label is a globally unique rack identifier, for example DH4-016-US-EAST-02A.
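If you want to check which NVLink domain a given CKS Node belongs to, you can display this label with a standard kubectl query (generic Kubernetes usage, not a SUNK-specific command):

kubectl get nodes -L ds.coreweave.com/nvlink.domain    # prints the label value as an extra column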
In clusters with both NVLink-enabled multi-Node systems and standard Nodes, SUNK creates a topology.conf file that includes Block and switch definitions. This configuration does not affect behavior unless the Topology/Block Plugin is explicitly enabled, ensuring safe integration and runtime flexibility.
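For reference, enabling the plugin amounts to selecting it in slurm.conf; the minimal sketch below shows the relevant setting (in SUNK-managed clusters this is handled through the cluster configuration rather than edited by hand):

# slurm.conf (sketch): select the block topology plugin
TopologyPlugin=topology/block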
Blocks
A Block is a group of Nodes in the same Slurm cluster, defined in the topology.conf file. This file is automatically generated by SUNK, so you don't need to create or manage it manually. It specifies which Nodes belong to each Block and defines the Block size, the minimum number of Nodes in a Block.
Each Block has a unique ID and cannot overlap with others. CoreWeave dynamically updates the topology configuration to reflect changes in Node availability. By default, each NVL72 system corresponds to a Block containing 18 Nodes in the same NVLink domain. If any Nodes become unhealthy, the CoreWeave platform removes them from the SUNK cluster. However, the Block size stays fixed at 18. As a result, some Blocks may temporarily have fewer than 18 available Nodes.
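As an illustration, an auto-generated topology.conf for two NVL72 racks might look roughly like the following sketch. The Node hostnames are hypothetical placeholders, and the exact layout of the generated file may differ:

# topology.conf (sketch): one Block per NVL72 rack, 18 Nodes each
BlockName=block_DH4-016-US-EAST-02A Nodes=gb200-016-[01-18]
BlockName=block_DH4-017-US-EAST-02A Nodes=gb200-017-[01-18]
BlockSizes=18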
Segments
A Segment is a set of Nodes within a Block that are allocated together. It must be contained within a single Block and cannot cross Block boundaries. Segments reduce resource fragmentation by keeping allocations intact. If a job spans multiple Blocks, Slurm schedules equal-sized Segments in each Block. A Block can hold multiple Segments if space allows.
By default, if no segment size is provided, Slurm uses the full Block size.
Best practice: Always use --segment=1 unless your job requires a larger segment. This makes your job more flexible by allowing it to run on any available Nodes across all Blocks, improving scheduling efficiency.
If you omit the segment size and your job needs more Nodes than are free in a single Block, Slurm may fail to schedule it. To prevent this, always set the segment size explicitly.
All examples in this guide follow this best practice by specifying the segment size to ensure reliable scheduling.
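As a sketch of what this looks like in a batch script, the job below requests 10 Nodes with a segment size of 1; the job name, GPU count, and workload command are placeholders:

#!/bin/bash
#SBATCH --job-name=example-job     # placeholder name
#SBATCH --nodes=10                 # fits within one 18-Node Block
#SBATCH --segment=1                # best practice: run on any idle Nodes in any Block
#SBATCH --gpus-per-node=4          # placeholder; adjust to your Node configuration

srun ./workload.sh                 # placeholder command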
Exclusive option
The --exclusive=topo option ensures that only the job being submitted runs in a Block. The job won't share a Block with any other job, and no new jobs can be placed there until the original job completes. Unused Nodes in the Block remain idle. Use this option for benchmarking or to avoid resource competition.
Using --exclusive=topo can result in idle Nodes if your job doesn't fill the entire Block. To avoid wasting resources, apply this option only when needed, such as for benchmarking or when your workload requires exclusive access to a Block without sharing with other jobs.
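The same option can also be set as a batch directive; a minimal sketch:

#SBATCH --exclusive=topo           # reserve the whole Block for this job
#SBATCH --segment=1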
Example Scenarios
The following scenarios show how to use the Topology/Block Plugin to optimize job placement within NVL72 systems. Each diagram shows multiple 18-Node NVL72 systems (Blocks). Node colors represent status:
- Green: Nodes running the requested job (light green indicates a second job or Segment)
- Gray: Unavailable—drained, down, or running unrelated jobs
- White: Idle
A job that fits in a Block
A job requests 10 Nodes, which is less than the Block size. The segment size is set to 1, following best practice.
srun --nodes=10 --segment=1 ...
The plugin allocates 10 Nodes from Block 1. The rest of Block 1 and Block 2 remain idle.
Jobs larger than the Block size
A job requests 20 Nodes, which exceeds the Block size of 18. The segment size is set to 1, following best practice.
srun --nodes=20 --segment=1 ...
Two Nodes are unavailable in Block 1, so the plugin allocates 16 Nodes from Block 1 and 4 Nodes from Block 2. The rest of Block 2 remains idle.
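To confirm how the Nodes were split across Blocks, you can inspect the job's Node list with standard Slurm commands (shown here with a placeholder job ID):

scontrol show job 12345 | grep -i nodelist    # 12345 is a placeholder job ID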
Two jobs in a single Block
Two jobs are submitted. The first requests 8 Nodes; the second requests 4. The segment size is set to 1 for both jobs, following best practice.
srun --nodes=8 --segment=1 ...
srun --nodes=4 --segment=1 ...
The plugin places both jobs in Block 1. The rest of Block 1 and Block 2 remain idle.
Realistic training example: A job with two Segments
A job requests 32 Nodes, divided into two Segments of 16 Nodes each.
srun --nodes=32 --segment=16 ...
The plugin places the first Segment in Block 1 and the second Segment in Block 2.
Two Nodes are left idle in Block 1 because Segments cannot span Blocks.
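A batch-script version of this request might look like the following sketch; the task counts and training command are placeholders:

#!/bin/bash
#SBATCH --nodes=32                 # two full 16-Node Segments
#SBATCH --segment=16               # each Segment must fit entirely within one Block
#SBATCH --ntasks-per-node=4        # placeholder: one task per GPU
#SBATCH --gpus-per-node=4          # placeholder

srun python train.py               # placeholder training command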
Exclusive jobs
A job requests 16 Nodes with the --exclusive=topo option. The segment size is set to 1, following best practice.
srun --nodes=16 --segment=1 --exclusive=topo ...
The plugin places the job in Block 2 because --exclusive=topo prevents it from running in Block 1 while a competing job (light green) is active. The rest of Block 2 remains idle, and no other jobs are allowed there until this job completes.
Combining Segments and the exclusive option
A job requests 16 Nodes, divided into two 8-Node Segments, using --exclusive=topo.
srun --nodes=16 --segment=8 --exclusive=topo ...
The plugin places both Segments (green and light green) in Block 2 because Block 1 is unavailable due to an existing job (dark green). The rest of Block 2 remains idle, and no other jobs are allowed there until both job Segments complete.
Identify idle Nodes in a Block
You can use the script at /usr/share/sunk/bin/segment-calc.sh on the Slurm login node to check for idle Nodes within a Block. By default, it reports idle Nodes per topology Block. You can optionally provide the --segment N argument to report only those Blocks with at least N idle Nodes available.
Example usage
$ ./segment-calc.sh --segment 6
Block                                 IDLE / SIZE
-----------------------------------  ------ -----
block_DH4-004-US-EAST-02A                 8 /  18
block_DH4-012-US-EAST-02A                 8 /  18
block_DH4-013-US-EAST-02A                 8 /  18
block_DH4-017-US-EAST-02A                 8 /  18
block_DH4-043-US-EAST-02A                18 /  18

Blocks with ≥ 6 idle nodes          : 5
Whole segments schedulable          : 7
Total nodes those segments consume  : 42
This output shows the Blocks with at least 6 idle Nodes, along with the total number of whole segments that can be scheduled and the total Nodes those segments would consume.
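For example, based on the output above you could size a job to fit the reported gaps; the batch script name is a placeholder:

# Four Blocks each have 8 idle Nodes, so a 16-Node job split into two 8-Node Segments fits
sbatch --nodes=16 --segment=8 train.sbatch    # train.sbatch is a placeholder script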
More information
To learn more about the Topology/Block Plugin for Slurm, including configuration options and advanced usage, see the Slurm documentation. For help configuring the plugin on your NVL72-powered instances, contact CoreWeave Support.