
Topology/Block Scheduling in Slurm

CoreWeave's GB200 NVL72-powered instances harness the groundbreaking architecture of NVIDIA's GB200 Grace Blackwell Superchip and NVLink Switch System, integrated within a single rack. The NVLink fabric enables direct memory access between GPUs, allowing for ultra-fast communication and efficient data sharing.

To fully leverage the NVL72 architecture, CoreWeave recommends using the Topology/Block Plugin with Slurm. This plugin improves job scheduling by placing Nodes from the same job within the same rack whenever possible, maximizing NVLink fabric performance. Optimizing job placement within the NVLink domain increases resource efficiency and enhances GPU communication—critical for large-scale distributed workloads where minimizing processing time is essential.

Concepts

The Slurm Topology/Block Plugin adds new features for managing job placement within NVLink domains. Key concepts include Blocks, Segments, and the --exclusive=topo option.

When SUNK auto-generates the topology file, it defines Blocks based on the ds.coreweave.com/nvlink.domain label on CKS Nodes. For GB200 Nodes, this label is a globally unique rack identifier—for example, DH4-016-US-EAST-02A.
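
If you have kubectl access to the CKS cluster, you can display this label as a column to see which NVLink domain each Node belongs to. The command below is a minimal sketch; the Node names and label values in your cluster will differ.

# List CKS Nodes with the NVLink domain label shown as an extra column
kubectl get nodes -L ds.coreweave.com/nvlink.domain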

In clusters with both NVLink-enabled multi-Node systems and standard Nodes, SUNK creates a topology.conf file that includes Block and switch definitions. This configuration does not affect behavior unless the Topology/Block Plugin is explicitly enabled, ensuring safe integration and runtime flexibility.

Blocks

A Block is a group of Nodes in the same Slurm cluster, defined in the topology.conf file. This file is automatically generated by SUNK, so you don't need to create or manage it manually. It specifies which Nodes belong to each Block and defines the Block size—the minimum number of Nodes in a Block.

Each Block has a unique ID and cannot overlap with others. CoreWeave dynamically updates the topology configuration to reflect changes in Node availability. By default, each NVL72 system corresponds to a Block containing 18 Nodes in the same NVLink domain. If any Nodes become unhealthy, the CoreWeave platform removes them from the SUNK cluster. However, the Block size stays fixed at 18. As a result, some Blocks may temporarily have fewer than 18 available Nodes.
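
For reference, a Block definition in the generated file uses the standard Slurm topology/block syntax. The excerpt below is only an illustration: the Block names mirror the rack identifiers described above, but the hostnames are placeholders, and the real file is created and maintained by SUNK, so you never need to edit it by hand.

# Illustrative topology.conf excerpt (auto-generated by SUNK; hostnames are placeholders)
BlockName=block_DH4-016-US-EAST-02A Nodes=gb200-r16-[01-18]
BlockName=block_DH4-017-US-EAST-02A Nodes=gb200-r17-[01-18]
# Base Block size used by the scheduler
BlockSizes=18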

Segments

A Segment is a set of Nodes within a Block that are allocated together. It must be contained within a single Block and cannot cross Block boundaries. Segments reduce resource fragmentation by keeping allocations intact. If a job spans multiple Blocks, Slurm schedules equal-sized Segments in each Block. A Block can hold multiple Segments if space allows.

If no segment size is provided, Slurm defaults to the full Block size.

Important

Best practice: Always use --segment=1 unless your job requires a larger segment. This makes your job more flexible by allowing it to run on any available Nodes across all Blocks, improving scheduling efficiency.

If you omit the segment size and your job needs more Nodes than are free in a single Block, Slurm may fail to schedule it. To prevent this, always set the segment size explicitly.

All examples in this guide follow this best practice by specifying the segment size to ensure reliable scheduling.
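
For example, a batch script that follows this best practice might look like the sketch below; the job name and training command are placeholders for your own workload.

#!/bin/bash
#SBATCH --job-name=llm-train        # placeholder job name
#SBATCH --nodes=10
#SBATCH --segment=1                 # keep Segments small so the job can land on any idle Nodes

# Launch the workload across the allocated Nodes (placeholder command)
srun ./train.sh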

Exclusive option

The --exclusive=topo option ensures that only the job being submitted runs in a Block. The job won't share a Block with any other job, and no new jobs can be placed there until the original job completes. Unused Nodes in the Block remain idle. Use this option for benchmarking or to avoid resource competition.

Important

Using --exclusive=topo can result in idle Nodes if your job doesn't fill the entire Block. To avoid wasting resources, apply this option only when needed—such as for benchmarking or when your workload requires exclusive access to a Block without sharing with other jobs.
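
For example, a benchmarking run that needs a Block to itself could be submitted with a script like the following sketch; the benchmark command is a placeholder.

#!/bin/bash
#SBATCH --nodes=18                  # one full NVL72 Block
#SBATCH --segment=1
#SBATCH --exclusive=topo            # no other jobs may share the Block

srun ./nccl-benchmark.sh            # placeholder benchmark command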

Example scenarios

The following scenarios show how to use the Topology/Block Plugin to optimize job placement within NVL72 systems. Each diagram shows multiple 18-Node NVL72 systems (Blocks). Node colors represent status:

  • Green: Nodes running the requested job (light green indicates a second job or Segment)
  • Gray: Unavailable—drained, down, or running unrelated jobs
  • White: Idle

A job that fits in a Block

A job requests 10 Nodes, which is less than the Block size. The segment size is set to 1, following best practice.

srun --nodes=10 --segment=1 ...

The plugin allocates 10 Nodes from Block 1. The rest of Block 1 and Block 2 remain idle.

Job requests 10 Nodes.

Jobs larger than the Block size

A job requests 20 Nodes, which exceeds the Block size of 18. The segment size is set to 1, following best practice.

srun --nodes=20 --segment=1 ...

Two Nodes are unavailable in Block 1, so the plugin allocates 16 Nodes from Block 1 and 4 Nodes from Block 2. The rest of Block 2 remains idle.

Job requests 20 Nodes

Two jobs in a single Block

Two jobs are submitted. The first requests 8 Nodes; the second requests 4. The segment size is set to 1 for both jobs, following best practice.

srun --nodes=8 --segment=1 ... && srun --nodes=4 --segment=1 ...

The plugin places both jobs in Block 1. The rest of Block 1 and Block 2 remain idle.

Two jobs can fit in a single Block

Realistic training example: A job with two Segments

A job requests 32 Nodes, divided into two Segments of 16 Nodes each.

srun --nodes=32 --segment=16 ...

The plugin places the first Segment in Block 1 and the second Segment in Block 2.

Two Nodes are left idle in Block 1 because Segments cannot span Blocks.

A job with two Segments

Exclusive jobs

A job requests 16 Nodes with the --exclusive=topo option. The segment size is set to 1, following best practice.

srun --nodes=16 --segment=1 --exclusive=topo ...

The plugin places the job in Block 2 because --exclusive=topo prevents it from running in Block 1 while a competing job (light green) is active. The rest of Block 2 remains idle, and no other jobs are allowed there until this job completes.

Exclusive jobs

Combining Segments and the exclusive option

A job requests 16 Nodes, divided into two 8-Node Segments, using --exclusive=topo.

srun --nodes=16 --segment=8 --exclusive=topo ...

The plugin places both Segments (green and light green) in Block 2 because Block 1 is unavailable due to an existing job (dark green). The rest of Block 2 remains idle, and no other jobs are allowed there until both job Segments complete.

Combining Segments and Exclusive jobs

Identify idle Nodes in a Block

You can use the script at /usr/share/sunk/bin/segment-calc.sh on the Slurm login node to check for idle Nodes within a Block. By default, it reports idle Nodes per topology Block. You can optionally provide the --segment N argument to report only those Blocks with at least N idle Nodes available.

Example usage

$ ./segment-calc.sh --segment 6
Block                                 IDLE / SIZE
----------------------------------- ------ -----
block_DH4-004-US-EAST-02A                8 /  18
block_DH4-012-US-EAST-02A                8 /  18
block_DH4-013-US-EAST-02A                8 /  18
block_DH4-017-US-EAST-02A                8 /  18
block_DH4-043-US-EAST-02A               18 /  18
Blocks with ≥ 6 idle nodes          : 5
Whole segments schedulable          : 7
Total nodes those segments consume  : 42

This output shows the Blocks with at least 6 idle Nodes, along with the total number of whole segments that can be scheduled and the total Nodes those segments would consume.
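
You can also inspect the Blocks that Slurm has loaded from topology.conf directly with scontrol; the exact fields in the output depend on your Slurm version.

# Print the topology records known to the Slurm controller
scontrol show topology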

More information

To learn more about the Topology/Block Plugin for Slurm, including configuration options and advanced usage, see the Slurm documentation. For help configuring the plugin on your NVL72-powered instances, contact CoreWeave Support.