
Block Scheduling in Slurm

CoreWeave's GB200 NVL72-powered instances harness the groundbreaking architecture of NVIDIA's GB200 Grace Blackwell Superchip and NVLink Switch System in a single rack. The NVLink fabric provides direct memory access between GPUs, enabling ultra-fast communication and data sharing.

To fully leverage the NVL72 architecture, CoreWeave recommends using the Topology/Block Plugin with Slurm. This plugin optimizes job scheduling to maximize NVLink fabric performance by attempting to place Nodes of the same job within the same rack whenever possible. Improving job placement within the NVLink domain enhances resource efficiency and GPU communication, which is critical for large-scale distributed computing tasks where reducing processing time is essential.

Concepts

The Topology/Block Plugin for Slurm introduces new concepts and command-line options to manage the placement of jobs within the NVLink domain. These concepts include Blocks, Segments, and the --exclusive=topo option for job submission.

Blocks

A Block is a group of Nodes within the same Slurm cluster, defined in Slurm's topology.conf file. This file is automatically generated by SUNK, so customers do not need to create or update it. In addition to specifying which Nodes belong to each Block, the file also defines the Block Size—the minimum number of Nodes required in a Block. While a Block must contain at least the Block Size, it can include more Nodes if needed.

Blocks cannot overlap with each other, and each Block is assigned a unique identifier. CoreWeave periodically updates the topology configuration as required to match the physically available Nodes, such as when a Node fails or is drained. CoreWeave's default implementation is one Block per NVL72 system, which can contain between 16 and 18 Nodes, depending on whether some Nodes are unavailable in that system.

The Topology/Block Plugin can also define higher-level blocks for job placement if needed. A higher-level block size must be a power-of-two multiple of the base block size. For example, if the base block size is 18, then a higher-level block size of 36 can also be defined, which consists of two base blocks.
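For illustration only, a topology.conf describing two 18-Node base Blocks plus a 36-Node higher-level block size might look like the following sketch. The node names are placeholders, and because SUNK generates this file automatically, there is no need to write it by hand; the plugin itself is enabled with TopologyPlugin=topology/block in slurm.conf.

```
# topology.conf -- generated by SUNK; shown here only to illustrate the shape
BlockName=block1 Nodes=node[001-018]
BlockName=block2 Nodes=node[019-036]
# Base block size of 18, plus a higher-level block of two base blocks
BlockSizes=18,36
```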

Segments

A Segment is a group of Nodes within a job that must be placed together in a single Block. A Segment's size must be less than or equal to the Block size, and a Segment cannot be split across multiple Blocks. Multiple Segments from the same job can share a Block if space permits, or be placed in different Blocks.

Exclusive option

The --exclusive=topo option for job submission ensures that the submitted job has its Blocks to itself: the job is not placed in a Block that already contains another job, and once it is placed, no other jobs are scheduled in its Blocks until it completes. Unused Nodes in those Blocks are left idle. This option is useful for benchmarking or for avoiding competition for resources from other jobs.
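The same option can be passed in a batch script. The sketch below assumes --exclusive=topo is accepted by sbatch as well as srun; the job name and the ./benchmark executable are placeholders, not part of CoreWeave's documentation.

```
#!/bin/bash
#SBATCH --job-name=bench
#SBATCH --nodes=10
#SBATCH --exclusive=topo
# Launch the workload across the exclusive allocation.
srun ./benchmark
```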

Example scenarios

The following examples illustrate different srun job creation scenarios with the Topology/Block Plugin to show how Blocks, Segments, and the --exclusive=topo option can be used together. Nodes are color-coded to represent their status:

  • Green for Nodes that are running our requested jobs
  • Gray for Nodes that are drained, down, running other jobs, or otherwise unavailable
  • White for idle Nodes that are not running any jobs

The scenarios that follow use diagrams to represent NVL72 systems with 18 Nodes each.

A job is running on ten Nodes in Block 1. Nine Nodes are unavailable in Block 2.

In the example above, Block 1 has a job that occupies ten Nodes (green), Block 2 has nine unavailable Nodes (gray), and the remaining Nodes are available for use (white). Use this color guide in the scenarios that follow.
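To make the placement rules in these scenarios concrete, here is a deliberately simplified Python sketch of the two ideas at work: an ordinary job may spill across Blocks, while a Segment must fit wholly inside one Block. This is not Slurm's implementation; the Block names, fill order, and data structures are illustrative assumptions.

```python
def place_job(free, nodes_needed):
    """Place a job that may spill across Blocks.

    free: dict mapping Block name -> number of free Nodes (mutated in place).
    Returns a dict mapping Block name -> Nodes taken from that Block.
    """
    placement = {}
    for name in sorted(free):  # fill Blocks in order, spilling as needed
        if nodes_needed == 0:
            break
        take = min(free[name], nodes_needed)
        if take > 0:
            placement[name] = take
            free[name] -= take
            nodes_needed -= take
    if nodes_needed > 0:
        raise RuntimeError("not enough free Nodes")
    return placement


def place_segments(free, num_segments, segment_size):
    """Place a job's Segments; each Segment must fit wholly in one Block."""
    placement = []
    for _ in range(num_segments):
        for name in sorted(free):
            if free[name] >= segment_size:  # Segment cannot be split
                free[name] -= segment_size
                placement.append(name)
                break
        else:
            raise RuntimeError("no Block can hold this Segment")
    return placement
```

For instance, in the "Job larger than Block size" scenario below, where Block 1 has fourteen free Nodes, `place_job({"block1": 14, "block2": 18}, 20)` yields `{"block1": 14, "block2": 6}`.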

Job larger than Block size

In this scenario, a job requests twenty Nodes, which is larger than the Block size of eighteen Nodes.

Example:
$ srun --nodes=20 ...

The Topology/Block Plugin uses fourteen Nodes in Block 1, skips the unavailable Nodes, and partially fills Block 2 with six Nodes. The remaining Nodes are left idle.

Job requests 20 Nodes

Two jobs that fit within a single Block

In this scenario, two jobs are submitted. The first job requests seven Nodes. The second job requests four Nodes. All Nodes are available for use.

Example:
$ srun --nodes=7 ...
$ srun --nodes=4 ...

The Topology/Block Plugin can fit both jobs into Block 1. The first job is shown in green and the second job in lighter green.

Two jobs fit within a single Block

Two jobs that span Blocks

In this scenario, two jobs are submitted, each requesting nineteen Nodes, which is larger than the Block size. Block 2 has two unavailable Nodes.

Example:
$ srun --nodes=19 ...
$ srun --nodes=19 ...

The Topology/Block Plugin places the first job's eighteen Nodes in Block 1 plus a single Node in Block 2. The second job (lighter green) uses the fifteen remaining available Nodes in Block 2 and four Nodes in Block 3. The remaining Nodes are left idle.

Two jobs that span Blocks

A job with two Segments

In this scenario, a job requests twenty total Nodes, with a Segment length of ten Nodes.

Example:
$ srun --nodes=20 --segment=10 ...

The job has two Segments of ten Nodes each. No single Block can hold both Segments, because twenty Nodes exceed the eighteen-Node Block size, and Block 2 has too many unavailable Nodes to hold even one. The Topology/Block Plugin therefore runs the first Segment in Block 1 and the second Segment (lighter green) in Block 3.

A job with two Segments

Exclusive jobs

The previous scenarios showed how the Topology/Block Plugin can place jobs within Blocks that run alongside other jobs. The --exclusive=topo option ensures that no other jobs may be placed in the same Block, and unused Nodes are left idle.

In this scenario, a job requests ten Nodes with the --exclusive=topo option. A competing job (gray) is currently running in Block 1.

Example:
$ srun --nodes=10 --exclusive=topo ...

Because the --exclusive=topo option is set, this job cannot share Block 1 with the job already running there. It therefore runs in Block 2, and no other job is allowed into Block 2 while it runs.

Exclusive jobs

Combining Segments and the Exclusive option

In this scenario, a job requests ten Nodes with a Segment length of five and the --exclusive=topo option.

Example:
$ srun --nodes=10 --segment=5 --exclusive=topo ...

The job's ten Nodes are split into two Segments of five Nodes each. Because another job is already scheduled in Block 1, the exclusive job cannot use that Block. Note that --exclusive=topo excludes other jobs, not other Segments of the same job, so both Segments may still share a Block.

As a result, the Topology/Block Plugin runs both Segments (green, and lighter green) in Block 2. No other jobs will be scheduled on that Block while this job runs.

Combining Segments and Exclusive jobs

More information

For more information on the Topology/Block Plugin for Slurm, including configuration options and advanced usage, see the Slurm documentation. For assistance with configuring the Topology/Block Plugin for your NVL72-powered instances, contact CoreWeave Support.