Resource sharing rules
Before choosing a configuration, understand what can and cannot be shared:
- Slurm manages GPUs exclusively. The SUNK Pod Scheduler converts Pod nvidia.com/gpu requests into Slurm GRES allocations (for example, gres:gpu:h100:4). Slurm assigns specific GPU indices to each job, so no two jobs receive the same physical GPU. Two SUNK-scheduled Pods from the same Slurm user can share a Node and use different GPUs with exclusive: "user".
- CPU and memory can be shared between Kubernetes Pods and Slurm jobs, as long as the slurmd container resource requests are lowered. See Lower slurmd resource requests.
- Pods must not use Guaranteed QoS with CPU resources. When CPU requests equal limits, this triggers static CPU allocation, which pins CPU cores for Kubernetes and causes resource contention in Slurm. This causes Slurm nodes to drain.
- A Node can host both GPU and CPU-only workloads. For example, a Slurm job using all GPUs can coexist with a SUNK-scheduled CPU-only Pod, as long as CPU and memory are available.
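The Guaranteed QoS rule can be checked before a Pod is submitted. The following is a minimal sketch, assuming the standard Kubernetes QoS classification (Guaranteed when every container's CPU and memory requests equal its limits, with a missing request defaulting to the limit). The function name qos_class and the sample values are illustrative, and quantities are compared as plain strings, so "1" and "1000m" are treated as different.

```python
def qos_class(containers):
    """Classify a Pod from a list of per-container 'resources' dicts."""
    any_set = False
    guaranteed = True
    for resources in containers:
        limits = resources.get("limits", {})
        # Kubernetes defaults a missing request to the matching limit.
        requests = {**limits, **resources.get("requests", {})}
        for name in ("cpu", "memory"):
            if name in requests or name in limits:
                any_set = True
            # Guaranteed needs a limit, and a request equal to it.
            if name not in limits or requests.get(name) != limits[name]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# Hypothetical Pod: no CPU limit, so it cannot be Guaranteed.
shared_pod = [{"requests": {"cpu": "4", "memory": "8Gi"},
               "limits": {"memory": "16Gi"}}]
print(qos_class(shared_pod))  # Burstable
```

A Pod that reports Guaranteed here is one to revisit before placing it on a Node shared with Slurm, for example by dropping the CPU limit.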
Common scenarios
Choose the scenario that matches your workload. Each includes the annotations and resource configuration you need.
The specific CPU, memory, and GPU values in these examples depend on your Node type and slurmd configuration. To find the right values for your cluster, see Check available resources.
GPU Node scenarios
The following scenarios cover Pods that run on Nodes with GPUs.
Full-Node GPU Pod
Use when: A single Pod uses all GPUs on the Node, or you want full Node isolation. Set exclusive: "none" to prevent any other Slurm job from sharing the Node.
Keep the Pod’s requests plus the slurmd container’s requests within the Node’s allocatable resources.
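A full-Node GPU Pod might be assembled as in the sketch below. The annotation key comes from the Exclusive annotation values section. Everything else is an assumption for illustration: that the SUNK Pod Scheduler is selected through spec.schedulerName, the placeholder scheduler name, the busybox image, and the CPU, memory, and GPU numbers. The CPU limit is left unset so the Pod does not get Guaranteed QoS.

```python
import json

EXCLUSIVE_ANNOTATION = "sunk.coreweave.com/exclusive"
EXCLUSIVE_VALUES = {"none", "ok", "user", "mcs", "topo"}


def build_pod(name, scheduler_name, exclusive, cpu, memory, gpus=0,
              image="busybox"):
    """Return a Pod manifest as a dict, annotated with an exclusive mode."""
    if exclusive not in EXCLUSIVE_VALUES:
        raise ValueError(f"unknown exclusive value: {exclusive!r}")
    # Requests only for CPU and memory: no CPU limit, so not Guaranteed QoS.
    resources = {"requests": {"cpu": cpu, "memory": memory}}
    if gpus:
        resources["limits"] = {"nvidia.com/gpu": gpus}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name,
                     "annotations": {EXCLUSIVE_ANNOTATION: exclusive}},
        "spec": {
            "schedulerName": scheduler_name,  # hypothetical placeholder value
            "containers": [{"name": "main", "image": image,
                            "resources": resources}],
        },
    }


# Hypothetical 8-GPU Node; substitute values from your own cluster.
pod = build_pod("full-node-gpu", "example-slurm-scheduler", "none",
                cpu="16", memory="64Gi", gpus=8)
print(json.dumps(pod, indent=2))
```

The printed JSON can be piped to a tool that accepts Pod manifests once the placeholder values are replaced.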
Multiple partial-GPU Pods sharing a Node
Use when: You want to run multiple Pods that each use a subset of a Node’s GPUs (for example, two Pods each using 4 of 8 GPUs). Create a dedicated Slurm user for Kubernetes Pods and set exclusive: "user". This lets SUNK-scheduled Pods share Nodes with each other while keeping other Slurm users off those Nodes.
Lower slurmd resource requests and enable memory tracking to prevent scheduling failures.
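The sharing rule behind exclusive: "user" and the other modes can be pictured with a small model. This is a simplified sketch of the behavior listed under Exclusive annotation values, not Slurm's implementation: it assumes two jobs can share a Node only when each job's mode tolerates the other, and the Job class, field names, and user names are hypothetical. The "topo" value is left out because it is reserved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Job:
    user: str
    exclusive: str
    mcs_label: Optional[str] = None


def _tolerates(job, other):
    """Does job's exclusive mode allow 'other' on the same Node?"""
    if job.exclusive == "none":
        return False                      # whole Node, no sharing
    if job.exclusive == "ok":
        return True                       # share with anyone
    if job.exclusive == "user":
        return job.user == other.user     # same Slurm user only
    if job.exclusive == "mcs":
        return job.mcs_label is not None and job.mcs_label == other.mcs_label
    raise ValueError(f"unsupported exclusive value: {job.exclusive!r}")


def can_share_node(a, b):
    """Both jobs must tolerate each other."""
    return _tolerates(a, b) and _tolerates(b, a)


pod_a = Job(user="k8s-pods", exclusive="user")
pod_b = Job(user="k8s-pods", exclusive="user")
other = Job(user="alice", exclusive="ok")
print(can_share_node(pod_a, pod_b))  # True
print(can_share_node(pod_a, other))  # False
```

In this model, the two Pods from the dedicated user share a Node, while a job from another user is kept off even though that job itself would accept sharing.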
CPU-only Pod alongside GPU workloads
Use when: You need to run a CPU-only workload (monitoring agent, data preprocessing, sandbox environment) on a Node that also runs GPU workloads. Set exclusive: "ok" and do not request GPUs.
Lower slurmd resource requests to make CPU and memory available for the Pod.
CPU Node scenarios
The same exclusive annotation controls sharing on CPU-only Nodes. The difference is that GPU GRES allocation isn’t a factor, so the main concern is CPU and memory sharing.
Full-Node CPU Pod
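Use when: A Pod needs the entire CPU Node with no other workloads. As with the full-Node GPU Pod, exclusive: "none" keeps other jobs off the Node. Size the Pod’s requests to fit alongside the slurmd container requests.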
Share a CPU Node between Pods and Slurm jobs
Use when: You want to run SUNK-scheduled Pods and Slurm jobs on the same CPU Node, sharing the CPU and memory pool. Use exclusive: "user" with a dedicated Slurm user so that SUNK Pods and Slurm jobs from the same user can coexist.
Alternatively, use exclusive: "ok" to allow sharing with any Slurm user.
Lower slurmd resource requests and enable memory tracking to prevent oversubscription on shared CPU Nodes.
Configure shared resources
The preceding scenarios reference two settings that you must adjust before SUNK-scheduled Pods can share a Node with Slurm jobs. Adjusting these settings ensures the kubelet has enough allocatable resources for the additional Pods. The following sections describe each setting in detail.
Check available resources
The resource values you use for Pod requests and slurmd configuration depend on your Node type. Different GPU and CPU Nodes have different amounts of allocatable CPU and memory. Before configuring resource sharing, check your Node’s capacity.
Then check what slurmd currently requests. The difference between the Node’s allocatable resources and the requests of slurmd and other containers (such as sssd, munged, and user-lookup) determines how much room is available for SUNK-scheduled Pods.
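The arithmetic behind that room can be sketched as follows: subtract the requests of the containers already on the Node from the Node's allocatable CPU and memory. This is a minimal illustration, assuming Kubernetes-style quantities with binary memory suffixes only. All function names and the sample figures (128 CPUs, 1000Gi, the per-container requests) are hypothetical and should be replaced with values read from your own cluster.

```python
# Binary suffixes only; decimal suffixes such as "G" are not handled here.
_MEMORY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def cpu_millicores(quantity):
    """Convert a CPU quantity such as "4" or "500m" to millicores."""
    text = str(quantity)
    if text.endswith("m"):
        return int(text[:-1])
    return round(float(text) * 1000)


def memory_bytes(quantity):
    """Convert a memory quantity such as "10Gi" or "1024" to bytes."""
    text = str(quantity)
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if text.endswith(suffix):
            return round(float(text[:-len(suffix)]) * factor)
    return int(text)


def headroom(allocatable, container_requests):
    """Return (millicores, bytes) left after existing container requests."""
    cpu = cpu_millicores(allocatable["cpu"])
    memory = memory_bytes(allocatable["memory"])
    for requests in container_requests:
        cpu -= cpu_millicores(requests.get("cpu", "0"))
        memory -= memory_bytes(requests.get("memory", "0"))
    return cpu, memory


def fits(pod_requests, free):
    """True if the Pod's requests fit in the remaining (cpu, memory)."""
    free_cpu, free_memory = free
    return (cpu_millicores(pod_requests.get("cpu", "0")) <= free_cpu
            and memory_bytes(pod_requests.get("memory", "0")) <= free_memory)


# Hypothetical Node and container requests.
allocatable = {"cpu": "128", "memory": "1000Gi"}
existing = [
    {"cpu": "4", "memory": "10Gi"},       # slurmd
    {"cpu": "100m", "memory": "128Mi"},   # sssd
    {"cpu": "100m", "memory": "64Mi"},    # munged
]
free = headroom(allocatable, existing)
print(free[0], "millicores free")
print(fits({"cpu": "16", "memory": "64Gi"}, free))
```

A Pod whose requests do not fit in this remainder is the kind the kubelet would reject, so the check is worth running with real numbers before choosing request values.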
Lower slurmd resource requests
The default SUNK NodeSet configuration requests most of the Node’s CPU and memory for the slurmd container. This leaves little room for other Pods and causes OutOfcpu or OutOfMemory kubelet rejections.
Lower the slurmd container’s requests while keeping limits high so Slurm jobs can still use the full Node. The specific values depend on your Node type, but a common starting point is to set requests to a small fraction of the Node’s total resources.
For example, requesting 10Gi of memory for slurmd frees the rest for SUNK-scheduled Pods, while a high memory limit ensures Slurm jobs can still use the full Node memory. Adjust the limits.memory value to match your Node type’s total memory.
For an example manifest that changes slurmd resources, see Configure compute nodes.
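As a rough illustration of the low-request, high-limit pattern, the sketch below builds a resources block of that shape. The 10Gi request follows the figure mentioned in this section. The helper name, the 4-CPU request, and the 1000Gi Node size are hypothetical, and the sketch does not show where the block belongs in a NodeSet configuration.

```python
def slurmd_resources(request_cpu, request_memory_gi, node_memory_gi):
    """Low requests for Kubernetes scheduling, high memory limit for Slurm."""
    if request_memory_gi > node_memory_gi:
        # Kubernetes rejects a request that is larger than its limit.
        raise ValueError("memory request cannot exceed the memory limit")
    return {
        "requests": {"cpu": str(request_cpu),
                     "memory": f"{request_memory_gi}Gi"},
        # The limit tracks the Node's total memory so jobs can use all of it.
        "limits": {"memory": f"{node_memory_gi}Gi"},
    }


# Hypothetical Node with 1000Gi of memory.
resources = slurmd_resources(request_cpu=4, request_memory_gi=10,
                             node_memory_gi=1000)
print(resources)
```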
Enable memory tracking in Slurm
By default, Slurm’s SelectTypeParameters is set to CR_Core, which does not track memory as a consumable resource. This lets multiple jobs oversubscribe memory, leading to out-of-memory (OOM) errors.
When sharing Nodes, change SelectTypeParameters to a memory-aware value, such as CR_CPU_Memory.
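One way to confirm the setting is to inspect the rendered slurm.conf text. The sketch below is a minimal check, assuming the configuration is available as a string and that memory-aware options are the ones ending in _Memory (for example CR_Core_Memory or CR_CPU_Memory). The function names and the sample configuration are illustrative, and how SUNK exposes this setting is not covered here.

```python
def select_type_parameters(conf_text):
    """Return the SelectTypeParameters options found in slurm.conf text."""
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "selecttypeparameters":
            return [opt.strip() for opt in value.split(",") if opt.strip()]
    return []


def tracks_memory(options):
    """True if any option makes memory a consumable resource."""
    return any(opt.upper().endswith("_MEMORY") for opt in options)


# Hypothetical configuration fragments.
before = "SelectType=select/cons_tres\nSelectTypeParameters=CR_Core\n"
after = "SelectType=select/cons_tres\nSelectTypeParameters=CR_CPU_Memory\n"
print(tracks_memory(select_type_parameters(before)))  # False
print(tracks_memory(select_type_parameters(after)))   # True
```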
Exclusive annotation values
The sunk.coreweave.com/exclusive annotation (SUNK v5.7.0 and later) maps directly to Slurm’s --exclusive option. It accepts the following string values:
| Value | Slurm behavior | When to use |
|---|---|---|
| "none" | The Node is allocated exclusively to this job. No other jobs can share the Node. | Full-Node GPU Pods, or when you want complete isolation. |
| "ok" | The job can share the Node with any other job, regardless of user or account. | CPU-only workloads sharing Nodes with GPU workloads. |
| "user" | The job can share the Node only with jobs from the same Slurm user. | Multiple SUNK Pods each using a subset of a Node’s GPUs. This is the recommended approach for partial-GPU Pods. |
| "mcs" | The job can share the Node only with jobs that have the same MCS (Multi-Category Security) label. | When using Slurm MCS labels to group workloads by tenant or project. |
| "topo" | Reserved for topology-based scheduling. | Consult CoreWeave support before using this value. |
The "none" value name can be misread. In Slurm’s --exclusive option, none means “exclusive mode is on, and the sharing override is none,” meaning Slurm allows no sharing. It does not mean “no exclusivity.”
GPU allocation
Slurm manages GPU allocation exclusively through GRES. The SUNK Pod Scheduler converts Pod GPU requests into Slurm GRES allocations, and Slurm assigns specific GPU indices to each job. Don’t schedule GPU Pods through the standard Kubernetes scheduler on Slurm Nodes, because this bypasses Slurm’s GRES tracking and causes GPU conflicts. When scheduling GPU Pods, choose one of these approaches:
- Full-Node exclusive with exclusive: "none" when the Pod uses all GPUs.
- Per-user exclusive with exclusive: "user" and a dedicated user-id when multiple Pods each use a subset of GPUs. This is the recommended approach for partial-GPU workloads.
- Slurm reservation with the sunk.coreweave.com/reservation annotation to dedicate specific Nodes for Kubernetes Pods.
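The first two approaches can be expressed as a simple decision on GPU count. The sketch below is illustrative only: the GRES string follows the gres:gpu:h100:4 form shown on this page, the function names are hypothetical, and the reservation approach is not modeled.

```python
def gres_request(gpu_type, count):
    """Format a GPU request in the gres:gpu:<type>:<count> form."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return f"gres:gpu:{gpu_type}:{count}"


def suggested_exclusive(gpus_requested, gpus_per_node):
    """Pick "none" for a full-Node Pod and "user" for a partial-GPU Pod."""
    if not 1 <= gpus_requested <= gpus_per_node:
        raise ValueError("GPU request must be between 1 and the Node's GPU count")
    return "none" if gpus_requested == gpus_per_node else "user"


# Hypothetical 8-GPU H100 Node.
print(gres_request("h100", 4), suggested_exclusive(4, 8))  # gres:gpu:h100:4 user
print(gres_request("h100", 8), suggested_exclusive(8, 8))  # gres:gpu:h100:8 none
```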
Reserve resources for system Pods
Slurm doesn’t account for resources consumed by DaemonSets, the slurmd container itself, or other system Pods. As a result, Slurm may allocate resources that appear available from Slurm’s perspective but are already consumed from Kubernetes’ perspective, which causes kubelet rejections.
To account for this overhead:
- Ensure slurmd resource requests are set low enough to leave room for SUNK-scheduled Pods.
- Be conservative with resource requests when packing multiple workloads onto a single Node.
- Set SelectTypeParameters to a memory-aware value (such as CR_CPU_Memory) so that Slurm tracks memory as a consumable resource.
Scale-up and scale-down behavior
When you use the SUNK Pod Scheduler with autoscaling, the timing and Pod-placement behaviors described in the following sections affect how shared Nodes fill and drain.
Scale-up
When new Nodes join the cluster, Slurm’s configuration takes about a minute to include them. During this window, the new Nodes aren’t available for scheduling.
Slurm doesn’t bin-pack workloads. When Slurm schedules new Pods, it selects Nodes based on its internal bitmap ordering, which may spread Pods across multiple partially-used Nodes instead of filling one Node before moving to the next. This can lead to GPU fragmentation.
Scale-down
When Nodes are removed, Kubernetes selects which Pods to terminate. Kubernetes doesn’t coordinate this selection with Slurm, so Kubernetes may terminate Pods across multiple Nodes rather than fully drain a single Node. Combined with the lack of bin-packing, this can leave Nodes with partially-used GPU allocations that can’t be reclaimed.
Impact on exclusive mode
Fragmentation has a larger impact with exclusive: "user" or exclusive: "none". In per-user exclusive mode, a Node with even one remaining Pod can’t accept Slurm jobs from other users, which makes unused GPUs on that Node inaccessible. Plan your scaling strategy accordingly.
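The cost of fragmentation can be estimated with a small calculation. The sketch below counts idle GPUs on partially used Nodes, which under exclusive: "user" are out of reach for other users, and counts fully empty Nodes. It is an illustrative model with hypothetical names and an assumed 8 GPUs per Node, not a description of how Slurm or the autoscaler measures anything.

```python
def stranded_gpus(used_per_node, gpus_per_node=8):
    """Idle GPUs on Nodes that still run at least one Pod."""
    return sum(gpus_per_node - used
               for used in used_per_node
               if 0 < used < gpus_per_node)


def empty_nodes(used_per_node):
    """Nodes with no GPUs in use, which are free for any user or removal."""
    return sum(1 for used in used_per_node if used == 0)


# The same 8 GPUs in use, placed two different ways across four Nodes.
packed = [8, 0, 0, 0]
spread = [2, 2, 2, 2]
print(stranded_gpus(packed), empty_nodes(packed))  # 0 3
print(stranded_gpus(spread), empty_nodes(spread))  # 24 0
```

With the spread placement, every Node is held by the per-user exclusive mode, so 24 GPUs sit idle for other users even though only 8 are in use.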