When Kubernetes Pods and Slurm jobs run on the same Kubernetes Nodes, you need to configure how they share GPU, CPU, and memory resources. This page is for cluster administrators and workload operators who use the SUNK Pod Scheduler. It walks through the common scenarios and the configuration each one requires so that Pods and Slurm jobs can coexist without resource contention or scheduling failures.

Resource sharing rules

Before choosing a configuration, understand what can and cannot be shared:
  • Slurm manages GPUs exclusively. The SUNK Pod Scheduler converts Pod nvidia.com/gpu requests into Slurm GRES allocations (for example, gres:gpu:h100:4). Slurm assigns specific GPU indices to each job, so no two jobs receive the same physical GPU. Two SUNK-scheduled Pods from the same Slurm user can share a Node and use different GPUs with exclusive: "user".
  • CPU and memory can be shared between Kubernetes Pods and Slurm jobs, as long as the slurmd container resource requests are lowered. See Lower slurmd resource requests.
  • Pods must not use Guaranteed QoS with CPU resources. When a Pod's CPU requests equal its CPU limits, Kubernetes applies static CPU allocation and pins CPU cores to the Pod. The pinned cores contend with the CPUs that Slurm allocates to its own jobs, and this contention causes Slurm nodes to drain.
  • A Node can host both GPU and CPU-only workloads. For example, a Slurm job using all GPUs can coexist with a SUNK-scheduled CPU-only Pod, as long as CPU and memory are available.
Slurm has no visibility into Pods scheduled through the standard Kubernetes scheduler. Running non-DaemonSet Pods on Slurm Nodes without the SUNK Pod Scheduler can cause GPU allocation conflicts and resource contention.
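The Guaranteed QoS rule in the preceding list can be checked before you submit a Pod. The following Python sketch is a simplified check, not the kubelet's implementation: it inspects one container's resources mapping and reports whether the CPU request equals the CPU limit. It assumes that Kubernetes sets an omitted CPU request to the CPU limit. The function names parse_cpu and cpu_request_equals_limit, and the pinned example, are hypothetical.

```python
from fractions import Fraction


def parse_cpu(quantity):
    """Convert a Kubernetes CPU quantity such as "10" or "500m" to cores."""
    text = str(quantity).strip()
    if text.endswith("m"):
        return Fraction(int(text[:-1]), 1000)
    return Fraction(text)


def cpu_request_equals_limit(resources):
    """Return True when a container's CPU request equals its CPU limit.

    An omitted CPU request is treated as equal to the limit, because
    Kubernetes defaults the request to the limit in that case.
    """
    limits = resources.get("limits", {})
    requests = resources.get("requests", {})
    if "cpu" not in limits:
        return False  # No CPU limit, so the request cannot equal it.
    limit = parse_cpu(limits["cpu"])
    request = parse_cpu(requests.get("cpu", limits["cpu"]))
    return request == limit


# Resources from the Full-Node GPU Pod scenario: a CPU request, no CPU limit.
full_node_gpu_pod = {
    "limits": {"memory": "960Gi", "nvidia.com/gpu": "8"},
    "requests": {"cpu": "10", "memory": "10Gi", "nvidia.com/gpu": "8"},
}
# A hypothetical container whose CPU request equals its limit.
pinned = {"limits": {"cpu": "10000m"}, "requests": {"cpu": "10"}}

print(cpu_request_equals_limit(full_node_gpu_pod))  # False
print(cpu_request_equals_limit(pinned))             # True
```

The scenario examples on this page pass this check because they set a CPU request without a CPU limit.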

Common scenarios

Choose the scenario that matches your workload. Each includes the annotations and resource configuration you need.
The specific CPU, memory, and GPU values in these examples depend on your Node type and slurmd configuration. To find the right values for your cluster, see Check available resources.

GPU Node scenarios

The following scenarios cover Pods that run on Nodes with GPUs.

Full-Node GPU Pod

Use when: A single Pod uses all GPUs on the Node, or you want full Node isolation. Set exclusive: "none" to prevent any other Slurm job from sharing the Node:
annotations:
  sunk.coreweave.com/exclusive: "none"
Set your Pod’s resource requests low and use high limits, because the kubelet must fit both the Pod’s requests and the slurmd container’s requests within the Node’s allocatable resources:
resources:
  limits:
    memory: 960Gi       # High limit so the container can use full Node memory
    nvidia.com/gpu: "8"
  requests:
    cpu: "10"           # Low request to fit alongside slurmd
    memory: 10Gi        # Low request to fit alongside slurmd
    nvidia.com/gpu: "8"
Even with full-Node exclusive mode, the kubelet enforces Kubernetes resource accounting. If the slurmd container requests most of the Node’s memory (the default), the kubelet rejects a Pod with high memory requests and reports OutOfmemory. Either keep Pod requests low (as shown in the preceding example) or lower slurmd resource requests.

Multiple partial-GPU Pods sharing a Node

Use when: You want to run multiple Pods that each use a subset of a Node’s GPUs (for example, two Pods each using 4 of 8 GPUs). Create a dedicated Slurm user for Kubernetes Pods and set exclusive: "user". This lets SUNK-scheduled Pods share Nodes with each other while keeping other Slurm users off those Nodes.
annotations:
  sunk.coreweave.com/exclusive: "user"
  sunk.coreweave.com/user-id: "1001"  # Dedicated Slurm user for K8s Pods
resources:
  limits:
    memory: 480Gi
    nvidia.com/gpu: "4"
  requests:
    cpu: "10"           # Low request to fit alongside slurmd
    memory: 10Gi        # Low request to fit alongside slurmd
    nvidia.com/gpu: "4"
Resource configuration required: You must lower slurmd resource requests and enable memory tracking to prevent scheduling failures.

CPU-only Pod alongside GPU workloads

Use when: You need to run a CPU-only workload (monitoring agent, data preprocessing, sandbox environment) on a Node that also runs GPU workloads. Set exclusive: "ok" and do not request GPUs:
annotations:
  sunk.coreweave.com/exclusive: "ok"
resources:
  requests:
    cpu: "10"
    memory: 10Gi
  # No nvidia.com/gpu request
Because the Pod doesn’t request GPUs, no conflict occurs with Slurm’s GPU allocator. Resource configuration required: You must lower slurmd resource requests to make CPU and memory available for the Pod.
A CPU-only Pod with exclusive: "ok" can’t land on a Node where a Slurm job runs with full exclusive mode (--exclusive). If you need CPU-only Pods to coexist with GPU jobs, use exclusive: "user" on the GPU workloads instead.

CPU Node scenarios

The same exclusive annotation controls sharing on CPU-only Nodes. The difference is that GPU GRES allocation isn’t a factor, so the main concern is CPU and memory sharing.

Full-Node CPU Pod

Use when: A Pod needs the entire CPU Node with no other workloads.
annotations:
  sunk.coreweave.com/exclusive: "none"
resources:
  limits:
    memory: 256Gi       # Match your CPU Node's total memory
  requests:
    cpu: "10"
    memory: 10Gi
As with GPU Nodes, keep Pod requests low to fit alongside slurmd container requests.

Share a CPU Node between Pods and Slurm jobs

Use when: You want to run SUNK-scheduled Pods and Slurm jobs on the same CPU Node, sharing the CPU and memory pool. Use exclusive: "user" with a dedicated Slurm user so that SUNK Pods and Slurm jobs from the same user can coexist:
annotations:
  sunk.coreweave.com/exclusive: "user"
  sunk.coreweave.com/user-id: "1001"  # Dedicated Slurm user for K8s Pods
resources:
  requests:
    cpu: "10"
    memory: 10Gi
Alternatively, use exclusive: "ok" to allow sharing with any Slurm user:
annotations:
  sunk.coreweave.com/exclusive: "ok"
Resource configuration required: You must lower slurmd resource requests and enable memory tracking to prevent oversubscription on shared CPU Nodes.

Configure shared resources

The preceding scenarios reference two settings that you must adjust before SUNK-scheduled Pods can share a Node with Slurm jobs. Adjusting these settings ensures the kubelet has enough allocatable resources for the additional Pods. The following sections describe each setting in detail.

Check available resources

The resource values you use for Pod requests and slurmd configuration depend on your Node type. Different GPU and CPU Nodes have different amounts of allocatable CPU and memory. Before configuring resource sharing, check your Node’s capacity:
kubectl get node [NODE-NAME] -o jsonpath='cpu: {.status.allocatable.cpu}, memory: {.status.allocatable.memory}'
Then check what slurmd currently requests:
kubectl get pod [SLURMD-POD-NAME] -n [NAMESPACE] -o jsonpath='{.spec.containers[?(@.name=="slurmd")].resources.requests}'
The difference between the Node’s allocatable resources and the total requests from slurmd and other containers (such as sssd, munged, and user-lookup) determines how much room is available for SUNK-scheduled Pods.
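The subtraction described above can be sketched in Python. This is a minimal illustration, assuming you paste in the values returned by the two kubectl commands. The function names and all sample numbers are hypothetical, and the parser handles only the common quantity suffixes, not the full Kubernetes quantity grammar.

```python
# Binary and decimal suffixes used by Kubernetes memory quantities.
_MEMORY_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity):
    """Convert a memory quantity such as "10Gi" to bytes."""
    text = str(quantity).strip()
    for suffix in ("Ki", "Mi", "Gi", "Ti", "k", "M", "G", "T"):
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * _MEMORY_UNITS[suffix])
    return int(text)  # Plain byte count


def parse_cpu_millicores(quantity):
    """Convert a CPU quantity such as "10" or "100m" to millicores."""
    text = str(quantity).strip()
    if text.endswith("m"):
        return int(text[:-1])
    return int(float(text) * 1000)


def headroom(allocatable, container_requests):
    """Subtract the summed container requests from Node allocatable."""
    cpu = parse_cpu_millicores(allocatable["cpu"])
    memory = parse_memory(allocatable["memory"])
    for requests in container_requests:
        cpu -= parse_cpu_millicores(requests.get("cpu", "0"))
        memory -= parse_memory(requests.get("memory", "0"))
    return {"cpu_millicores": cpu, "memory_bytes": memory}


# Hypothetical values; replace them with the output of the kubectl commands.
allocatable = {"cpu": "127", "memory": "1000Gi"}
requests = [
    {"cpu": "10", "memory": "10Gi"},       # slurmd
    {"cpu": "100m", "memory": "128Mi"},    # a sidecar such as munged
]
free = headroom(allocatable, requests)
print(free["cpu_millicores"] / 1000, "CPUs free")
print(free["memory_bytes"] / 2**30, "GiB free")
```

A SUNK-scheduled Pod fits on the Node only if its own requests stay within this headroom.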

Lower slurmd resource requests

The default SUNK NodeSet configuration requests most of the Node’s CPU and memory for the slurmd container. This leaves little room for other Pods and causes OutOfcpu or OutOfmemory kubelet rejections. Lower the slurmd container’s requests while keeping limits high so Slurm jobs can still use the full Node. The specific values depend on your Node type, but a common starting point is to set requests to a small fraction of the Node’s total resources:
nodes:
  my-gpu-nodes:
    enabled: true
    definitions:
      - h100
    replicas: 4
    resources:
      requests:
        cpu: 10
        memory: 10Gi
      limits:
        memory: 1920Gi  # Set to your Node's total memory
This configuration reserves only 10 CPUs and 10Gi of memory for slurmd, freeing the rest for SUNK-scheduled Pods. The high memory limit ensures Slurm jobs can still use the full Node memory. Adjust the limits.memory value to match your Node type’s total memory.
If you skip this step, Pods that share Nodes with slurmd often fail to schedule. If the kubelet rejects your Pods with OutOfmemory or OutOfcpu, check these values first.
For an example manifest that changes slurmd resources, see Configure compute nodes.

Enable memory tracking in Slurm

By default, Slurm’s SelectTypeParameters is set to CR_Core, which does not track memory as a consumable resource. This lets multiple jobs oversubscribe memory, leading to out-of-memory (OOM) errors. When sharing Nodes, change SelectTypeParameters to a memory-aware value:
slurmConfig:
  SelectTypeParameters: CR_CPU_Memory
With this setting, Slurm tracks both CPU and memory when it places jobs, which prevents oversubscription. With these two settings in place, the kubelet and Slurm have a consistent view of available CPU and memory on each shared Node, and the kubelet can admit SUNK-scheduled Pods alongside Slurm jobs.
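The effect of memory tracking can be illustrated with a toy admission model in Python. This is not Slurm's selection algorithm: it only counts CPUs and, optionally, memory on a single Node, and the function name admit and all numbers are hypothetical.

```python
def admit(jobs, node_cpus, node_mem_gib, track_memory):
    """Greedily admit jobs in order and return the admitted jobs.

    Toy model only: CPUs are always counted, and memory is counted
    only when track_memory is True.
    """
    used_cpus = 0
    used_mem = 0
    admitted = []
    for cpus, mem_gib in jobs:
        if used_cpus + cpus > node_cpus:
            continue  # Not enough CPUs left
        if track_memory and used_mem + mem_gib > node_mem_gib:
            continue  # Not enough memory left
        used_cpus += cpus
        used_mem += mem_gib
        admitted.append((cpus, mem_gib))
    return admitted


# Two jobs that each need 8 CPUs and 80 GiB on a 64-CPU, 100 GiB Node.
jobs = [(8, 80), (8, 80)]

cpu_only = admit(jobs, node_cpus=64, node_mem_gib=100, track_memory=False)
with_memory = admit(jobs, node_cpus=64, node_mem_gib=100, track_memory=True)

print(len(cpu_only), sum(mem for _, mem in cpu_only))        # 2 160
print(len(with_memory), sum(mem for _, mem in with_memory))  # 1 80
```

Without memory tracking, the model places both jobs and commits 160 GiB on a 100 GiB Node. With memory tracking, the second job is held back.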

Exclusive annotation values

The sunk.coreweave.com/exclusive annotation (SUNK v5.7.0 and later) maps directly to Slurm’s --exclusive option. It accepts the following string values:
| Value | Slurm behavior | When to use |
| --- | --- | --- |
| "none" | The Node is allocated exclusively to this job. No other jobs can share the Node. | Full-Node GPU Pods, or when you want complete isolation. |
| "ok" | The job can share the Node with any other job, regardless of user or account. | CPU-only workloads sharing Nodes with GPU workloads. |
| "user" | The job can share the Node only with jobs from the same Slurm user. | Multiple SUNK Pods each using a subset of a Node’s GPUs. This is the recommended approach for partial-GPU Pods. |
| "mcs" | The job can share the Node only with jobs that have the same MCS (Multi-Category Security) label. | When using Slurm MCS labels to group workloads by tenant or project. |
| "topo" | Reserved for topology-based scheduling. | Consult CoreWeave support before using this value. |
The "none" value name can be misread. In Slurm’s --exclusive option, none means “exclusive mode is on, and the sharing override is none,” meaning Slurm allows no sharing. It does not mean “no exclusivity.”
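The sharing rules for these values can be modeled as a small Python predicate. This is a sketch of the documented behavior only, not Slurm's implementation. It assumes two jobs can share a Node only when each job's exclusive value tolerates the other, and it omits "topo". The Job class and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Job:
    exclusive: str                    # "none", "ok", "user", or "mcs"
    user: str                         # Slurm user ID
    mcs_label: Optional[str] = None   # Only relevant for "mcs"


def permits(job, other):
    """Return True if job's exclusive value tolerates other on its Node."""
    if job.exclusive == "none":
        return False
    if job.exclusive == "ok":
        return True
    if job.exclusive == "user":
        return job.user == other.user
    if job.exclusive == "mcs":
        return job.mcs_label is not None and job.mcs_label == other.mcs_label
    raise ValueError(f"unsupported exclusive value: {job.exclusive!r}")


def can_share(a, b):
    """Assumption: sharing requires that each job tolerates the other."""
    return permits(a, b) and permits(b, a)


pod_a = Job(exclusive="user", user="1001")
pod_b = Job(exclusive="user", user="1001")
other_user = Job(exclusive="user", user="2002")
cpu_pod = Job(exclusive="ok", user="3003")
full_node = Job(exclusive="none", user="1001")

print(can_share(pod_a, pod_b))        # True: same Slurm user
print(can_share(pod_a, other_user))   # False: different Slurm users
print(can_share(cpu_pod, full_node))  # False: "none" allows no sharing
```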

GPU allocation

Slurm manages GPU allocation exclusively through GRES. The SUNK Pod Scheduler converts Pod GPU requests into Slurm GRES allocations, and Slurm assigns specific GPU indices to each job. Don’t schedule GPU Pods through the standard Kubernetes scheduler on Slurm Nodes, because this bypasses Slurm’s GRES tracking and causes GPU conflicts. When scheduling GPU Pods, choose one of these approaches:
  • Full-Node exclusive with exclusive: "none" when the Pod uses all GPUs.
  • Per-user exclusive with exclusive: "user" and a dedicated user-id when multiple Pods each use a subset of GPUs. This is the recommended approach for partial-GPU workloads.
  • Slurm reservation with the sunk.coreweave.com/reservation annotation to dedicate specific Nodes for Kubernetes Pods.
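The idea that each job receives its own GPU indices can be illustrated with a toy allocator in Python. This sketch does not reproduce Slurm's GRES logic. Picking the lowest free indices is an assumption made for illustration, and the function name allocate_gpus is hypothetical.

```python
def allocate_gpus(free_indices, count):
    """Take count GPU indices from free_indices, or return None if short."""
    if count > len(free_indices):
        return None
    chosen = sorted(free_indices)[:count]
    free_indices.difference_update(chosen)  # Mark the indices as in use
    return chosen


free = set(range(8))             # One Node with 8 GPUs
pod_a = allocate_gpus(free, 4)
pod_b = allocate_gpus(free, 4)
pod_c = allocate_gpus(free, 1)   # No GPUs left on this Node

print(pod_a)  # [0, 1, 2, 3]
print(pod_b)  # [4, 5, 6, 7]
print(pod_c)  # None
```

Because every allocation goes through one tracker, the two Pods never receive the same index. A GPU Pod placed by the standard Kubernetes scheduler would bypass that tracker, which is why it can collide with a Slurm job.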

Reserve resources for system Pods

Slurm doesn’t account for resources consumed by DaemonSets, the slurmd container itself, or other system Pods. As a result, Slurm may allocate resources that appear available from Slurm’s perspective but are already consumed from Kubernetes’ perspective, which causes kubelet rejections. To account for this overhead:
  • Ensure slurmd resource requests are set low enough to leave room for SUNK-scheduled Pods.
  • Be conservative with resource requests when packing multiple workloads onto a single Node.
  • Set SelectTypeParameters to a memory-aware value (such as CR_CPU_Memory) so that Slurm tracks memory as a consumable resource.

Scale-up and scale-down behavior

When you use the SUNK Pod Scheduler on a cluster that autoscales, the following timing and Pod-placement behaviors affect how shared Nodes fill and drain.

Scale-up

When new Nodes join the cluster, Slurm’s configuration takes about a minute to include them. During this window, the new Nodes aren’t available for scheduling.
Slurm also doesn’t bin-pack workloads. When Slurm schedules new Pods, it selects Nodes based on its internal bitmap ordering, which may spread Pods across multiple partially used Nodes instead of filling one Node before moving to the next. This can lead to GPU fragmentation.

Scale-down

When Nodes are removed, Kubernetes selects which Pods to terminate. Kubernetes doesn’t coordinate this selection with Slurm, so Kubernetes may terminate Pods across multiple Nodes rather than fully drain a single Node. Combined with the lack of bin-packing, this can leave Nodes with partially used GPU allocations that can’t be reclaimed.

Impact on exclusive mode

Fragmentation has a larger impact with exclusive: "user" or exclusive: "none". In per-user exclusive mode, a Node with even one remaining Pod can’t accept Slurm jobs from other users, which makes unused GPUs on that Node inaccessible. Plan your scaling strategy accordingly.
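To estimate how much capacity fragmentation strands under per-user exclusive mode, you can count the free GPUs on Nodes that still host at least one Pod. The following Python sketch assumes 8 GPUs per Node. The Node names, Pod sizes, and the function name stranded_gpus are hypothetical.

```python
def stranded_gpus(nodes, gpus_per_node=8):
    """Count free GPUs on Nodes that still host at least one Pod.

    nodes maps a Node name to the list of GPU counts of its Pods.
    Under exclusive: "user", other Slurm users cannot reach these GPUs.
    """
    stranded = 0
    for pod_gpus in nodes.values():
        used = sum(pod_gpus)
        if pod_gpus and used < gpus_per_node:
            stranded += gpus_per_node - used
    return stranded


# Four 4-GPU Pods spread across four Nodes, versus packed onto two Nodes.
spread = {"node-a": [4], "node-b": [4], "node-c": [4], "node-d": [4]}
packed = {"node-a": [4, 4], "node-b": [4, 4], "node-c": [], "node-d": []}

print(stranded_gpus(spread))  # 16
print(stranded_gpus(packed))  # 0
```

The same four Pods strand 16 GPUs when spread and none when packed, which shows why placement order matters for shared Nodes.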
Last modified on May 27, 2026