When Kubernetes Pods and Slurm jobs run on the same Kubernetes Nodes, you need to configure how they share GPU, CPU, and memory resources. This page walks through the common scenarios and the configuration each one requires.

Resource sharing rules

Before choosing a configuration, understand what can and cannot be shared:
  • GPUs are managed exclusively by Slurm. The SUNK Pod Scheduler converts Pod nvidia.com/gpu requests into Slurm GRES allocations (for example, gres:gpu:h100:4). Slurm assigns specific GPU indices to each job, so no two jobs receive the same physical GPU. Two SUNK-scheduled Pods from the same Slurm user can share a Node and use different GPUs with exclusive: "user".
  • CPU and memory can be shared between Kubernetes Pods and Slurm jobs, as long as the slurmd container resource requests are lowered. See Lower slurmd resource requests.
  • Pods must not use Guaranteed QoS with CPU resources. When a Pod’s CPU requests equal its CPU limits, Kubernetes applies static CPU allocation, which pins CPU cores to the Pod and causes resource contention in Slurm. This ultimately causes Slurm nodes to drain.
  • A Node can host both GPU and CPU-only workloads. For example, a Slurm job using all GPUs can coexist with a SUNK-scheduled CPU-only Pod, as long as CPU and memory are available.
Slurm has no visibility into Pods scheduled through the standard Kubernetes scheduler. Running non-DaemonSet Pods on Slurm Nodes without the SUNK Pod Scheduler causes GPU allocation conflicts and resource contention.
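The CPU rule above can be checked before a manifest is applied. The following is a minimal sketch, assuming each container's resources block has been loaded into a plain dictionary; the helper names parse_cpu and cpu_request_equals_limit are hypothetical and are not part of SUNK or Kubernetes. It checks only the CPU condition described above, not the kubelet's full QoS classification.

```python
from fractions import Fraction


def parse_cpu(quantity):
    """Convert a Kubernetes CPU quantity ("10", "500m", 2) to cores."""
    text = str(quantity)
    if text.endswith("m"):
        # Millicores: "500m" is half a core.
        return Fraction(int(text[:-1]), 1000)
    return Fraction(text)


def cpu_request_equals_limit(resources):
    """Return True when a container's CPU request equals its CPU limit.

    Kubernetes defaults a missing CPU request to the CPU limit, so a
    container that sets only limits.cpu is treated as equal too.
    """
    limits = resources.get("limits", {})
    requests = resources.get("requests", {})
    if "cpu" not in limits:
        return False  # No CPU limit, so request and limit cannot match.
    request = requests.get("cpu", limits["cpu"])
    return parse_cpu(request) == parse_cpu(limits["cpu"])


# Pattern used throughout this page: CPU request, no CPU limit.
safe = {"requests": {"cpu": "10", "memory": "10Gi"},
        "limits": {"memory": "960Gi"}}
# Pattern to avoid: CPU request equal to CPU limit.
risky = {"requests": {"cpu": "10"}, "limits": {"cpu": "10"}}

print(cpu_request_equals_limit(safe))   # False
print(cpu_request_equals_limit(risky))  # True
```

Running a check like this over every container in a Pod spec is one way to catch the risky pattern in review, before the Pod reaches a Slurm Node.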

Common scenarios

Choose the scenario that matches your workload. Each includes the annotations and resource configuration you need.
The specific CPU, memory, and GPU values in these examples depend on your Node type and slurmd configuration. To find the right values for your cluster, see Check available resources.

GPU Node scenarios

Full-Node GPU Pod

Use when: A single Pod uses all GPUs on the Node, or you want full Node isolation. Set exclusive: "none" to prevent any other Slurm job from sharing the Node:
annotations:
  sunk.coreweave.com/exclusive: "none"
Set your Pod’s resource requests low and use high limits, because the kubelet must fit both the Pod’s requests and the slurmd container’s requests within the Node’s allocatable resources:
resources:
  limits:
    memory: 960Gi       # High limit so the container can use full Node memory
    nvidia.com/gpu: "8"
  requests:
    cpu: "10"           # Low request to fit alongside slurmd
    memory: 10Gi        # Low request to fit alongside slurmd
    nvidia.com/gpu: "8"
Even with full-Node exclusive mode, the kubelet enforces Kubernetes resource accounting. If the slurmd container requests most of the Node’s memory (the default), a Pod with high memory requests will be rejected with OutOfmemory. Either keep Pod requests low (as shown above) or lower slurmd resource requests.

Multiple partial-GPU Pods sharing a Node

Use when: You want to run multiple Pods that each use a subset of a Node’s GPUs (for example, two Pods each using 4 of 8 GPUs). Create a dedicated Slurm user for Kubernetes Pods and set exclusive: "user". This allows SUNK-scheduled Pods to share Nodes with each other while keeping other Slurm users off those Nodes.
annotations:
  sunk.coreweave.com/exclusive: "user"
  sunk.coreweave.com/user-id: "1001"  # Dedicated Slurm user for K8s Pods
resources:
  limits:
    memory: 480Gi
    nvidia.com/gpu: "4"
  requests:
    cpu: "10"           # Low request to fit alongside slurmd
    memory: 10Gi        # Low request to fit alongside slurmd
    nvidia.com/gpu: "4"
Resource configuration required: You must lower slurmd resource requests and enable memory tracking to prevent scheduling failures.

CPU-only Pod alongside GPU workloads

Use when: You need to run a CPU-only workload (monitoring agent, data preprocessing, sandbox environment) on a Node that also runs GPU workloads. Set exclusive: "ok" and do not request GPUs:
annotations:
  sunk.coreweave.com/exclusive: "ok"
resources:
  requests:
    cpu: "10"
    memory: 10Gi
  # No nvidia.com/gpu request
Because the Pod does not request GPUs, there is no conflict with Slurm’s GPU allocator. Resource configuration required: You must lower slurmd resource requests to make CPU and memory available for the Pod.
A CPU-only Pod with exclusive: "ok" cannot land on a Node where a Slurm job is running with full exclusive mode (--exclusive). If you need CPU-only Pods to coexist with GPU jobs, use exclusive: "user" on the GPU workloads instead.

CPU Node scenarios

The same exclusive annotation controls sharing on CPU-only Nodes. The difference is that GPU GRES allocation is not a factor, so the main concern is CPU and memory sharing.

Full-Node CPU Pod

Use when: A Pod needs the entire CPU Node with no other workloads.
annotations:
  sunk.coreweave.com/exclusive: "none"
resources:
  limits:
    memory: 256Gi       # Match your CPU Node's total memory
  requests:
    cpu: "10"
    memory: 10Gi
As with GPU Nodes, keep Pod requests low to fit alongside slurmd container requests.

Sharing a CPU Node between Pods and Slurm jobs

Use when: You want to run SUNK-scheduled Pods and Slurm jobs on the same CPU Node, sharing the CPU and memory pool. Use exclusive: "user" with a dedicated Slurm user so that SUNK Pods and Slurm jobs from the same user can coexist:
annotations:
  sunk.coreweave.com/exclusive: "user"
  sunk.coreweave.com/user-id: "1001"  # Dedicated Slurm user for K8s Pods
resources:
  requests:
    cpu: "10"
    memory: 10Gi
Alternatively, use exclusive: "ok" to allow sharing with any Slurm user:
annotations:
  sunk.coreweave.com/exclusive: "ok"
Resource configuration required: You must lower slurmd resource requests and enable memory tracking to prevent oversubscription on shared CPU Nodes.

Configure shared resources

When Pods share a Node with Slurm jobs (the partial-GPU and CPU-only scenarios above), you need to adjust two settings so that the kubelet has enough allocatable resources for the additional Pods.

Check available resources

The resource values you use for Pod requests and slurmd configuration depend on your Node type. Different GPU and CPU Nodes have different amounts of allocatable CPU and memory. Before configuring resource sharing, check your Node’s capacity:
kubectl get node [NODE-NAME] -o jsonpath='cpu: {.status.allocatable.cpu}, memory: {.status.allocatable.memory}'
Then check what slurmd currently requests:
kubectl get pod [SLURMD-POD-NAME] -n [NAMESPACE] -o jsonpath='{.spec.containers[?(@.name=="slurmd")].resources.requests}'
The difference between the Node’s allocatable resources and the total requests from slurmd and other containers (such as sssd, munged, and user-lookup) determines how much room is available for SUNK-scheduled Pods.
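The subtraction described above can be sketched as follows. This is an illustrative calculation only: the function names are hypothetical, the sample numbers are made up rather than taken from a real Node, and the parser handles only plain numbers, millicores, and binary suffixes (Ki, Mi, Gi, Ti).

```python
from fractions import Fraction

BINARY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_cpu(quantity):
    """Convert a CPU quantity such as "128" or "500m" to cores."""
    text = str(quantity)
    if text.endswith("m"):
        return Fraction(int(text[:-1]), 1000)
    return Fraction(text)


def parse_memory(quantity):
    """Convert a memory quantity such as "10Gi" or "1048576Ki" to bytes."""
    text = str(quantity)
    for suffix, factor in BINARY_UNITS.items():
        if text.endswith(suffix):
            return int(text[:-len(suffix)]) * factor
    return int(text)  # Plain bytes; decimal suffixes are not handled here.


def headroom(allocatable, container_requests):
    """Return (free cores, free GiB) after subtracting existing requests."""
    cpu = parse_cpu(allocatable["cpu"]) - sum(
        parse_cpu(r.get("cpu", "0")) for r in container_requests)
    memory = parse_memory(allocatable["memory"]) - sum(
        parse_memory(r.get("memory", "0")) for r in container_requests)
    return float(cpu), memory / 2**30


# Hypothetical values: substitute the output of the kubectl commands above.
allocatable = {"cpu": "128", "memory": "960Gi"}
requests = [
    {"cpu": "10", "memory": "10Gi"},   # slurmd
    {"cpu": "500m", "memory": "1Gi"},  # sidecars such as sssd or munged
]
print(headroom(allocatable, requests))  # (117.5, 949.0)
```

The result is an upper bound on the total requests of SUNK-scheduled Pods on that Node. DaemonSet Pods on the same Node also count against allocatable resources, so include their requests in the list.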

Lower slurmd resource requests

The default SUNK NodeSet configuration requests most of the Node’s CPU and memory for the slurmd container. This leaves little room for other Pods and causes OutOfcpu or OutOfmemory kubelet rejections. Lower the slurmd container’s requests while keeping limits high so Slurm jobs can still use the full Node. The specific values depend on your Node type, but a common starting point is to set requests to a small fraction of the Node’s total resources:
nodes:
  my-gpu-nodes:
    enabled: true
    definitions:
      - h100
    replicas: 4
    resources:
      requests:
        cpu: 10
        memory: 10Gi
      limits:
        memory: 1920Gi  # Set to your Node's total memory
This tells Kubernetes that slurmd needs only 10 CPUs and 10Gi of memory reserved, freeing the rest for SUNK-scheduled Pods. The high memory limit ensures Slurm jobs can still use the full Node memory. Adjust the limits.memory value to match your Node type’s total memory.
Skipping this step is the most common cause of Pod scheduling failures when sharing Nodes. If your Pods are rejected with OutOfmemory or OutOfcpu, check these values first.
To see an example manifest for changing slurmd resources, see Configure Compute nodes.

Enable memory tracking in Slurm

By default, Slurm’s SelectTypeParameters is set to CR_Core, which does not track memory as a consumable resource. This allows multiple jobs to oversubscribe memory, leading to out-of-memory (OOM) errors. When sharing Nodes, change SelectTypeParameters to a memory-aware value:
slurmConfig:
  SelectTypeParameters: CR_CPU_Memory
This tells Slurm to track both CPU and memory when placing jobs, preventing oversubscription.
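To see why memory tracking matters, consider a toy placement check. This is a deliberately simplified model written for illustration; it is not how Slurm's select plugin is implemented, and the Node and job sizes are hypothetical.

```python
def can_place(node, running_jobs, new_job, track_memory):
    """Toy placement check: CPUs are always counted, memory only if tracked."""
    used_cpus = sum(job["cpus"] for job in running_jobs)
    if used_cpus + new_job["cpus"] > node["cpus"]:
        return False
    if track_memory:
        used_mem = sum(job["mem_gib"] for job in running_jobs)
        if used_mem + new_job["mem_gib"] > node["mem_gib"]:
            return False
    return True


node = {"cpus": 128, "mem_gib": 960}
running = [{"cpus": 32, "mem_gib": 600}]
incoming = {"cpus": 32, "mem_gib": 600}

# Without memory tracking, the second job is placed even though
# 600 + 600 GiB exceeds the Node's 960 GiB.
print(can_place(node, running, incoming, track_memory=False))  # True

# With memory tracking, the same job is held back.
print(can_place(node, running, incoming, track_memory=True))   # False
```

In the first case both jobs land on the Node and compete for memory at runtime, which is the oversubscription this setting is meant to prevent.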

Exclusive annotation values

The sunk.coreweave.com/exclusive annotation (SUNK v5.7.0 and later) maps directly to Slurm’s --exclusive option. It accepts the following string values:
| Value | Slurm behavior | When to use |
| --- | --- | --- |
| "none" | The Node is allocated exclusively to this job. No other jobs can share the Node. | Full-Node GPU Pods, or when you want complete isolation. |
| "ok" | The job can share the Node with any other job, regardless of user or account. | CPU-only workloads sharing Nodes with GPU workloads. |
| "user" | The job can share the Node only with jobs from the same Slurm user. | Multiple SUNK Pods each using a subset of a Node’s GPUs. This is the recommended approach for partial-GPU Pods. |
| "mcs" | The job can share the Node only with jobs that have the same MCS (Multi-Category Security) label. | When using Slurm MCS labels to group workloads by tenant or project. |
| "topo" | Reserved for topology-based scheduling. | Consult CoreWeave support before using this value. |
The "none" value name can be confusing. In Slurm’s --exclusive option, none means “exclusive mode is on, and the sharing override is none,” meaning no sharing is allowed. It does not mean “no exclusivity.”
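The sharing rules for these values can be summarized as a small predicate. The sketch below is one reading of the descriptions on this page, not Slurm's actual scheduling logic; the Job class and can_share_node function are hypothetical names, and "topo" is left unmodeled because its behavior is not described here.

```python
from dataclasses import dataclass
from typing import Optional

MODELED_VALUES = {"none", "ok", "user", "mcs"}


@dataclass(frozen=True)
class Job:
    exclusive: str                   # "none", "ok", "user", or "mcs"
    user_id: int                     # Slurm user that owns the job
    mcs_label: Optional[str] = None  # Only relevant for "mcs"


def can_share_node(a, b):
    """Return True if jobs a and b may run on the same Node (simplified)."""
    for job in (a, b):
        if job.exclusive not in MODELED_VALUES:
            raise ValueError(f"unmodeled exclusive value: {job.exclusive!r}")
    modes = (a.exclusive, b.exclusive)
    if "none" in modes:
        return False  # "none" means no sharing at all.
    if "user" in modes and a.user_id != b.user_id:
        return False  # "user" allows only the same Slurm user.
    if "mcs" in modes and (a.mcs_label is None or a.mcs_label != b.mcs_label):
        return False  # "mcs" allows only matching MCS labels.
    return True


pod_a = Job("user", 1001)
pod_b = Job("user", 1001)
other = Job("ok", 2002)

print(can_share_node(pod_a, pod_b))             # True: same dedicated user
print(can_share_node(pod_a, other))             # False: different user
print(can_share_node(Job("none", 1001), pod_b)) # False: "none" blocks sharing
```

The third call shows the naming pitfall described above: "none" is the most restrictive value, not the least.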

GPU allocation

GPU allocation is managed exclusively by Slurm through GRES. The SUNK Pod Scheduler converts Pod GPU requests into Slurm GRES allocations, and Slurm assigns specific GPU indices to each job. Do not schedule GPU Pods through the standard Kubernetes scheduler on Slurm Nodes, as this bypasses Slurm’s GRES tracking and causes GPU conflicts. When scheduling GPU Pods, choose one of these approaches:
  • Full-Node exclusive with exclusive: "none" when the Pod uses all GPUs.
  • Per-user exclusive with exclusive: "user" and a dedicated user-id when multiple Pods each use a subset of GPUs. This is the recommended approach for partial-GPU workloads.
  • Slurm reservation with the sunk.coreweave.com/reservation annotation to dedicate specific Nodes for Kubernetes Pods.

Reserve resources for system Pods

Slurm does not see resources consumed by DaemonSets, the slurmd container itself, or other system Pods. This means Slurm may allocate resources that appear available from Slurm’s perspective but are already consumed from Kubernetes’ perspective, causing kubelet rejections. To account for this overhead:
  1. Ensure slurmd resource requests are set low enough to leave room for SUNK-scheduled Pods.
  2. Be conservative with resource requests when packing multiple workloads onto a single Node.
  3. Set SelectTypeParameters to a memory-aware value (such as CR_CPU_Memory) so Slurm tracks memory as a consumable resource.

Scale-up and scale-down behavior

When using the SUNK Pod Scheduler with autoscaling, be aware of the following behaviors:

Scale-up

When new Nodes join the cluster, there is an approximately one-minute delay before Slurm’s configuration includes them. During this window, the new Nodes are not available for scheduling.
Slurm does not bin-pack workloads. When scheduling new Pods, Slurm selects Nodes based on its internal bitmap ordering, which may spread Pods across multiple partially used Nodes instead of filling one Node before moving to the next. This can lead to GPU fragmentation.

Scale-down

When Nodes are removed, Kubernetes selects which Pods to terminate. This selection is not coordinated with Slurm, so Pods may be terminated across multiple Nodes rather than fully draining a single Node. Combined with the lack of bin-packing, this can leave Nodes with partially used GPU allocations that cannot be reclaimed.

Impact on exclusive mode

Fragmentation has the largest effect with exclusive: "user" or exclusive: "none". In per-user exclusive mode, a Node with even one remaining Pod cannot accept Slurm jobs from other users, so the unused GPUs on that Node are inaccessible to them. Plan your scaling strategy accordingly.
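A quick way to estimate the cost of fragmentation is to count the free GPUs on Nodes that still host at least one Pod. The sketch below assumes per-user exclusive mode and uses hypothetical per-Node usage numbers; stranded_gpus is an illustrative helper, not a SUNK or Slurm command.

```python
def stranded_gpus(gpus_per_node, used_gpus_by_node):
    """Count free GPUs on Nodes that still host at least one Pod.

    Under exclusive: "user", these GPUs are unavailable to other Slurm
    users until the remaining Pods on the Node finish.
    """
    return sum(gpus_per_node - used
               for used in used_gpus_by_node
               if used > 0)


# Four 8-GPU Nodes after an uncoordinated scale-down:
# Pods hold 4, 0, 1, and 8 GPUs respectively.
usage = [4, 0, 1, 8]
print(stranded_gpus(8, usage))  # 11
```

Here 11 GPUs are idle but held back by just two partially used Nodes. Only the empty Node is fully available to other users.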
Last modified on April 20, 2026