The SUNK Pod Scheduler lets you run Kubernetes Pods alongside Slurm jobs in the same cluster. Instead of managing separate clusters for different workload types, you can run training with Slurm and inference with Kubernetes on shared Nodes. When a Pod is assigned to the SUNK Pod Scheduler, the scheduler creates a placeholder Slurm job so that Slurm manages Node placement, GPU allocation, and resource tracking for the Pod. This page explains how to enable the scheduler, configure a Pod to use it, and choose a Node sharing strategy, and it lists the annotations and limitations you should know about. It’s intended for cluster operators and workload owners who want to consolidate Slurm and Kubernetes workloads on shared infrastructure.

Enable the scheduler

Before you can submit Pods to the SUNK Pod Scheduler, you must enable it in the Slurm Helm chart and tell it which namespaces to watch. Set scheduler.enabled to true in your Slurm Helm chart values:
scheduler:
  enabled: true
Configure the scheduler’s scope to watch the namespaces where you create your Pods:
  • To monitor the entire cluster, set scheduler.scope.type to cluster.
  • To monitor specific namespaces, set scheduler.scope.type to namespace and set scheduler.scope.namespaces to a comma-separated list of namespaces. If scheduler.scope.namespaces is blank, the scheduler defaults to the namespace where it’s deployed.
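As a rough sketch, the scope settings described above might look like this in the Helm values when the scheduler watches specific namespaces. The nesting is inferred from the dotted key names, the namespace names `team-a` and `team-b` are hypothetical, and writing the list as a single comma-separated string is an assumption based on the wording above. Check your chart's values file for the exact shape.

```yaml
# Sketch of Slurm Helm chart values; nesting inferred from the dotted keys above.
scheduler:
  enabled: true
  scope:
    # Use "cluster" here to watch every namespace instead.
    type: namespace
    # Hypothetical namespaces, written as a comma-separated string (assumed format).
    namespaces: "team-a,team-b"
```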

Configure a Pod for scheduling

With the scheduler enabled, each Pod that targets it needs three pieces of configuration: the scheduler name, resource requests, and a valid termination grace period. The following sections describe each requirement.

Set the scheduler name

In your Pod’s spec, set schedulerName to the name of your SUNK Pod Scheduler so Kubernetes routes the Pod to SUNK instead of the default scheduler. The default name is <namespace>-<releaseName>, but you can set it manually with scheduler.name in your Helm values.
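The two fragments below sketch where these settings live. The first is part of a Pod manifest and reuses the scheduler name that appears in this page's examples. The second is part of the Helm values, and `my-sunk-scheduler` is a hypothetical name chosen only for illustration.

```yaml
# Pod manifest fragment: route the Pod to the SUNK Pod Scheduler.
spec:
  # Name reused from this page's examples; replace it with your scheduler's name.
  schedulerName: tenant-slurm-staging-slurm-scheduler
---
# Helm values fragment: optional override of the default scheduler name.
scheduler:
  # Hypothetical name, for illustration only.
  name: my-sunk-scheduler
```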

Set resource requests

Your Pod must request specific CPU and memory resources so Slurm can account for the Pod against Node capacity. If these resource requests are missing or zero, the Pod fails to schedule.
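A minimal container fragment with explicit, non-zero requests could look like the following. The container name, image, and quantities are illustrative assumptions, not recommended values.

```yaml
# Pod spec fragment: every value here is illustrative.
containers:
  - name: main
    image: ubuntu
    resources:
      requests:
        cpu: "4"       # non-zero CPU request
        memory: 16Gi   # non-zero memory request
```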

Set the termination grace period

Set terminationGracePeriodSeconds to a value strictly less than the scheduler’s --slurm-kill-wait minus 5 seconds. This keeps Kubernetes Pod termination aligned with Slurm’s job termination window so the placeholder job and the Pod tear down cleanly. For example, if --slurm-kill-wait=30s (the default), then terminationGracePeriodSeconds must be less than 25.
The Kubernetes default terminationGracePeriodSeconds is 30 seconds, which exceeds the default threshold of 25 seconds. You must explicitly set this value on your Pod spec. A typical value is 5 to 20.
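The arithmetic above can be captured directly in the Pod spec. This fragment assumes the default `--slurm-kill-wait=30s`. If your scheduler uses a different value, recompute the limit.

```yaml
# Pod spec fragment.
# With --slurm-kill-wait=30s the limit is 30 - 5 = 25 seconds,
# so any value strictly below 25 is valid.
spec:
  terminationGracePeriodSeconds: 20
```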

Look up the scheduler configuration

If you don’t know the scheduler name or the current --slurm-kill-wait value, read both from the arguments of the running scheduler Pod. Run this command in the namespace where the scheduler is deployed:
kubectl get pods -l app.kubernetes.io/name=sunk-scheduler -oyaml | yq '.items[0].spec.containers[] | select(.name == "scheduler").args'
Look for --scheduler-name and --slurm-kill-wait in the output:
- --scheduler-name=tenant-slurm-staging-slurm-scheduler
- --slurm-kill-wait=30s

Example Pod

This example shows a full-node GPU Pod with all three required settings:
apiVersion: v1
kind: Pod
metadata:
  name: slurm-test
  namespace: tenant-slurm-staging
  annotations:
    sunk.coreweave.com/account: root
    sunk.coreweave.com/comment: "A Kubernetes job"
    sunk.coreweave.com/exclusive: "none"
spec:
  schedulerName: tenant-slurm-staging-slurm-scheduler
  containers:
    - name: main
      image: ubuntu
      command:
        - /bin/bash
      args:
        - -c
        - sleep infinity
      resources:
        limits:
          memory: 960Gi
          rdma/ib: "1"
          nvidia.com/gpu: "8"
        requests:
          cpu: "10"
          memory: 10Gi
          nvidia.com/gpu: "8"
  terminationGracePeriodSeconds: 5
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu.nvidia.com/model
                operator: In
                values:
                  - H100_NVLINK_80GB
For GPU Pods, set a Node affinity for gpu.nvidia.com/model to target the GPU type you need. The CPU, memory, and GPU values in this example are illustrative. Adjust them based on your Node type and slurmd configuration. For guidance, see Check available resources.

Choose a Node sharing strategy

After the Pod has the required fields, decide how it shares Nodes with other workloads. The sunk.coreweave.com/exclusive annotation controls this behavior, and the right value depends on your workload:
| Workload type | Annotation | Effect |
| --- | --- | --- |
| Pod needs the entire Node (all GPUs, or full CPU Node) | exclusive: "none" | Full Node isolation. No other jobs can share the Node. |
| Multiple Pods or jobs share a Node (partial GPUs, or CPU pool) | exclusive: "user" | Jobs from the same Slurm user can share the Node. |
| Pod can share a Node with any workload | exclusive: "ok" | Open sharing with any user or account. |
For detailed guidance on each scenario, including resource configuration and examples, see Manage resources with the SUNK Pod Scheduler.
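For instance, a Pod that uses only part of a Node might carry the annotation as sketched below. This fragment shows the annotation only. The matching resource requests depend on your Node type and are covered in the linked guide.

```yaml
# Pod metadata fragment for a Pod that uses only part of a Node.
metadata:
  annotations:
    # String form (v5.7.0+): jobs from the same Slurm user can share the Node.
    sunk.coreweave.com/exclusive: "user"
```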

Annotations reference

These annotations configure Slurm job parameters for SUNK-scheduled Pods. All annotations are optional.
| Annotation | Type | Description | Example |
| --- | --- | --- | --- |
| sunk.coreweave.com/account | string | Slurm account for the job. List accounts with sacctmgr list account. | "root" |
| sunk.coreweave.com/comment | string | Descriptive comment to add to the Slurm job. | "A Kubernetes job" |
| sunk.coreweave.com/constraints | string | Node constraints for job placement. See the Slurm sbatch documentation for syntax. | |
| sunk.coreweave.com/exclusive (v5.7.0+) | string | Controls Node sharing. See Exclusive annotation values. | "none", "ok", "user" |
| sunk.coreweave.com/exclusive (before v5.7.0) | boolean | Enables exclusive mode. See the Slurm exclusive documentation. | "true" |
| sunk.coreweave.com/excluded-nodes | string | Comma-separated Slurm node names to exclude. | "a100-092-01,a100-092-02" |
| sunk.coreweave.com/required-nodes | string | Comma-separated Slurm node names required. | "a100-092-04,a100-092-06" |
| sunk.coreweave.com/partition | string | Slurm partition to submit the job to. | "default" |
| sunk.coreweave.com/prefer | string | Node preferences. See Slurm prefer documentation. | |
| sunk.coreweave.com/priority | integer | Job priority value. Must be an integer. | "100" |
| sunk.coreweave.com/qos | string | Quality of Service level. | "normal" |
| sunk.coreweave.com/reservation | string | Slurm reservation name. | "test" |
| sunk.coreweave.com/user-id | integer | User ID to run the job as. | "1001" |
| sunk.coreweave.com/group-id | integer | Group ID to run the job as. | "1002" |
| sunk.coreweave.com/current-working-dir | string | Working directory path for the job. | "/path/to/use" |
| sunk.coreweave.com/timeout | integer | Time limit in minutes for the Slurm placeholder job. | "123" |
  • Invalid annotation values cause the Pod to fail scheduling.
  • Changes to annotations after a Pod is scheduled don’t affect the running job.
  • The Slurm job defaults to the root user (uid=0, gid=0). If you set user-id without group-id, the group ID is set to the user ID value.
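The fragment below sketches how several of these annotations might be combined on one Pod, using the example values from the table. Kubernetes stores annotation values as strings, so integer-typed annotations such as priority, user-id, and group-id are quoted.

```yaml
# Pod metadata fragment; values are the illustrative examples from the table above.
metadata:
  annotations:
    sunk.coreweave.com/partition: "default"
    sunk.coreweave.com/qos: "normal"
    sunk.coreweave.com/priority: "100"   # integer value, quoted as a string
    # Setting both IDs avoids the fallback where group-id takes the user-id value.
    sunk.coreweave.com/user-id: "1001"
    sunk.coreweave.com/group-id: "1002"
```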

Time limits

The sunk.coreweave.com/timeout annotation sets a time limit, in minutes, on the Slurm placeholder job. For a SUNK-scheduled Pod, the limit starts counting when the scheduler places the Pod onto a Node, so it includes Pod initialization time before the container starts. This is earlier than for a regular Slurm job, where the time limit starts after the prolog script completes.
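As a small sketch, a two-hour limit could be written as follows. The value 120 is illustrative. Leave headroom for Pod initialization, since that time counts against the limit.

```yaml
# Pod metadata fragment.
metadata:
  annotations:
    # 120 minutes (illustrative), counted from when the Pod is placed on a Node.
    sunk.coreweave.com/timeout: "120"
```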

Known limitations

Before relying on the SUNK Pod Scheduler for production workloads, review the following constraints so you can plan around them.
  • No gang scheduling. The scheduler schedules each Pod as a separate Slurm job. Multi-node PodGroups aren’t supported. The scheduler is best suited for single-node workloads such as inference.
  • No bin-packing. Slurm doesn’t fill partially-used Nodes before moving to idle ones. This can spread Pods across Nodes and lead to GPU fragmentation, especially during scaling.
  • Kubernetes scheduling features not supported. PodAffinity, PodAntiAffinity, and topology spread constraints have no effect. Slurm makes all Node placement decisions. Node-level affinity (such as gpu.nvidia.com/model) is supported.
  • Static CPU allocation causes Node drains. Pods with Guaranteed QoS (CPU requests equal to limits) trigger Kubernetes static CPU pinning, which conflicts with Slurm’s CPU accounting and creates resource contention. This leads to Node drains in Slurm. For details and prevention steps, see Static CPU allocation and the SUNK Pod Scheduler.
  • Non-SUNK Pods are invisible to Slurm. Slurm can’t see Pods scheduled through the standard Kubernetes scheduler. Running non-DaemonSet Pods on Slurm Nodes without the SUNK Pod Scheduler can cause resource conflicts and unexpected Node drains.
  • Taints may conflict with Slurm placement. The SUNK Pod Scheduler doesn’t evaluate Kubernetes taints. If Slurm places a Pod on a Node with a conflicting taint, the kubelet rejects the Pod. Check Pod events with kubectl describe pod if a Pod is stuck.
  • Scale-down reconfigure delay. When Nodes are removed from the cluster, there’s about a one-minute delay before Slurm’s configuration updates. During this window, Slurm may try to schedule work onto Nodes that are being removed.
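To illustrate the static CPU allocation item above: Kubernetes assigns Guaranteed QoS only when every container sets CPU and memory limits equal to its requests. The fragment below mirrors the example Pod on this page by setting a CPU request without a CPU limit, which places the Pod in the Burstable class instead. Treat this as a sketch of one possible approach, not the documented prevention procedure. The linked page on static CPU allocation is the authority.

```yaml
# Container resources fragment; quantities are illustrative.
resources:
  requests:
    cpu: "10"
    memory: 10Gi
  limits:
    # No cpu limit is set, so requests and limits differ and the Pod is
    # classed as Burstable rather than Guaranteed.
    memory: 960Gi
```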

Last modified on May 27, 2026