
The SUNK Pod Scheduler lets you run Kubernetes Pods alongside Slurm jobs in the same cluster. Instead of managing separate clusters for different workload types, you can run training with Slurm and inference with Kubernetes on shared Nodes. When a Pod is assigned to the SUNK Pod Scheduler, the scheduler creates a placeholder Slurm job so that Slurm manages Node placement, GPU allocation, and resource tracking for the Pod.

Enable the scheduler

Set scheduler.enabled to true in your Slurm Helm chart values:
scheduler:
  enabled: true
Configure the scheduler’s scope to watch the namespaces where your Pods will be created:
  • To monitor the entire cluster, set scheduler.scope.type to cluster.
  • To monitor specific namespaces, set scheduler.scope.type to namespace and set scheduler.scope.namespaces to a comma-separated list of namespaces. If scheduler.scope.namespaces is left blank, the scheduler defaults to the namespace it is deployed in. An example follows this list.
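For example, a namespace-scoped configuration in your Helm values might look like this (the namespace names tenant-a and tenant-b are placeholders):
scheduler:
  enabled: true
  scope:
    type: namespace
    namespaces: "tenant-a,tenant-b"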

Configure a Pod for scheduling

There are three things every SUNK-scheduled Pod needs: the scheduler name, resource requests, and a valid termination grace period.

1. Set the scheduler name

In your Pod’s spec, set schedulerName to the name of your SUNK Pod Scheduler. The default name is typically <namespace>-<releaseName>, but it can be set manually with scheduler.name in your Helm values.
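For example, using the scheduler name from the example Pod later on this page:
spec:
  schedulerName: tenant-slurm-staging-slurm-scheduler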

2. Set resource requests

Your Pod must set explicit, nonzero CPU and memory resource requests. If these requests are missing or zero, the Pod fails to schedule.
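A minimal sketch of a requests block (the values are illustrative; size them for your Node type and slurmd configuration):
resources:
  requests:
    cpu: "4"
    memory: 16Gi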

3. Set the termination grace period

Set terminationGracePeriodSeconds to a value that is strictly less than the scheduler’s --slurm-kill-wait minus 5 seconds. For example, if --slurm-kill-wait=30s (the default), then terminationGracePeriodSeconds must be less than 25.
The Kubernetes default terminationGracePeriodSeconds is 30 seconds, which exceeds the default threshold of 25 seconds. You must explicitly set this value on your Pod spec. A value of 5 to 20 is typical.
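For example, with the default --slurm-kill-wait=30s, any value below 25 satisfies the rule:
spec:
  terminationGracePeriodSeconds: 10   # must stay below 30s - 5s = 25s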

Look up the scheduler configuration

Run this command to find both the scheduler name and the kill-wait value:
kubectl get pods -l app.kubernetes.io/name=sunk-scheduler -oyaml | yq '.items[0].spec.containers[] | select(.name == "scheduler").args'
Look for --scheduler-name and --slurm-kill-wait in the output:
- --scheduler-name=tenant-slurm-staging-slurm-scheduler
- --slurm-kill-wait=30s

Example Pod

This example shows a full-node GPU Pod with all three required settings:
apiVersion: v1
kind: Pod
metadata:
  name: slurm-test
  namespace: tenant-slurm-staging
  annotations:
    sunk.coreweave.com/account: root
    sunk.coreweave.com/comment: "A Kubernetes job"
    sunk.coreweave.com/exclusive: "none"
spec:
  schedulerName: tenant-slurm-staging-slurm-scheduler
  containers:
    - name: main
      image: ubuntu
      command:
        - /bin/bash
      args:
        - -c
        - sleep infinity
      resources:
        limits:
          memory: 960Gi
          rdma/ib: "1"
          nvidia.com/gpu: "8"
        requests:
          cpu: "10"
          memory: 10Gi
          nvidia.com/gpu: "8"
  terminationGracePeriodSeconds: 5
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu.nvidia.com/model
                operator: In
                values:
                  - H100_NVLINK_80GB
For GPU Pods, set a Node affinity for gpu.nvidia.com/model to target the GPU type you need. The CPU, memory, and GPU values in this example are illustrative. Adjust them based on your Node type and slurmd configuration. See Check available resources for guidance.
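To check which GPU models your Nodes expose before writing the affinity, one option is to list the gpu.nvidia.com/model label on each Node:
kubectl get nodes -L gpu.nvidia.com/model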

Choose a node sharing strategy

Node sharing between your Pod and other workloads is controlled by the sunk.coreweave.com/exclusive annotation. Choose a value based on your workload:
| Workload type | Annotation | Effect |
| --- | --- | --- |
| Pod needs the entire Node (all GPUs, or full CPU Node) | exclusive: "none" | Full Node isolation. No other jobs can share the Node. |
| Multiple Pods or jobs share a Node (partial GPUs, or CPU pool) | exclusive: "user" | Jobs from the same Slurm user can share the Node. |
| Pod can share a Node with any workload | exclusive: "ok" | Open sharing with any user or account. |
For detailed guidance on each scenario, including resource configuration and examples, see Manage resources with the SUNK Pod Scheduler.
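For example, a Pod that should share a Node with other jobs from the same Slurm user would carry this annotation (a minimal metadata sketch):
metadata:
  annotations:
    sunk.coreweave.com/exclusive: "user"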

Annotations reference

These annotations configure Slurm job parameters for SUNK-scheduled Pods. All annotations are optional.
| Annotation | Type | Description | Example |
| --- | --- | --- | --- |
| sunk.coreweave.com/account | string | Slurm account for the job. List accounts with sacctmgr list account. | "root" |
| sunk.coreweave.com/comment | string | Descriptive comment added to the Slurm job. | "A Kubernetes job" |
| sunk.coreweave.com/constraints | string | Node constraints for job placement. See the Slurm sbatch documentation for syntax. | |
| sunk.coreweave.com/exclusive (v5.7.0+) | string | Controls Node sharing. See Exclusive annotation values. | "none", "ok", "user" |
| sunk.coreweave.com/exclusive (before v5.7.0) | boolean | Enables exclusive mode. See the Slurm exclusive documentation. | "true" |
| sunk.coreweave.com/excluded-nodes | string | Comma-separated Slurm node names to exclude. | "a100-092-01,a100-092-02" |
| sunk.coreweave.com/required-nodes | string | Comma-separated Slurm node names required. | "a100-092-04,a100-092-06" |
| sunk.coreweave.com/partition | string | Slurm partition to submit the job to. | "default" |
| sunk.coreweave.com/prefer | string | Node preferences. See the Slurm prefer documentation. | |
| sunk.coreweave.com/priority | integer | Job priority value. Must be an integer. | "100" |
| sunk.coreweave.com/qos | string | Quality of Service level. | "normal" |
| sunk.coreweave.com/reservation | string | Slurm reservation name. | "test" |
| sunk.coreweave.com/user-id | integer | User ID to run the job as. | "1001" |
| sunk.coreweave.com/group-id | integer | Group ID to run the job as. | "1002" |
| sunk.coreweave.com/current-working-dir | string | Working directory path for the job. | "/path/to/use" |
| sunk.coreweave.com/timeout | integer | Time limit in minutes for the Slurm placeholder job. | "123" |
  • Invalid annotation values cause the Pod to fail scheduling.
  • Changes to annotations after a Pod is scheduled do not affect the running job.
  • The Slurm job defaults to the root user (uid=0, gid=0). If user-id is set without group-id, the group ID is set to the user ID value.

Time limits

The sunk.coreweave.com/timeout annotation sets a time limit (in minutes) on the Slurm placeholder job. Unlike a normal Slurm time limit, which begins after the prolog script completes, this limit starts when the Pod is scheduled onto a Node, so it also counts Pod initialization time before the container starts.
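For example, to limit the placeholder job to two hours, set the annotation to 120 minutes (the value is illustrative):
metadata:
  annotations:
    sunk.coreweave.com/timeout: "120"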

Known limitations

  • No gang scheduling. Each Pod is scheduled as a separate Slurm job. Multi-node PodGroups are not supported. The scheduler is best suited for single-node workloads such as inference.
  • No bin-packing. Slurm does not fill partially-used Nodes before moving to idle ones. This can spread Pods across Nodes and lead to GPU fragmentation, especially during scaling.
  • Kubernetes scheduling features not supported. PodAffinity, PodAntiAffinity, and topology spread constraints have no effect. All Node placement decisions are made by Slurm. Node-level affinity (such as gpu.nvidia.com/model) is supported.
  • Static CPU allocation causes Node drains. Pods with Guaranteed QoS (CPU requests equal to limits) trigger Kubernetes static CPU pinning, which conflicts with Slurm’s CPU accounting and creates resource contention. This ultimately leads to Node drains in Slurm. See Static CPU allocation and the SUNK Pod Scheduler for details and prevention steps, and the resources sketch after this list.
  • Non-SUNK Pods are invisible to Slurm. Slurm cannot see Pods scheduled through the standard Kubernetes scheduler. Running non-DaemonSet Pods on Slurm Nodes without the SUNK Pod Scheduler can cause resource conflicts and unexpected Node drains.
  • Taints may conflict with Slurm placement. The SUNK Pod Scheduler does not evaluate Kubernetes taints. If Slurm places a Pod on a Node with a conflicting taint, the kubelet rejects the Pod. Check Pod events with kubectl describe pod if a Pod is stuck.
  • Scale-down reconfigure delay. When Nodes are removed from the cluster, there is an approximately one-minute delay before Slurm’s configuration updates. During this window, Slurm may try to schedule work onto Nodes that are being removed.
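As noted in the static CPU allocation limitation above, one way to keep a Pod out of the Guaranteed QoS class is to omit the CPU limit so that CPU requests and limits are not equal, as the example Pod on this page does. A minimal resources sketch (values illustrative):
resources:
  requests:
    cpu: "10"
    memory: 10Gi
    nvidia.com/gpu: "8"
  limits:
    # no CPU limit, so the Pod is Burstable rather than Guaranteed
    memory: 960Gi
    nvidia.com/gpu: "8"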
