
The SUNK Pod Scheduler lets you run Kubernetes Pods alongside Slurm jobs in the same cluster. Instead of managing separate clusters for different workload types, you can run training with Slurm and inference with Kubernetes on shared Nodes. When a Pod is assigned to the SUNK Pod Scheduler, the scheduler creates a placeholder Slurm job so that Slurm manages Node placement, GPU allocation, and resource tracking for the Pod.

Enable the scheduler

Set scheduler.enabled to true in your Slurm Helm chart values:
scheduler:
  enabled: true
Configure the scheduler’s scope to watch the namespaces where your Pods will be created:
  • To monitor the entire cluster, set scheduler.scope.type to cluster.
  • To monitor specific namespaces, set scheduler.scope.type to namespace and set scheduler.scope.namespaces to a comma-separated list of namespaces. If scheduler.scope.namespaces is left blank, the scheduler defaults to the namespace it is deployed in. An example follows this list.
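For example, a namespace-scoped configuration in your Helm values might look like this (the namespace names tenant-a and tenant-b are placeholders):
scheduler:
  enabled: true
  scope:
    type: namespace
    namespaces: "tenant-a,tenant-b"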

Configure a Pod for scheduling

There are three things every SUNK-scheduled Pod needs: the scheduler name, resource requests, and a valid termination grace period.

1. Set the scheduler name

In your Pod’s spec, set schedulerName to the name of your SUNK Pod Scheduler. The default name is typically <namespace>-<releaseName>, but it can be set manually with scheduler.name in your Helm values.
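For example, using the scheduler name from the example Pod later on this page:
spec:
  schedulerName: tenant-slurm-staging-slurm-scheduler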

2. Set resource requests

Your Pod must set explicit, nonzero CPU and memory resource requests. If these requests are missing or zero, the Pod fails to schedule.
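A minimal sketch of a requests block (the values are illustrative; size them for your Node type and slurmd configuration):
resources:
  requests:
    cpu: "4"
    memory: 16Gi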

3. Set the termination grace period

Set terminationGracePeriodSeconds to a value that is strictly less than the scheduler’s --slurm-kill-wait minus 5 seconds. For example, if --slurm-kill-wait=30s (the default), then terminationGracePeriodSeconds must be less than 25.
The Kubernetes default terminationGracePeriodSeconds is 30 seconds, which exceeds the default threshold of 25 seconds. You must explicitly set this value on your Pod spec. A value of 5 to 20 is typical.
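For example, with the default --slurm-kill-wait=30s, any value below 25 satisfies the rule:
spec:
  terminationGracePeriodSeconds: 10   # must stay below 30s - 5s = 25s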

Look up the scheduler configuration

Run this command to find both the scheduler name and the kill-wait value:
kubectl get pods -l app.kubernetes.io/name=sunk-scheduler -oyaml | yq '.items[0].spec.containers[] | select(.name == "scheduler").args'
Look for --scheduler-name and --slurm-kill-wait in the output:
- --scheduler-name=tenant-slurm-staging-slurm-scheduler
- --slurm-kill-wait=30s

Example Pod

This example shows a full-node GPU Pod with all three required settings:
apiVersion: v1
kind: Pod
metadata:
  name: slurm-test
  namespace: tenant-slurm-staging
  annotations:
    sunk.coreweave.com/account: root
    sunk.coreweave.com/comment: "A Kubernetes job"
    sunk.coreweave.com/exclusive: "none"
spec:
  schedulerName: tenant-slurm-staging-slurm-scheduler
  containers:
    - name: main
      image: ubuntu
      command:
        - /bin/bash
      args:
        - -c
        - sleep infinity
      resources:
        limits:
          memory: 960Gi
          rdma/ib: "1"
          nvidia.com/gpu: "8"
        requests:
          cpu: "10"
          memory: 10Gi
          nvidia.com/gpu: "8"
  terminationGracePeriodSeconds: 5
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu.nvidia.com/model
                operator: In
                values:
                  - H100_NVLINK_80GB
For GPU Pods, set a Node affinity for gpu.nvidia.com/model to target the GPU type you need. The CPU, memory, and GPU values in this example are illustrative. Adjust them based on your Node type and slurmd configuration. See Check available resources for guidance.
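To check which GPU models your Nodes expose before writing the affinity, one option is to list the gpu.nvidia.com/model label on each Node:
kubectl get nodes -L gpu.nvidia.com/model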

Choose a node sharing strategy

Node sharing between your Pod and other workloads is controlled by the sunk.coreweave.com/exclusive annotation. Choose a value based on your workload:
| Workload type | Annotation | Effect |
| --- | --- | --- |
| Pod needs the entire Node (all GPUs, or full CPU Node) | exclusive: "none" | Full Node isolation. No other jobs can share the Node. |
| Multiple Pods or jobs share a Node (partial GPUs, or CPU pool) | exclusive: "user" | Jobs from the same Slurm user can share the Node. |
| Pod can share a Node with any workload | exclusive: "ok" | Open sharing with any user or account. |
For detailed guidance on each scenario, including resource configuration and examples, see Manage resources with the SUNK Pod Scheduler.
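For example, a Pod that should share a Node with other jobs from the same Slurm user would carry this annotation (a minimal metadata sketch):
metadata:
  annotations:
    sunk.coreweave.com/exclusive: "user"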

Annotations reference

These annotations configure Slurm job parameters for SUNK-scheduled Pods. All annotations are optional.
| Annotation | Type | Description | Example |
| --- | --- | --- | --- |
| sunk.coreweave.com/account | string | Slurm account for the job. List accounts with sacctmgr list account. | "root" |
| sunk.coreweave.com/comment | string | Descriptive comment added to the Slurm job. | "A Kubernetes job" |
| sunk.coreweave.com/constraints | string | Node constraints for job placement. See the Slurm sbatch documentation for syntax. | |
| sunk.coreweave.com/exclusive (v5.7.0+) | string | Controls Node sharing. See Exclusive annotation values. | "none", "ok", "user" |
| sunk.coreweave.com/exclusive (before v5.7.0) | boolean | Enables exclusive mode. See the Slurm exclusive documentation. | "true" |
| sunk.coreweave.com/excluded-nodes | string | Comma-separated Slurm node names to exclude. | "a100-092-01,a100-092-02" |
| sunk.coreweave.com/required-nodes | string | Comma-separated Slurm node names required. | "a100-092-04,a100-092-06" |
| sunk.coreweave.com/partition | string | Slurm partition to submit the job to. | "default" |
| sunk.coreweave.com/prefer | string | Node preferences. See the Slurm prefer documentation. | |
| sunk.coreweave.com/priority | integer | Job priority value. Must be an integer. | "100" |
| sunk.coreweave.com/qos | string | Quality of Service level. | "normal" |
| sunk.coreweave.com/reservation | string | Slurm reservation name. | "test" |
| sunk.coreweave.com/user-id | integer | User ID to run the job as. | "1001" |
| sunk.coreweave.com/group-id | integer | Group ID to run the job as. | "1002" |
| sunk.coreweave.com/current-working-dir | string | Working directory path for the job. | "/path/to/use" |
| sunk.coreweave.com/timeout | integer | Time limit in minutes for the Slurm placeholder job. | "123" |
  • Invalid annotation values cause the Pod to fail scheduling.
  • Changes to annotations after a Pod is scheduled do not affect the running job.
  • The Slurm job defaults to the root user (uid=0, gid=0). If user-id is set without group-id, the group ID is set to the user ID value.

Time limits

The sunk.coreweave.com/timeout annotation sets a time limit (in minutes) on the Slurm placeholder job. Unlike a normal Slurm time limit, which begins after the prolog script completes, this limit starts when the Pod is scheduled onto a Node, so it also counts Pod initialization time before the container starts.
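For example, to limit the placeholder job to two hours, set the annotation to 120 minutes (the value is illustrative):
metadata:
  annotations:
    sunk.coreweave.com/timeout: "120"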

Known limitations

  • No gang scheduling. Each Pod is scheduled as a separate Slurm job. Multi-node PodGroups are not supported. The scheduler is best suited for single-node workloads such as inference.
  • No bin-packing. Slurm does not fill partially-used Nodes before moving to idle ones. This can spread Pods across Nodes and lead to GPU fragmentation, especially during scaling.
  • Kubernetes scheduling features not supported. PodAffinity, PodAntiAffinity, and topology spread constraints have no effect. All Node placement decisions are made by Slurm. Node-level affinity (such as gpu.nvidia.com/model) is supported.
  • Static CPU allocation causes Node drains. Pods with Guaranteed QoS (CPU requests equal to limits) trigger Kubernetes static CPU pinning, which conflicts with Slurm’s CPU accounting and creates resource contention. This ultimately leads to Node drains in Slurm. See Static CPU allocation and the SUNK Pod Scheduler for details and prevention steps, and the resources sketch after this list.
  • Non-SUNK Pods are invisible to Slurm. Slurm cannot see Pods scheduled through the standard Kubernetes scheduler. Running non-DaemonSet Pods on Slurm Nodes without the SUNK Pod Scheduler can cause resource conflicts and unexpected Node drains.
  • Taints may conflict with Slurm placement. The SUNK Pod Scheduler does not evaluate Kubernetes taints. If Slurm places a Pod on a Node with a conflicting taint, the kubelet rejects the Pod. Check Pod events with kubectl describe pod if a Pod is stuck.
  • Scale-down reconfigure delay. When Nodes are removed from the cluster, there is an approximately one-minute delay before Slurm’s configuration updates. During this window, Slurm may try to schedule work onto Nodes that are being removed.
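As noted in the static CPU allocation limitation above, one way to keep a Pod out of the Guaranteed QoS class is to omit the CPU limit so that CPU requests and limits are not equal, as the example Pod on this page does. A minimal resources sketch (values illustrative):
resources:
  requests:
    cpu: "10"
    memory: 10Gi
    nvidia.com/gpu: "8"
  limits:
    # no CPU limit, so the Pod is Burstable rather than Guaranteed
    memory: 960Gi
    nvidia.com/gpu: "8"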
