The SUNK Pod Scheduler lets you run Kubernetes Pods alongside Slurm jobs in the same cluster. Instead of managing separate clusters for different workload types, you can run training with Slurm and inference with Kubernetes on shared Nodes. When a Pod is assigned to the SUNK Pod Scheduler, it creates a placeholder Slurm job so that Slurm manages Node placement, GPU allocation, and resource tracking for the Pod.
Enable the scheduler
Set scheduler.enabled to true in your Slurm Helm chart values, then configure the scheduler’s scope (a values sketch follows this list):
- To monitor the entire cluster, set scheduler.scope.type to cluster.
- To monitor specific namespaces, set scheduler.scope.type to namespace and set scheduler.scope.namespaces to a comma-separated list of namespaces. If scheduler.scope.namespaces is left blank, the scheduler defaults to the namespace it is deployed in.
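A minimal sketch of the relevant Helm values, assuming namespace-scoped monitoring; the namespace names are placeholders:

```yaml
scheduler:
  enabled: true
  scope:
    # Use "cluster" to monitor the entire cluster instead.
    type: namespace
    # Placeholder namespaces; defaults to the release namespace if left blank.
    namespaces: "team-a,team-b"
```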
Configure a Pod for scheduling
There are three things every SUNK-scheduled Pod needs: the scheduler name, resource requests, and a valid termination grace period.
1. Set the scheduler name
In your Pod’s spec, set schedulerName to the name of your SUNK Pod Scheduler. The default name is typically <namespace>-<releaseName>, but it can be set manually with scheduler.name in your Helm values.
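For example, in the Pod manifest (the scheduler name here is a placeholder; substitute the value from your deployment):

```yaml
spec:
  schedulerName: slurm-scheduler  # placeholder; typically <namespace>-<releaseName>
```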
2. Set resource requests
Your Pod must request specific CPU and memory resources. If these resource requests are missing or zero, the Pod will fail to schedule.
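A minimal sketch of a container with nonzero CPU and memory requests; the quantities are illustrative:

```yaml
spec:
  containers:
    - name: main
      image: nginx:latest
      resources:
        requests:
          cpu: "4"       # must be nonzero
          memory: 16Gi   # must be nonzero
```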
3. Set the termination grace period
Set terminationGracePeriodSeconds to a value that is strictly less than the scheduler’s --slurm-kill-wait minus 5 seconds. For example, if --slurm-kill-wait=30s (the default), then terminationGracePeriodSeconds must be less than 25.
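For example, with the default --slurm-kill-wait=30s, any value under 25 works:

```yaml
spec:
  terminationGracePeriodSeconds: 20  # strictly less than 30s kill-wait minus 5s
```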
Look up the scheduler configuration
To find both the scheduler name and the kill-wait value, inspect the scheduler’s container arguments and look for --scheduler-name and --slurm-kill-wait in the output.
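A sketch of one way to do this with kubectl, assuming the scheduler runs as a Deployment named <releaseName>-scheduler in your Slurm namespace; adjust both names to your deployment:

```bash
# Print the scheduler container's arguments; look for --scheduler-name
# and --slurm-kill-wait in the output.
kubectl get deployment <releaseName>-scheduler -n <namespace> \
  -o jsonpath='{.spec.template.spec.containers[0].args}'
```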
Example Pod
This example shows a full-node GPU Pod with all three required settings. Use a Node affinity on the gpu.nvidia.com/model label to target the GPU type you need. The CPU, memory, and GPU values in this example are illustrative; adjust them based on your Node type and slurmd configuration. See Check available resources for guidance.
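A sketch of such a Pod manifest; the scheduler name, GPU model label value, and resource quantities are assumptions to adapt:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: full-node-gpu-pod
  annotations:
    # Full Node isolation (see Choose a node sharing strategy below).
    sunk.coreweave.com/exclusive: "none"
spec:
  # 1. Scheduler name: placeholder; typically <namespace>-<releaseName>.
  schedulerName: slurm-scheduler
  # 3. Strictly less than --slurm-kill-wait minus 5s (under 25 for the 30s default).
  terminationGracePeriodSeconds: 20
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu.nvidia.com/model
                operator: In
                values:
                  - H100_NVLINK_80GB  # illustrative model label value
  containers:
    - name: main
      image: nginx:latest
      resources:
        # 2. Illustrative quantities. CPU limits are omitted so the Pod does not
        # get Guaranteed QoS (see Known limitations on static CPU pinning).
        requests:
          cpu: "16"
          memory: 128Gi
          nvidia.com/gpu: 8
        limits:
          nvidia.com/gpu: 8
```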
Choose a node sharing strategy
Whether your Pod shares Nodes with other workloads is controlled by the sunk.coreweave.com/exclusive annotation. The right value depends on your workload:
| Workload type | Annotation | Effect |
|---|---|---|
| Pod needs the entire Node (all GPUs, or full CPU Node) | exclusive: "none" | Full Node isolation. No other jobs can share the Node. |
| Multiple Pods or jobs share a Node (partial GPUs, or CPU pool) | exclusive: "user" | Jobs from the same Slurm user can share the Node. |
| Pod can share a Node with any workload | exclusive: "ok" | Open sharing with any user or account. |
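For example, to let jobs from the same Slurm user share the Node:

```yaml
metadata:
  annotations:
    sunk.coreweave.com/exclusive: "user"
```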
Annotations reference
These annotations configure Slurm job parameters for SUNK-scheduled Pods. All annotations are optional.
| Annotation | Type | Description | Example |
|---|---|---|---|
| sunk.coreweave.com/account | string | Slurm account for the job. List accounts with sacctmgr list account. | "root" |
| sunk.coreweave.com/comment | string | Descriptive comment added to the Slurm job. | "A Kubernetes job" |
| sunk.coreweave.com/constraints | string | Node constraints for job placement. See the Slurm sbatch documentation for syntax. | |
| sunk.coreweave.com/exclusive (v5.7.0+) | string | Controls Node sharing. See Choose a node sharing strategy. | "none", "ok", "user" |
| sunk.coreweave.com/exclusive (before v5.7.0) | boolean | Enables exclusive mode. See the Slurm exclusive documentation. | "true" |
| sunk.coreweave.com/excluded-nodes | string | Comma-separated Slurm node names to exclude. | "a100-092-01,a100-092-02" |
| sunk.coreweave.com/required-nodes | string | Comma-separated Slurm node names the job requires. | "a100-092-04,a100-092-06" |
| sunk.coreweave.com/partition | string | Slurm partition to submit the job to. | "default" |
| sunk.coreweave.com/prefer | string | Node preferences. See the Slurm prefer documentation. | |
| sunk.coreweave.com/priority | integer | Job priority value. | "100" |
| sunk.coreweave.com/qos | string | Quality of Service level. | "normal" |
| sunk.coreweave.com/reservation | string | Slurm reservation name. | "test" |
| sunk.coreweave.com/user-id | integer | User ID to run the job as. | "1001" |
| sunk.coreweave.com/group-id | integer | Group ID to run the job as. | "1002" |
| sunk.coreweave.com/current-working-dir | string | Working directory path for the job. | "/path/to/use" |
| sunk.coreweave.com/timeout | integer | Time limit in minutes for the Slurm placeholder job. | "123" |
Time limits
The sunk.coreweave.com/timeout annotation sets a time limit (in minutes) on the Slurm placeholder job. For SUNK-scheduled Pods, the time limit begins when the Pod is scheduled onto a Node, which includes Pod initialization time before the container starts. This is earlier than Slurm’s normal time limit, which begins after the prolog script completes.
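For example, to cap the placeholder job at one hour:

```yaml
metadata:
  annotations:
    sunk.coreweave.com/timeout: "60"  # minutes
```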
Known limitations
- No gang scheduling. Each Pod is scheduled as a separate Slurm job. Multi-node PodGroups are not supported. The scheduler is best suited for single-node workloads such as inference.
- No bin-packing. Slurm does not fill partially-used Nodes before moving to idle ones. This can spread Pods across Nodes and lead to GPU fragmentation, especially during scaling.
- Kubernetes scheduling features not supported. PodAffinity, PodAntiAffinity, and topology spread constraints have no effect. All Node placement decisions are made by Slurm. Node-level affinity (such as gpu.nvidia.com/model) is supported.
- Static CPU allocation causes Node drains. Pods with Guaranteed QoS (CPU requests equal to limits) trigger Kubernetes static CPU pinning, which conflicts with Slurm’s CPU accounting and creates resource contention. This ultimately leads to Node drains in Slurm. See Static CPU allocation and the SUNK Pod Scheduler for details and prevention steps.
- Non-SUNK Pods are invisible to Slurm. Slurm cannot see Pods scheduled through the standard Kubernetes scheduler. Running non-DaemonSet Pods on Slurm Nodes without the SUNK Pod Scheduler can cause resource conflicts and unexpected Node drains.
- Taints may conflict with Slurm placement. The SUNK Pod Scheduler does not evaluate Kubernetes taints. If Slurm places a Pod on a Node with a conflicting taint, the kubelet rejects the Pod. Check Pod events with kubectl describe pod if a Pod is stuck.
- Scale-down reconfigure delay. When Nodes are removed from the cluster, there is an approximately one-minute delay before Slurm’s configuration updates. During this window, Slurm may try to schedule work onto Nodes that are being removed.
Next steps
- Manage resources with the SUNK Pod Scheduler for Node sharing configuration, resource tuning, and GPU allocation guidance.