The SUNK Pod Scheduler lets you run Kubernetes Pods alongside Slurm jobs in the same cluster. Instead of managing separate clusters for different workload types, you can run training with Slurm and inference with Kubernetes on shared Nodes. When a Pod is assigned to the SUNK Pod Scheduler, it creates a placeholder Slurm job so that Slurm manages Node placement, GPU allocation, and resource tracking for the Pod. This page walks through how to enable the scheduler, configure a Pod to use it, choose a Node sharing strategy, and the annotations and limitations you should know about. It’s intended for cluster operators and workload owners who want to consolidate Slurm and Kubernetes workloads on shared infrastructure.
Enable the scheduler
Before you can submit Pods to the SUNK Pod Scheduler, you must enable it in the Slurm Helm chart and tell it which namespaces to watch. Set `scheduler.enabled` to `true` in your Slurm Helm chart values, then choose the scope:
- To monitor the entire cluster, set `scheduler.scope.type` to `cluster`.
- To monitor specific namespaces, set `scheduler.scope.type` to `namespace` and set `scheduler.scope.namespaces` to a comma-separated list of namespaces. If `scheduler.scope.namespaces` is blank, the scheduler defaults to the namespace where it’s deployed.
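As a rough sketch, the settings above might look like this in a Helm values file. The nesting is inferred from the dotted key paths, and the namespace names `team-a` and `team-b` are hypothetical, so check your chart’s values reference for the exact structure.

```yaml
# Slurm Helm chart values (sketch; nesting inferred from the dotted keys above)
scheduler:
  enabled: true
  scope:
    # "cluster" watches every namespace; "namespace" watches only the listed ones
    type: namespace
    # Comma-separated list; these namespace names are hypothetical
    namespaces: "team-a,team-b"
```

To watch the whole cluster instead, the same sketch would set `type: cluster` and omit `namespaces`.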
Configure a Pod for scheduling
With the scheduler enabled, each Pod that targets it needs three pieces of configuration: the scheduler name, resource requests, and a valid termination grace period. The following sections describe each requirement.
Set the scheduler name
In your Pod’s `spec`, set `schedulerName` to the name of your SUNK Pod Scheduler so Kubernetes routes the Pod to SUNK instead of the default scheduler. The default name is `<namespace>-<releaseName>`, but you can set it manually with `scheduler.name` in your Helm values.
Set resource requests
Your Pod must request specific CPU and memory resources so Slurm can account for the Pod against Node capacity. If these resource requests are missing or zero, the Pod fails to schedule.
Set the termination grace period
Set `terminationGracePeriodSeconds` to a value strictly less than the scheduler’s `--slurm-kill-wait` minus 5 seconds. This keeps Kubernetes Pod termination aligned with Slurm’s job termination window so the placeholder job and the Pod tear down cleanly. For example, if `--slurm-kill-wait=30s` (the default), then `terminationGracePeriodSeconds` must be less than 25.
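The arithmetic can be written out as a Pod spec fragment. This is a minimal sketch that assumes the scheduler runs with the default `--slurm-kill-wait=30s`; the value 20 is only one valid choice.

```yaml
# Pod spec fragment (sketch)
# Assumes --slurm-kill-wait=30s (the default):
#   30 - 5 = 25, and the value must be strictly less than 25,
#   so 24 is the largest setting that satisfies the rule.
spec:
  terminationGracePeriodSeconds: 20
```

Kubernetes applies a default of 30 seconds when the field is omitted, which would not satisfy the rule under this assumption, so set the field explicitly.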
Look up the scheduler configuration
If you don’t know the scheduler name or the current `--slurm-kill-wait` value, you can read both directly from the running scheduler Pod. Inspect the scheduler Pod’s spec (for example, with `kubectl get pod <scheduler-pod> -n <namespace> -o yaml`, where both placeholders stand for your own values) and look for `--scheduler-name` and `--slurm-kill-wait` in the output.
Example Pod
A full-node GPU Pod combines all three required settings. Use `gpu.nvidia.com/model` in Node-level affinity to target the GPU type you need. The right CPU, memory, and GPU values depend on your Node type and slurmd configuration. For guidance, see Check available resources.
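A full-node GPU Pod with the three required settings might look like the following sketch. The Pod name, namespace, scheduler name, image, and GPU model value are hypothetical, the resource figures are illustrative, and `nvidia.com/gpu` is assumed to be the GPU resource name exposed on your Nodes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-example          # hypothetical
  namespace: tenant-slurm          # hypothetical
  annotations:
    # Full Node isolation (see "Choose a Node sharing strategy")
    sunk.coreweave.com/exclusive: "none"
spec:
  # 1. Scheduler name (hypothetical; the default is <namespace>-<releaseName>)
  schedulerName: tenant-slurm-slurm
  # 2. Grace period below --slurm-kill-wait minus 5 seconds
  #    (under 25 with the 30s default)
  terminationGracePeriodSeconds: 20
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu.nvidia.com/model
                operator: In
                values:
                  - H100           # hypothetical label value
  containers:
    - name: main
      image: registry.example.com/inference:latest   # hypothetical
      resources:
        # 3. Non-zero CPU and memory requests (illustrative values)
        requests:
          cpu: "64"
          memory: 512Gi
        # No CPU limit is set, so the Pod does not get Guaranteed QoS
        # (see "Known limitations")
        limits:
          nvidia.com/gpu: 8
```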
Choose a Node sharing strategy
After the Pod has the required fields, decide how it shares Nodes with other workloads. The `sunk.coreweave.com/exclusive` annotation controls this behavior, and the right value depends on your workload:
| Workload type | Annotation | Effect |
|---|---|---|
| Pod needs the entire Node (all GPUs, or full CPU Node) | exclusive: "none" | Full Node isolation. No other jobs can share the Node. |
| Multiple Pods or jobs share a Node (partial GPUs, or CPU pool) | exclusive: "user" | Jobs from the same Slurm user can share the Node. |
| Pod can share a Node with any workload | exclusive: "ok" | Open sharing with any user or account. |
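The table shows the annotation in shortened form. On a Pod, the full key goes under `metadata.annotations`, as in this minimal sketch for a Pod that shares a Node with other jobs from the same Slurm user:

```yaml
# Pod metadata fragment (sketch)
metadata:
  annotations:
    # Jobs from the same Slurm user may share the Node
    sunk.coreweave.com/exclusive: "user"
```

Swapping the value for `"none"` or `"ok"` selects the other two rows of the table.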
Annotations reference
These annotations configure Slurm job parameters for SUNK-scheduled Pods. All annotations are optional.
| Annotation | Type | Description | Example |
|---|---|---|---|
| sunk.coreweave.com/account | string | Slurm account for the job. List accounts with sacctmgr list account. | "root" |
| sunk.coreweave.com/comment | string | Descriptive comment to add to the Slurm job. | "A Kubernetes job" |
| sunk.coreweave.com/constraints | string | Node constraints for job placement. See the Slurm sbatch documentation for syntax. | |
| sunk.coreweave.com/exclusive (v5.7.0+) | string | Controls Node sharing. See Exclusive annotation values. | "none", "ok", "user" |
| sunk.coreweave.com/exclusive (before v5.7.0) | boolean | Enables exclusive mode. See the Slurm exclusive documentation. | "true" |
| sunk.coreweave.com/excluded-nodes | string | Comma-separated Slurm node names to exclude. | "a100-092-01,a100-092-02" |
| sunk.coreweave.com/required-nodes | string | Comma-separated Slurm node names required. | "a100-092-04,a100-092-06" |
| sunk.coreweave.com/partition | string | Slurm partition to submit the job to. | "default" |
| sunk.coreweave.com/prefer | string | Node preferences. See Slurm prefer documentation. | |
| sunk.coreweave.com/priority | integer | Job priority value. Must be an integer. | "100" |
| sunk.coreweave.com/qos | string | Quality of Service level. | "normal" |
| sunk.coreweave.com/reservation | string | Slurm reservation name. | "test" |
| sunk.coreweave.com/user-id | integer | User ID to run the job as. | "1001" |
| sunk.coreweave.com/group-id | integer | Group ID to run the job as. | "1002" |
| sunk.coreweave.com/current-working-dir | string | Working directory path for the job. | "/path/to/use" |
| sunk.coreweave.com/timeout | integer | Time limit in minutes for the Slurm placeholder job. | "123" |
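Several annotations can be combined on one Pod. The sketch below reuses the example values from the table. Kubernetes stores annotation values as strings, so integer-typed values such as the priority and timeout are quoted.

```yaml
# Pod metadata fragment (sketch; values taken from the Example column)
metadata:
  annotations:
    sunk.coreweave.com/account: "root"
    sunk.coreweave.com/partition: "default"
    sunk.coreweave.com/qos: "normal"
    sunk.coreweave.com/priority: "100"   # integer, written as a quoted string
    sunk.coreweave.com/timeout: "123"    # time limit in minutes
```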
Time limits
The `sunk.coreweave.com/timeout` annotation sets a time limit (in minutes) on the Slurm placeholder job. The Kubernetes time limit begins when the scheduler places the Pod onto a Node, which includes Pod initialization time before the container starts. This is earlier than Slurm’s normal time limit, which begins after the prolog script completes.
Known limitations
Before relying on the SUNK Pod Scheduler for production workloads, review the following constraints so you can plan around them.
- No gang scheduling. The scheduler schedules each Pod as a separate Slurm job. Multi-node PodGroups aren’t supported. The scheduler is best suited for single-node workloads such as inference.
- No bin-packing. Slurm doesn’t fill partially-used Nodes before moving to idle ones. This can spread Pods across Nodes and lead to GPU fragmentation, especially during scaling.
- Kubernetes scheduling features not supported. PodAffinity, PodAntiAffinity, and topology spread constraints have no effect. Slurm makes all Node placement decisions. Node-level affinity (such as `gpu.nvidia.com/model`) is supported.
- Static CPU allocation causes Node drains. Pods with Guaranteed QoS (CPU requests equal to limits) trigger Kubernetes static CPU pinning, which conflicts with Slurm’s CPU accounting and creates resource contention. This leads to Node drains in Slurm. For details and prevention steps, see Static CPU allocation and the SUNK Pod Scheduler.
- Non-SUNK Pods are invisible to Slurm. Slurm can’t see Pods scheduled through the standard Kubernetes scheduler. Running non-DaemonSet Pods on Slurm Nodes without the SUNK Pod Scheduler can cause resource conflicts and unexpected Node drains.
- Taints may conflict with Slurm placement. The SUNK Pod Scheduler doesn’t evaluate Kubernetes taints. If Slurm places a Pod on a Node with a conflicting taint, the kubelet rejects the Pod. Check Pod events with `kubectl describe pod` if a Pod is stuck.
- Scale-down reconfigure delay. When Nodes are removed from the cluster, there’s about a one-minute delay before Slurm’s configuration updates. During this window, Slurm may try to schedule work onto Nodes that are being removed.
Next steps
- Manage resources with the SUNK Pod Scheduler for Node sharing configuration, resource tuning, and GPU allocation guidance.