Schedule Kubernetes Pods with Slurm
This conceptual explanation is for customers who want to schedule Kubernetes Pods with the Slurm scheduler and understand the benefits of doing so.
If your Slurm cluster includes a Scheduler component (which is part of SUNK by default), you can manage Kubernetes Pod scheduling within your Slurm cluster nodes, allowing for a unified approach to workload management.
Kubernetes Nodes are capitalized as proper nouns in this documentation, while Slurm nodes are lowercase.
In SUNK, each Slurm node is a Kubernetes Pod running a `slurmd` container. This design enables dynamic provisioning and scaling of Slurm nodes using native Kubernetes mechanisms. Note that Slurm nodes (Pods) are distinct from Kubernetes Nodes, which are the underlying worker machines that host these Pods.
In a typical research cluster, jobs are submitted to and monitored in Slurm with familiar commands like `srun` and `sinfo`. However, some workloads, such as real-time inference, are better suited to run as standalone Kubernetes Pods. Managing Kubernetes and Slurm clusters separately can be cumbersome, and dynamically moving Nodes between them is often impractical. To address this, SUNK includes the SUNK Scheduler, a custom Kubernetes scheduler developed by CoreWeave that lets users schedule both Slurm jobs and Kubernetes Pods in the same cluster, so that training and inference workloads can coexist seamlessly. The SUNK Scheduler is enabled by default.
As a prerequisite for scheduling Kubernetes Pods with Slurm, the Slurm cluster must be deployed with the SUNK Scheduler. When a Pod is marked for scheduling via the SUNK Scheduler, the scheduler creates a placeholder job in Slurm that the Slurm scheduler can manage, then binds the Pod to the Node selected by Slurm.
The SUNK Scheduler operates as a conventional Kubernetes reconciler, watching Pod objects and events from the Slurm Controller. It ensures that Kubernetes Pods are scheduled on the same Kubernetes Nodes as Slurm nodes; that the state of the Slurm cluster is synchronized with Kubernetes; and that events in Kubernetes are propagated into Slurm. It can preempt Kubernetes workloads in favor of Slurm jobs running on the same Node, or vice versa, using the same logic.
Benefits
Kubernetes and Slurm are both powerful workload management and scheduling tools, each with different strengths. Kubernetes excels at container orchestration and microservice management, while Slurm is designed for high-performance computing (HPC) workloads and job scheduling.
By using SUNK to integrate Slurm with Kubernetes, you can:
- Leverage the strengths of both systems. Use Kubernetes for container orchestration and Slurm for HPC job scheduling.
- Gain flexibility. Run Kubernetes and Slurm workloads side by side in the same cluster, such as training with Slurm and serving inference with Kubernetes, while sharing compute resources to maximize efficiency.
- Simplify cluster management. Manage both Kubernetes and Slurm workloads from a single platform.
Limitations
When scheduling Kubernetes Pods on the same Nodes as Slurm jobs, be aware of the following limitations:
- Node sharing: Either a Slurm job or a Kubernetes Pod can run on a given Node, but not both at the same time.
- Memory availability: Avoid setting high memory requests for the `slurmd` container, as this can reduce scheduling flexibility for both Slurm jobs and Kubernetes Pods. When memory requests are too high, fewer workloads can be placed on a Node, leading to underutilization and longer queue or wait times. Configuring more conservative memory requests (e.g., 50% of the Node's memory) allows for more efficient use of resources across the cluster.
Check the scheduler scope
First, ensure the Scheduler component is set up to watch the namespaces where your Pods are created. The Scheduler may be configured to watch one or multiple namespaces, or even the entire Kubernetes cluster. Set this via the `scheduler.scope.type` and `scheduler.scope.namespaces` parameters.
To monitor the entire cluster, set `scheduler.scope.type` to `cluster` and ensure you have access to the scheduler's ClusterRole deployed with the SUNK chart.
To monitor specific namespaces, set `scheduler.scope.type` to `namespace` and set `scheduler.scope.namespaces` to a comma-separated list of namespaces. If `scheduler.scope.namespaces` is left blank, the scheduler defaults to the namespace it's deployed in.
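For example, a values fragment that scopes the Scheduler to two namespaces might look like the following sketch. The namespace names are placeholders, and the exact layout should be checked against your SUNK chart's values schema:

```yaml
scheduler:
  scope:
    type: namespace                  # or "cluster" to watch the entire Kubernetes cluster
    namespaces: "tenant-a,tenant-b"  # comma-separated; leave blank to default to the scheduler's own namespace
```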
Pod configuration
In your Pod's `spec`, assign `spec.schedulerName` to match the name of the Slurm Scheduler. The default name is typically `<namespace>-<releaseName>`, but it can be set manually in the values with `scheduler.name`, so check your deployment details to find the exact name.
You can check the value of the `--scheduler-name` parameter on the Scheduler Pod:
```
$ kubectl get pods -l app.kubernetes.io/name=sunk-scheduler -oyaml | yq '.items[0].spec.containers[] | select(.name == "scheduler").args'
- --health-probe-bind-address=:8081
- --metrics-bind-address=:8080
- --hooks-api-bind-address=:8000
- --zap-log-level=debug
- --components
- scheduler
- --slurm-auth-token=$(SLURM_TOKEN)
- --slurm-api-base=slurm-controller:6817
- --slurm-kill-wait=30s
- --watch-namespace=slurm-staging
- --scheduler-name=tenant-slurm-staging-slurm-scheduler
```
In this example, the scheduler name is `tenant-slurm-staging-slurm-scheduler`.
Your Pods must request specific CPU and memory resources. If these resource requests aren't set or are zero, the Pod will fail to be scheduled. If your workload requires GPUs, define GPU resource requests. You may also set a Node affinity for `gpu.nvidia.com/class` to specify the GPU type. For example:
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: gpu.nvidia.com/class
              operator: In
              values:
                - A100_PCIE_80GB
```
Certain resources, such as CPU and memory, are shared between Kubernetes and Slurm. To avoid errors, configure your workload's resources with both requirements in mind. See Manage Resources for more information.
The Pod's `terminationGracePeriodSeconds` must be less than the configured `--slurm-kill-wait` value passed to the scheduler, minus 5 seconds. For example, if `--slurm-kill-wait=30s`, the maximum `terminationGracePeriodSeconds` is `24`, because 24 < 30 - 5. Pods that fail this check are not scheduled and must be recreated with a valid `terminationGracePeriodSeconds`.
When deploying via the Slurm chart, the value passed to the scheduler is configured to match the value of `KillWait` in the Slurm config. If the Slurm config is modified outside of the chart, the scheduler deployment must also be updated with the new value.
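For example, with `--slurm-kill-wait=30s` as in the output shown earlier, a minimal Pod spec fragment that satisfies both the resource-request and grace-period rules might look like the following sketch (all values are illustrative):

```yaml
spec:
  schedulerName: tenant-slurm-staging-slurm-scheduler
  terminationGracePeriodSeconds: 24   # valid because 24 < 30 - 5
  containers:
    - name: main
      image: ubuntu
      resources:
        requests:
          cpu: "4"      # CPU and memory requests must be set and non-zero
          memory: 16Gi
```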
Scheduling Kubernetes Pods with annotations
Optional annotations can be added to a Pod to configure Slurm-specific settings, such as the account to use, job priority, or the partition.
SUNK Annotations Reference
The following annotations can be used to configure Slurm job parameters when scheduling Kubernetes Pods through SUNK.
Annotation | Type | Description | Example Value |
---|---|---|---|
sunk.coreweave.com/account | string | The Slurm account to use for the job. List available accounts with sacctmgr list account. | "root" |
sunk.coreweave.com/comment | string | A descriptive comment to add to the Slurm job. | "A Kubernetes job" |
sunk.coreweave.com/constraints | string | Node constraints for job placement. See the Slurm sbatch documentation for syntax. | |
For SUNK v5.7.0 and later: sunk.coreweave.com/exclusive | string | Run the job in exclusive mode, preventing other jobs from sharing the same nodes. See the Slurm exclusive documentation. Note: Refer to GPU resources to avoid problems when requesting GPU resources. | none, ok, user, mcs, topo |
For SUNK before v5.7.0: sunk.coreweave.com/exclusive | boolean | Run the job in exclusive mode, preventing other jobs from sharing the same nodes. See the Slurm exclusive documentation. Note: Refer to GPU resources to avoid problems when requesting GPU resources. | "true" |
sunk.coreweave.com/excluded-nodes | string | Comma-separated list of Slurm node names to exclude from job placement. | "a100-092-01,a100-092-02" |
sunk.coreweave.com/required-nodes | string | Comma-separated list of Slurm node names required for job placement. | "a100-092-04,a100-092-06" |
sunk.coreweave.com/partition | string | The Slurm partition to submit the job to. | "default" |
sunk.coreweave.com/prefer | string | Node preferences for job placement. See the Slurm prefer documentation for syntax. | |
sunk.coreweave.com/priority | integer | Job priority value. Must be an integer; non-integer values will cause scheduling to fail. | "100" |
sunk.coreweave.com/qos | string | Quality of Service (QoS) level for the job. | "normal" |
sunk.coreweave.com/reservation | string | Slurm reservation name to use for the job. | "test" |
sunk.coreweave.com/user-id | integer | User ID to run the job as. List users and IDs with id <user>. | "1001" |
sunk.coreweave.com/group-id | integer | Group ID to run the job as. List groups and IDs with id <user>. | "1002" |
sunk.coreweave.com/current-working-dir | string | Working directory path for the job execution. | "/path/to/use" |
sunk.coreweave.com/timeout | integer | Time limit in seconds for the Slurm placeholder job. Must be an integer value. | "123" |
Here's an example manifest for a Pod that runs exclusively on a node, so no other Pods or jobs can be scheduled alongside it:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: slurm-test
  namespace: tenant-slurm-staging
  annotations:
    sunk.coreweave.com/account: root
    sunk.coreweave.com/comment: "A Kubernetes job"
    sunk.coreweave.com/exclusive: "none"
spec:
  schedulerName: tenant-slurm-staging-slurm-scheduler # Formatted as <namespace>-<scheduler-name>
  containers:
    - name: main
      image: ubuntu
      command:
        - /bin/bash
      args:
        - -c
        - sleep infinity
      resources:
        limits:
          memory: 960Gi
          rdma/ib: "1"
          nvidia.com/gpu: "8"
        requests:
          cpu: "110"
          memory: 960Gi
          nvidia.com/gpu: "8"
  terminationGracePeriodSeconds: 5
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu.nvidia.com/model
                operator: In
                values:
                  - H100_NVLINK_80GB
```
- Invalid annotation values cause the Pod to fail scheduling.
- Changes to annotations after a Pod is scheduled do not affect the job.
- Slurm's Scheduler is not aware of Kubernetes factors such as taints or required affinities. If there is a conflict, for example between a Slurm scheduling decision and a taint, the Pod may be bound to a Node and then become stuck when the kubelet rejects it. These errors are visible when describing the Pod.
- The Slurm job defaults to the root user (uid=0, gid=0). If a user-id is specified without a corresponding group-id, the group-id is set to the user-id, as in the fragment below.
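For example, a Pod annotated with only a user-id runs its placeholder job with a matching group-id (the value here is illustrative):

```yaml
metadata:
  annotations:
    sunk.coreweave.com/user-id: "1001" # no group-id specified, so the group-id also defaults to 1001
```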
Manage Resources
When scheduling Kubernetes Pods within your Slurm cluster Nodes, it's important to understand how GPU, CPU, and memory resources are allocated when Kubernetes Pods and Slurm jobs coexist on the same Node.
Shared Resources
Kubernetes Pods and Slurm jobs share CPU and memory resources. This allows you to schedule them alongside each other on the same Node, but can lead to errors if configured incorrectly.
To avoid conflicts, lower the resource requests for the `slurmd` container. The default SUNK NodeSets must be changed to effectively schedule Kubernetes Pods via the SUNK Scheduler. Failing to lower the requests leads to `OutOfMemory` and `OutOfcpu` rejections of Pods by the kubelet.
Add a compute definition to the NodeSets to lower the CPU and memory requests:
```yaml
# Lower the requests so k8s workloads can be scheduled onto the nodes.
# Keep the limits high so slurm jobs can still use the full node.
low-requests:
  resources:
    requests:
      cpu: 10
      memory: 10Gi
```
Always account for the shared nature of CPU and memory resources when configuring the Scheduler.
CPU
`slurmd` requests subtract from the CPU resources available to Pods scheduled via the Slurm Scheduler integration. If a Kubernetes Pod requests more CPU than remains on the Node after accounting for `slurmd` requests, the kubelet rejects it with an `OutOfcpu` error. For example, on a Node with 128 CPUs where `slurmd` requests 10, a Pod requesting more than 118 CPUs (less whatever DaemonSets and other system Pods request) cannot be placed.
Memory
If the `slurmd` container is configured with high memory requests, it may prevent Kubernetes Pods from being scheduled on the same Node. Configure the `slurmd` container with lower memory requests, such as 10Gi, to allow Kubernetes Pods to be scheduled alongside Slurm jobs. `OutOfMemory` kubelet rejections of Pods can still occur if the scheduler packs a Node completely full. `MemSpecLimit` can be set in the Slurm NodeSet configuration to reserve memory for `slurmd` and other system Pods (such as DaemonSets). For an example manifest that changes `slurmd` resources, refer to Configure Compute nodes.
The Slurm scheduler is not aware of any Pods scheduled with the standard Kubernetes scheduler. Scheduling Pods (except for DaemonSets) on Nodes running Slurm with anything other than the SUNK Scheduler is highly discouraged.
Oversubscription
Out-of-memory (OOM) errors can occur if oversubscription is allowed. Oversubscription enables multiple Slurm jobs to run on the same Node, even when their combined memory requirements exceed the Node's memory capacity. To prevent OOM errors and ensure system stability, configure memory as a consumable resource by setting `SelectTypeParameters` to `CR_CPU_Memory`, `CR_Core_Memory`, or `CR_Socket_Memory`.
Setting `SelectTypeParameters` to `CR_CPU`, `CR_Core`, or `CR_Socket` is discouraged because these values allow memory oversubscription.
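In a SUNK deployment the Slurm configuration is typically managed through the chart, but the underlying slurm.conf setting looks like the following sketch. The choice of `select/cons_tres` here is an assumption; use whichever consumable-resource select plugin your cluster already runs:

```
# slurm.conf sketch: treat memory as a consumable resource so jobs cannot oversubscribe it
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
```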
GPU resources
GPUs cannot be shared between Kubernetes Pods and Slurm jobs. The Kubernetes k8s-device-plugin and Slurm's GPU allocation mechanisms are incompatible, leading to conflicts if both are used simultaneously.
When scheduling Kubernetes Pods using GPUs, you have the following options:
- When a Pod uses all GPUs in a node, or it is otherwise acceptable to reserve the full node for the job, use `sunk.coreweave.com/exclusive: "none"`.
- To dedicate a specific set of nodes to Kubernetes Pods, create a reservation in Slurm and schedule Kubernetes Pods into it with the `sunk.coreweave.com/reservation` annotation (see the fragment after this list). This annotation also ensures that the Pod is scheduled on a Node with no other Slurm jobs running.
- Create a dedicated user in your Slurm environment for Kubernetes Pods, set exclusivity to be per user, and supply the `user-id` of that user on all Kubernetes Pods. Only other Kubernetes Pods can then share the same node and allocate GPUs. This is the best option when running Kubernetes Pods that request less than a full node's worth of GPUs.
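For the reservation approach, the Pod annotation might look like this fragment. The reservation name `k8s-pods` is an assumption for illustration; the reservation must already exist in Slurm:

```yaml
metadata:
  annotations:
    sunk.coreweave.com/reservation: "k8s-pods" # name of an existing Slurm reservation set aside for Kubernetes Pods
```

The manifest below shows the dedicated-user approach, setting exclusivity per user and supplying the `user-id` of the Kubernetes user: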
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: slurm-test
  namespace: tenant-slurm-staging
  annotations:
    sunk.coreweave.com/account: root
    sunk.coreweave.com/comment: "A single GPU Pod"
    sunk.coreweave.com/exclusive: "user"
    sunk.coreweave.com/user-id: "1001" # The user id of the special Slurm user created for use by Kubernetes jobs
spec:
  schedulerName: tenant-slurm-staging-slurm-scheduler # Formatted as <namespace>-<scheduler-name>
  containers:
    - name: main
      image: ubuntu
      command:
        - /bin/bash
      args:
        - -c
        - sleep infinity
      resources:
        limits:
          memory: 256Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "8"
          memory: 128Gi
          nvidia.com/gpu: "1"
  terminationGracePeriodSeconds: 5
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu.nvidia.com/model
                operator: In
                values:
                  - H100_NVLINK_80GB
```
Time limits
A Slurm job's time limit impacts a Kubernetes Pod scheduled through Slurm differently than a normal Slurm job. The Kubernetes time limit begins when a Pod is scheduled onto a Node. It includes the time spent initializing the Pod before the container is ready and the prolog script starts. Slurm's time limit begins later, after the prolog script is complete and the job is fully running. The Slurm time limit can be specified using the `sunk.coreweave.com/timeout` annotation.
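For example, to give the placeholder job a one-hour time limit, annotate the Pod as in this fragment (the value is illustrative):

```yaml
metadata:
  annotations:
    sunk.coreweave.com/timeout: "3600" # Slurm time limit for the placeholder job, in seconds
```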