Scheduling Kubernetes Pods with Slurm
This conceptual explanation is for customers who want to schedule Kubernetes Pods with the Slurm Scheduler and learn about the benefits of doing so.
If your Slurm cluster includes a Scheduler component (which is part of SUNK by default), you can manage Kubernetes Pod scheduling within your Slurm cluster nodes, allowing for a unified approach to workload management.
Kubernetes Nodes are capitalized as proper nouns in this documentation, while Slurm nodes are lowercase.
In SUNK, each Slurm node is a Kubernetes Pod running a `slurmd` container. This design enables dynamic provisioning and scaling of Slurm nodes using native Kubernetes mechanisms. Note that Slurm nodes (Pods) are distinct from Kubernetes Nodes, which are the underlying worker machines that host these Pods.
In a typical research cluster, jobs are submitted to Slurm and managed with familiar commands like `srun` and `sinfo`. However, some workloads, such as real-time inference, are better suited to run as standalone Kubernetes Pods. Managing Kubernetes and Slurm clusters separately can be cumbersome, and dynamically moving Nodes between them is often impractical. To address this, SUNK includes the SUNK Scheduler, a custom Kubernetes scheduler that lets users schedule both Slurm jobs and Kubernetes Pods in the same cluster, so workloads such as training and inference can coexist seamlessly. The SUNK Scheduler is enabled by default.
As a prerequisite for scheduling Kubernetes Pods with Slurm, the Slurm cluster must be deployed with the SUNK Scheduler. When a Pod is marked for scheduling via the SUNK Scheduler, the scheduler creates a placeholder job in Slurm that the Slurm scheduler can manage, then allocates the Pod to the Node selected by Slurm.
The SUNK Scheduler operates as a conventional Kubernetes reconciler, watching Pod objects and events from the Slurm Controller. It ensures that Kubernetes Pods are scheduled on the same Kubernetes Nodes as Slurm nodes; that the state of the Slurm cluster is synchronized with Kubernetes; and that events in Kubernetes are propagated into Slurm. It can preempt Kubernetes workloads in favor of Slurm jobs running on the same Node, or vice versa, using the same logic.
Benefits
Kubernetes and Slurm are both powerful workload management and scheduling tools, each with different strengths. Kubernetes excels at container orchestration and microservice management, while Slurm is designed for high-performance computing (HPC) workloads and job scheduling.
By using SUNK to integrate Slurm with Kubernetes, you can:
- Leverage the strengths of both systems. Use Kubernetes for container orchestration and Slurm for HPC job scheduling.
- Gain flexibility. Run Kubernetes and Slurm workloads side by side in the same cluster, such as training with Slurm and serving inference with Kubernetes, while sharing compute resources to maximize efficiency.
- Simplify cluster management. Manage both Kubernetes and Slurm workloads from a single platform.
Limitations
When scheduling Kubernetes Pods on the same Nodes as Slurm jobs, be aware of the following limitations:
- Node sharing: Either a Slurm job or a Kubernetes Pod can run on a given Node, but not both at the same time.
- Memory availability: Avoid setting high memory requests for the `slurmd` container, as this can reduce scheduling flexibility for both Slurm jobs and Kubernetes Pods. When memory requests are too high, fewer workloads can be placed on a Node, leading to underutilization and longer queue or wait times. Configuring more conservative memory requests (for example, 50% of the Node's memory) allows for more efficient use of resources across the cluster.
Check the scheduler scope
First, ensure the Scheduler component is set up to watch the namespaces where your Pods are created. The Scheduler may be configured to watch one or multiple namespaces, or even the entire Kubernetes cluster. Set this via the `scheduler.scope.type` and `scheduler.scope.namespaces` parameters.
To monitor the entire cluster, set `scheduler.scope.type` to `cluster` and ensure you have access to the scheduler's ClusterRole deployed with the SUNK chart.
To monitor specific namespaces, set `scheduler.scope.type` to `namespace` and set `scheduler.scope.namespaces` to a comma-separated list of namespaces. If `scheduler.scope.namespaces` is left blank, the scheduler defaults to the namespace it's deployed in.
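For example, here's a sketch of the relevant Helm values for watching two specific namespaces (the namespace names are placeholders, and the exact values layout depends on your SUNK deployment):
```yaml
scheduler:
  scope:
    type: namespace                 # or "cluster" to watch the entire Kubernetes cluster
    namespaces: tenant-a,tenant-b   # comma-separated list; defaults to the scheduler's namespace if blank
```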
Pod configuration
In your Pod's `spec`, set `spec.schedulerName` to match the name of the Slurm Scheduler. The default name is typically `<namespace>-<releaseName>`, but it can be set manually in the values with `scheduler.name`, so check your deployment details to find the exact name.
You can check the value of the `--scheduler-name` parameter on the Scheduler Pod:
```
$ kubectl get pods -l app.kubernetes.io/name=sunk-scheduler -oyaml | yq '.items[0].spec.containers[] | select(.name == "scheduler").args'
- --health-probe-bind-address=:8081
- --metrics-bind-address=:8080
- --hooks-api-bind-address=:8000
- --zap-log-level=debug
- --components
- scheduler
- --slurm-auth-token=$(SLURM_TOKEN)
- --slurm-api-base=slurm-controller:6817
- --slurm-kill-wait=30s
- --watch-namespace=tenant-sta-example
- --scheduler-name=tenant-sta-example-slurm-scheduler
```
In this example, the scheduler name is `tenant-sta-example-slurm-scheduler`.
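With that value in hand, point your Pod at the scheduler. The name below comes from the example output above; substitute your own:
```yaml
spec:
  schedulerName: tenant-sta-example-slurm-scheduler
```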
Your Pods must request specific CPU and memory resources. If these resource requests aren't set or are zero, the Pod will fail to be scheduled. If your workload requires GPUs, define GPU resource requests. You may also set a Node affinity for `gpu.nvidia.com/class` to specify the GPU type. For example:
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: gpu.nvidia.com/class
              operator: In
              values:
                - A100_PCIE_80GB
```
Certain resources, such as CPU and memory, are shared between Kubernetes and Slurm. To avoid errors, configure your workload's resources with both requirements in mind. See Manage Resources for more information.
The Pod's `terminationGracePeriodSeconds` must be less than the configured `--slurm-kill-wait` value passed to the scheduler minus 5 seconds. For example, if `--slurm-kill-wait=30s`, the maximum `terminationGracePeriodSeconds` is `24`, because 24 < 30 - 5. Pods that fail this check will not be scheduled and must be recreated with a valid `terminationGracePeriodSeconds`.
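For instance, with `--slurm-kill-wait=30s`, an illustrative Pod spec fragment that passes the check:
```yaml
spec:
  # Must satisfy: terminationGracePeriodSeconds < slurm-kill-wait (30s) - 5s
  terminationGracePeriodSeconds: 24
```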
When deploying via the Slurm chart, the value passed to the scheduler is configured to match the value of `KillWait` in the Slurm config. If the Slurm config is modified outside of the chart, the scheduler deployment will also need to be updated with the new value.
Optional annotations can be added to a Pod to configure Slurm-specific settings, such as the account to use, job priority, or the partition. Available annotations include:
```yaml
annotations:
  sunk.coreweave.com/account: "root" # The account to use; list accounts with `sacctmgr list account`
  sunk.coreweave.com/comment: "A Kubernetes job" # A comment to add to the job
  sunk.coreweave.com/constraints: # Constraints to use; see https://slurm.schedmd.com/sbatch.html#OPT_constraint
  sunk.coreweave.com/exclusive: "true" # Run job in exclusive mode; see https://slurm.schedmd.com/sbatch.html#OPT_exclusive
  sunk.coreweave.com/excluded-nodes: "a100-092-01,a100-092-02" # A comma-separated list of Slurm nodes to exclude
  sunk.coreweave.com/required-nodes: "a100-092-04,a100-092-06" # A comma-separated list of Slurm nodes to require
  sunk.coreweave.com/partition: "default" # The partition to use
  sunk.coreweave.com/prefer: # Preferences to use; see https://slurm.schedmd.com/sbatch.html#OPT_prefer
  sunk.coreweave.com/priority: "100" # The integer priority to use; non-integer values will fail scheduling
  sunk.coreweave.com/qos: "normal" # The qos to use
  sunk.coreweave.com/reservation: "test" # The reservation to use
  sunk.coreweave.com/user-id: "1001" # The user-id to use; list users and user-ids with `id <user>`
  sunk.coreweave.com/group-id: "1002" # The group-id to use; list groups and group-ids with `id <user>`
  sunk.coreweave.com/current-working-dir: "/path/to/use" # The current working directory to use for the job
  sunk.coreweave.com/timeout: "123" # The time limit (in seconds) for the Slurm placeholder job; this must be an integer
```
Here's an example of a manifest for a Pod that uses a Slurm Scheduler:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: slurm-test
  namespace: tenant-slurm-staging
  annotations:
    sunk.coreweave.com/account: root
    sunk.coreweave.com/comment: "A Kubernetes job"
spec:
  schedulerName: tenant-slurm-staging-slurm-scheduler # Formatted as <namespace>-<scheduler-name>
  containers:
    - name: main
      image: ubuntu
      command:
        - /bin/bash
      args:
        - -c
        - sleep infinity
      resources:
        limits:
          memory: 960Gi
          rdma/ib: "1"
          nvidia.com/gpu: "8"
        requests:
          cpu: "110"
          memory: 960Gi
          nvidia.com/gpu: "8"
  terminationGracePeriodSeconds: 5
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu.nvidia.com/model
                operator: In
                values:
                  - H100_NVLINK_80GB
```
- Invalid annotation values cause the Pod to fail scheduling.
- Changes to annotations after a Pod is scheduled do not affect the job.
- Slurm's Scheduler is not aware of Kubernetes factors such as taints or required affinities. If there is a conflict between Slurm's Scheduler and a taint, for example, a Pod may become stuck after being bound.
- The Slurm job defaults to the root user (uid=0, gid=0). If a user-id is specified without a corresponding group-id, the group-id will be set to the user-id.
Manage Resources
When scheduling Kubernetes Pods within your Slurm cluster, it's important to understand how GPU, CPU, and memory resources are allocated when Kubernetes Pods and Slurm jobs coexist on the same Node.
Shared Resources
Kubernetes Pods and Slurm jobs share CPU and memory resources. This allows you to schedule them alongside each other on the same Node, but can lead to errors if configured incorrectly.
To avoid conflicts, lower the resource requests for the `slurmd` container.
In the following example, we add a Compute definition to the NodeSets to lower the CPU and memory requests:
```yaml
# Lower the requests so k8s workloads can be scheduled onto the nodes.
# Keep the limits high so slurm jobs can still use the full node.
low-requests:
  resources:
    requests:
      cpu: 10
      memory: 10Gi
```
Always account for the shared nature of CPU and memory resources when configuring the Scheduler.
CPU
`slurmd` requests subtract from the CPU resources available for Slurm jobs. If a Slurm job requires CPU resources beyond those available to the Kubernetes Pod after accounting for `slurmd` requests, it can lead to an Out-of-CPU (OOCPU) error in Kubernetes.
Memory
If the `slurmd` container is configured with high memory requests, it may prevent Kubernetes Pods from being scheduled on the same Node. Configure the `slurmd` container with lower memory requests, such as 50% of available memory, to allow Kubernetes Pods to be scheduled alongside Slurm jobs.
Oversubscription
Out-of-memory (OOM) errors can occur if oversubscription is allowed. Oversubscription enables multiple Slurm jobs to run on the same Node, even when their combined memory requirements exceed the Node's memory capacity. To prevent OOM errors and ensure system stability, configure memory as a consumable resource by setting `SelectTypeParameters` to `CR_CPU_Memory`, `CR_Core_Memory`, or `CR_Socket_Memory`.
Setting `SelectTypeParameters` to `CR_CPU`, `CR_Core`, or `CR_Socket` is discouraged because these values allow memory oversubscription.
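For example, a minimal Slurm configuration fragment that treats memory as a consumable resource (shown here as raw slurm.conf settings; in SUNK, the Slurm config is typically managed through the chart):
```
# slurm.conf fragment: allocate cores and memory as consumable resources
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
```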
GPU resources
GPUs cannot be shared between Kubernetes Pods and Slurm jobs. The Kubernetes device plugin (`k8s-device-plugin`) and Slurm's GPU allocation mechanisms are incompatible, and using both simultaneously leads to conflicts.
To avoid contention, it's recommended to use Slurm reservations when scheduling Kubernetes Pods that require GPUs. For example, you can create a Slurm reservation called `gpu-ci` for Kubernetes Pods running CI jobs. This ensures that no other Slurm job can be assigned to that reservation unless it is explicitly requested with `sbatch --reservation`.
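As a sketch, such a reservation could be created with `scontrol` (the node names, user, and duration below are placeholders) and then referenced from the Pod via the `sunk.coreweave.com/reservation` annotation described earlier:
```
# Create a reservation that Kubernetes GPU Pods can target (illustrative values)
scontrol create reservation ReservationName=gpu-ci Users=root \
  StartTime=now Duration=UNLIMITED Nodes=a100-092-04,a100-092-06
```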
Setting `sunk.coreweave.com/exclusive` to `true` in the Pod annotations also ensures that the Pod is scheduled on a Node with no other Slurm jobs running.
Time limits
A Slurm job's time limit impacts a Kubernetes Pod scheduled through Slurm differently than a normal Slurm job. The Kubernetes time limit begins when a Pod is scheduled onto a Node. It includes the time spent initializing the Pod before the container is ready and the prolog script starts. Slurm's time limit begins later, after the prolog script is complete and the job is fully running. The Slurm time limit can be specified using the sunk.coreweave.com/timeout
annotation.
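For example (illustrative value; this is the same annotation listed in the annotation list above):
```yaml
metadata:
  annotations:
    # Applies to the Slurm placeholder job; its clock starts after the prolog completes,
    # while Kubernetes-side time starts when the Pod is bound to a Node.
    sunk.coreweave.com/timeout: "3600"
```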