Schedule Kubernetes Pods via Slurm

If your Slurm cluster includes a Scheduler component, you can schedule Kubernetes Pods onto your Slurm cluster nodes through Slurm.

Check the scheduler scope

First, ensure the Scheduler component is set up to watch the namespaces where your Pods are created. The Scheduler may be configured to watch one or multiple namespaces, or even the entire Kubernetes cluster. This is set via the scheduler.scope.type and scheduler.scope.namespaces parameters.

To monitor the entire cluster, set scheduler.scope.type to cluster and ensure you have access to the scheduler's ClusterRole deployed with the SUNK chart.

To monitor specific namespaces, set scheduler.scope.type to namespace and set scheduler.scope.namespaces to a comma-separated list of namespaces. If scheduler.scope.namespaces is left blank, the scheduler will default to the namespace it's deployed in.
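
For example, a minimal values sketch for a namespace-scoped scheduler might look like the following. The namespace names are placeholders, and the exact value format may vary by chart version; confirm against your deployment's values.

Example
scheduler:
  scope:
    type: namespace
    namespaces: tenant-a,tenant-b # Comma-separated; leave blank to watch only the release namespace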

Pod configuration

In your Pod's spec, set spec.schedulerName to the name of the Slurm Scheduler. The default name is typically <namespace>-<releaseName>, but it can be set explicitly in the values with scheduler.name, so check your deployment details to find the exact name.

You can check the value of the --scheduler-name parameter on the Scheduler Pod:

Example
$
kubectl get pods -l app.kubernetes.io/name=sunk-scheduler -oyaml | yq '.items[0].spec.containers[] | select(.name == "scheduler").args'
- --health-probe-bind-address=:8081
- --metrics-bind-address=:8080
- --hooks-api-bind-address=:8000
- --zap-log-level=debug
- --components
- scheduler
- --slurm-auth-token=$(SLURM_TOKEN)
- --slurm-api-base=slurm-controller:6817
- --slurm-kill-wait=30s
- --watch-namespace=tenant-sta-example
- --scheduler-name=tenant-sta-example-slurm-scheduler

In this example, the scheduler name is tenant-sta-example-slurm-scheduler.

Your Pods must request specific CPU and memory resources. If these resource requests aren't set or are zero, the Pod will fail to be scheduled. If your workload requires GPUs, define GPU resource requests. You may also set a Node affinity for gpu.nvidia.com/class to specify the GPU type. For example:

Example
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: gpu.nvidia.com/class
              operator: In
              values:
                - A100_PCIE_80GB

The Pod's terminationGracePeriodSeconds must be less than the configured --slurm-kill-wait value passed to the scheduler, minus 5s. For example, if --slurm-kill-wait=30s, the maximum terminationGracePeriodSeconds is 24, because 24 < 30 - 5. Pods that fail this check will not be scheduled and must be recreated with a valid terminationGracePeriodSeconds.

When deploying via the Slurm chart, the value passed to the scheduler is configured to match the value of KillWait in the Slurm config. If the Slurm config is modified outside of the chart, the scheduler deployment must also be updated with the new value.
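
To confirm the current KillWait value, you can query the running Slurm configuration from a login node or any environment with Slurm client access. The output below assumes a 30-second KillWait.

Example
$
scontrol show config | grep -i killwait
KillWait                = 30 sec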

Optional annotations can be added to a Pod to configure Slurm-specific settings, such as the account to use, job priority, or the partition. Available annotations include:

Example
annotations:
  sunk.coreweave.com/account: "root" # The account to use; list accounts with `sacctmgr list account`
  sunk.coreweave.com/comment: "A Kubernetes job" # A comment to add to the job
  sunk.coreweave.com/constraints: # Constraints to use; see https://slurm.schedmd.com/sbatch.html#OPT_constraint
  sunk.coreweave.com/exclusive: "true" # Run the job in exclusive mode; see https://slurm.schedmd.com/sbatch.html#OPT_exclusive
  sunk.coreweave.com/excluded-nodes: "a100-092-01,a100-092-02" # A comma-separated list of Slurm nodes to exclude
  sunk.coreweave.com/required-nodes: "a100-092-04,a100-092-06" # A comma-separated list of Slurm nodes to require
  sunk.coreweave.com/partition: "default" # The partition to use
  sunk.coreweave.com/prefer: # Preferences to use; see https://slurm.schedmd.com/sbatch.html#OPT_prefer
  sunk.coreweave.com/priority: "100" # The integer priority to use; non-integer values will fail scheduling
  sunk.coreweave.com/qos: "normal" # The QOS to use
  sunk.coreweave.com/reservation: "test" # The reservation to use
  sunk.coreweave.com/user-id: "1001" # The user-id to use; list users and user-ids with `id <user>`
  sunk.coreweave.com/group-id: "1002" # The group-id to use; list groups and group-ids with `id <user>`
  sunk.coreweave.com/current-working-dir: "/path/to/use" # The current working directory to use for the job
  sunk.coreweave.com/timeout: "123" # The time limit (in seconds) for the Slurm placeholder job; this must be an integer

Here's an example of a manifest for a Pod that uses a Slurm Scheduler:

Example
apiVersion: v1
kind: Pod
metadata:
  name: slurm-test
  namespace: tenant-slurm-staging
  annotations:
    sunk.coreweave.com/account: root
    sunk.coreweave.com/comment: "A Kubernetes job"
spec:
  schedulerName: tenant-slurm-staging-slurm-scheduler # Formatted as <namespace>-<scheduler-name>
  containers:
    - name: main
      image: ubuntu
      command:
        - /bin/bash
      args:
        - -c
        - sleep infinity
      resources:
        limits:
          memory: 960Gi
          rdma/ib: "1"
          nvidia.com/gpu: "8"
        requests:
          cpu: "110"
          memory: 960Gi
          nvidia.com/gpu: "8"
  terminationGracePeriodSeconds: 5
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu.nvidia.com/model
                operator: In
                values:
                  - H100_NVLINK_80GB

Important
  • Invalid annotation values cause the Pod to fail scheduling.
  • Changes to annotations after a Pod is scheduled do not affect the job.
  • Slurm's Scheduler is not aware of Kubernetes factors such as taints or required affinities. If there is a conflict between Slurm's Scheduler and a taint, for example, a Pod may become stuck after being bound.
  • The Slurm job defaults to the root user (uid=0, gid=0). If a user-id is specified without a corresponding group-id, the group-id will be set to the user-id.

Resource Usage

When Kubernetes Pods and Slurm jobs coexist on the same Node, it's important to understand how GPU, CPU, and memory resources are allocated between them.

GPU

GPUs cannot be shared between Kubernetes Pods and Slurm jobs. The Kubernetes k8s-device-plugin and Slurm's GPU allocation mechanisms are incompatible, leading to conflicts if both are used simultaneously.

To avoid contention, it's recommended to use Slurm reservations when scheduling Kubernetes Pods that require GPUs. For example, you can create a Slurm reservation called gpu-ci for Kubernetes Pods running CI jobs. This ensures that no other Slurm job can be assigned to that reservation unless explicitly requested with sbatch --reservation.

Setting sunk.coreweave.com/exclusive to true in the Pod annotations also ensures that the Pod is scheduled on a Node with no other Slurm jobs running.
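
For example, a reservation for CI Pods can be created with scontrol and then referenced from the Pod's annotations. This is only a sketch: the reservation name, node names, user, and duration are placeholders to adapt to your cluster.

Example
$
scontrol create reservation ReservationName=gpu-ci Users=root StartTime=now Duration=UNLIMITED Nodes=a100-092-01,a100-092-02

Pods that should run inside the reservation then reference it, optionally combined with exclusive mode:

Example
annotations:
  sunk.coreweave.com/reservation: "gpu-ci"
  sunk.coreweave.com/exclusive: "true" # Optional: also keeps other Slurm jobs off the Node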

CPU

CPUs are shared between Kubernetes Pods and Slurm jobs, allowing both to be scheduled on the same Node without conflict. However, to ensure Kubernetes Pods can be scheduled alongside Slurm jobs, the CPU resource requests for the slurmd container should be lowered.

Memory

Memory is shared between Kubernetes Pods and Slurm jobs. If the slurmd container is configured with high memory requests, it may prevent Kubernetes Pods from being scheduled on the same Node. Configuring the slurmd container with lower memory requests, such as 50% of available memory, allows Kubernetes Pods to be scheduled alongside Slurm jobs.
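
As a sketch only: the exact values key depends on your SUNK chart version and compute node definitions, but the intent is to give the slurmd container modest CPU and memory requests (for example, roughly 50% of node memory) so headroom remains for Kubernetes Pods. The key path below is hypothetical; check your chart's values for the real location of the slurmd resources block.

Example
# Hypothetical values layout; adjust the key path to your chart's compute node definition.
compute:
  nodes:
    - name: h100
      resources:
        requests:
          cpu: "16"       # Leave most CPUs available for Kubernetes Pods
          memory: 480Gi   # ~50% of a 960Gi node, per the guidance above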

Oversubscription

Out-of-memory (OOM) errors can occur if oversubscription is allowed. Oversubscription enables multiple Slurm jobs to run on the same Node, even when their combined memory requirements exceed the Node's memory capacity. To prevent OOM errors and ensure system stability, configure memory as a consumable resource by setting SelectTypeParameters to CR_CPU_Memory, CR_Core_Memory, or CR_Socket_Memory.
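
For example, in slurm.conf (or the chart values that render it), memory becomes a consumable resource with a configuration like the following, assuming the cons_tres select plugin is in use:

Example
# Track memory as a consumable resource so jobs cannot oversubscribe it.
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory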

Warning

Setting SelectTypeParameters to CR_CPU, CR_Core, or CR_Socket is discouraged because these values allow memory oversubscription.

Time limits

A Slurm job's time limit affects a Kubernetes Pod scheduled through Slurm differently than a normal Slurm job. For a Pod, the time limit begins when the Pod is scheduled onto a Node, and it includes the time spent initializing the Pod before the container is ready and the prolog script starts. For a normal Slurm job, the time limit begins later, after the prolog script completes and the job is fully running. The Slurm time limit can be specified using the sunk.coreweave.com/timeout annotation.
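
For example, to give the Slurm placeholder job a one-hour time limit, counted from when the prolog completes, set the annotation value in seconds:

Example
metadata:
  annotations:
    sunk.coreweave.com/timeout: "3600" # 1 hour; must be an integer number of seconds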