Schedule Kubernetes Pods via Slurm
If your Slurm cluster includes a Scheduler component, you can schedule Kubernetes Pods onto the Nodes of your Slurm cluster.
Check the scheduler scope
First, ensure the Scheduler component is set up to watch the namespaces where your Pods are created. The Scheduler may be configured to watch one or multiple namespaces, or even the entire Kubernetes cluster. This is set via the `scheduler.scope.type` and `scheduler.scope.namespaces` parameters.
To monitor the entire cluster, set `scheduler.scope.type` to `cluster` and ensure you have access to the scheduler's ClusterRole deployed with the SUNK chart.
To monitor specific namespaces, set `scheduler.scope.type` to `namespace` and set `scheduler.scope.namespaces` to a comma-separated list of namespaces. If `scheduler.scope.namespaces` is left blank, the scheduler defaults to the namespace it's deployed in.
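For example, a namespace-scoped scheduler could be configured with chart values along these lines (the tenant namespace names here are placeholders):

```yaml
scheduler:
  scope:
    type: namespace               # or "cluster" to watch the entire Kubernetes cluster
    namespaces: tenant-a,tenant-b # comma-separated list of namespaces to watch
```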
Pod configuration
In your Pod's `spec`, assign `spec.schedulerName` to match the name of the Slurm Scheduler. The default name is typically `<namespace>-<releaseName>`, but it can be set manually in the values with `scheduler.name`, so check your deployment details to find the exact name.
You can check the value of the `--scheduler-name` parameter on the Scheduler Pod:
```
$ kubectl get pods -l app.kubernetes.io/name=sunk-scheduler -oyaml | yq '.items[0].spec.containers[] | select(.name == "scheduler").args'
- --health-probe-bind-address=:8081
- --metrics-bind-address=:8080
- --hooks-api-bind-address=:8000
- --zap-log-level=debug
- --components
- scheduler
- --slurm-auth-token=$(SLURM_TOKEN)
- --slurm-api-base=slurm-controller:6817
- --slurm-kill-wait=30s
- --watch-namespace=tenant-sta-example
- --scheduler-name=tenant-sta-example-slurm-scheduler
```
In this example, the scheduler name is `tenant-sta-example-slurm-scheduler`.
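Once you know the scheduler name, a minimal Pod spec fragment pointing at it looks like this (using the name from the example above):

```yaml
spec:
  schedulerName: tenant-sta-example-slurm-scheduler
```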
Your Pods must request specific CPU and memory resources. If these resource requests aren't set or are zero, the Pod fails to be scheduled. If your workload requires GPUs, define GPU resource requests. You may also set a Node affinity for `gpu.nvidia.com/class` to specify the GPU type. For example:
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: gpu.nvidia.com/class
              operator: In
              values:
                - A100_PCIE_80GB
```
The Pod's `terminationGracePeriodSeconds` must be less than the configured `--slurm-kill-wait` value passed to the scheduler minus 5 seconds. For example, if `--slurm-kill-wait=30s`, then the maximum `terminationGracePeriodSeconds` is `24`, because `24 < 30 - 5`. Pods that fail this check will not be scheduled and must be recreated with a valid `terminationGracePeriodSeconds`.
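For instance, with `--slurm-kill-wait=30s` as in the example above, the largest accepted value is:

```yaml
spec:
  terminationGracePeriodSeconds: 24 # must satisfy 24 < 30 - 5
```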
When deploying via the Slurm chart, the value passed to the scheduler is configured to match the value of `KillWait` in the Slurm config. If the Slurm config is modified outside of the chart, the scheduler deployment must also be updated with the new value.
Optional annotations can be added to a Pod to configure Slurm-specific settings, such as the account to use, job priority, or the partition. Available annotations include:
```yaml
annotations:
  sunk.coreweave.com/account: "root" # The account to use; list accounts with `sacctmgr list account`
  sunk.coreweave.com/comment: "A Kubernetes job" # A comment to add to the job
  sunk.coreweave.com/constraints: # Constraints to use; see https://slurm.schedmd.com/sbatch.html#OPT_constraint
  sunk.coreweave.com/exclusive: "true" # Run the job in exclusive mode; see https://slurm.schedmd.com/sbatch.html#OPT_exclusive
  sunk.coreweave.com/excluded-nodes: "a100-092-01,a100-092-02" # A comma-separated list of Slurm nodes to exclude
  sunk.coreweave.com/required-nodes: "a100-092-04,a100-092-06" # A comma-separated list of Slurm nodes to require
  sunk.coreweave.com/partition: "default" # The partition to use
  sunk.coreweave.com/prefer: # Preferences to use; see https://slurm.schedmd.com/sbatch.html#OPT_prefer
  sunk.coreweave.com/priority: "100" # The integer priority to use; non-integer values will fail scheduling
  sunk.coreweave.com/qos: "normal" # The QOS to use
  sunk.coreweave.com/reservation: "test" # The reservation to use
  sunk.coreweave.com/user-id: "1001" # The user-id to use; list users and user-ids with `id <user>`
  sunk.coreweave.com/group-id: "1002" # The group-id to use; list groups and group-ids with `id <user>`
  sunk.coreweave.com/current-working-dir: "/path/to/use" # The current working directory to use for the job
  sunk.coreweave.com/timeout: "123" # The time limit (in seconds) for the Slurm placeholder job; this must be an integer
```
Here's an example of a manifest for a Pod that uses a Slurm Scheduler:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: slurm-test
  namespace: tenant-slurm-staging
  annotations:
    sunk.coreweave.com/account: root
    sunk.coreweave.com/comment: "A Kubernetes job"
spec:
  schedulerName: tenant-slurm-staging-slurm-scheduler # Formatted as <namespace>-<scheduler-name>
  containers:
    - name: main
      image: ubuntu
      command:
        - /bin/bash
      args:
        - -c
        - sleep infinity
      resources:
        limits:
          memory: 960Gi
          rdma/ib: "1"
          nvidia.com/gpu: "8"
        requests:
          cpu: "110"
          memory: 960Gi
          nvidia.com/gpu: "8"
  terminationGracePeriodSeconds: 5
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu.nvidia.com/model
                operator: In
                values:
                  - H100_NVLINK_80GB
```
- Invalid annotation values cause the Pod to fail scheduling.
- Changes to annotations after a Pod is scheduled do not affect the job.
- Slurm's Scheduler is not aware of Kubernetes factors such as taints or required affinities. If there is a conflict between Slurm's Scheduler and a taint, for example, a Pod may become stuck after being bound.
- The Slurm job defaults to the root user (uid=0, gid=0). If a user-id is specified without a corresponding group-id, the group-id will be set to the user-id; see the example below.
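For example, to run the job as a specific non-root user and group (the IDs below are the illustrative values from the annotation list above):

```yaml
metadata:
  annotations:
    sunk.coreweave.com/user-id: "1001"  # if group-id were omitted, it would default to 1001
    sunk.coreweave.com/group-id: "1002"
```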
Resource usage
When scheduling Kubernetes Pods within your Slurm cluster Nodes, it's important to understand how GPU, CPU, and memory resources are allocated when Kubernetes Pods and Slurm jobs coexist on the same Node.
GPU
GPUs cannot be shared between Kubernetes Pods and Slurm jobs. The Kubernetes k8s-device-plugin and Slurm's GPU allocation mechanisms are incompatible, leading to conflicts if both are used simultaneously.
To avoid contention, it's recommended to use Slurm reservations when scheduling Kubernetes Pods that require GPUs. For example, you can create a Slurm reservation called `gpu-ci` for Kubernetes Pods running CI jobs. This ensures that no other Slurm job can be assigned to that reservation unless explicitly requested with `sbatch --reservation`.
Setting `sunk.coreweave.com/exclusive` to `true` in the Pod annotations also ensures that the Pod is scheduled on a Node with no other Slurm jobs running.
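Putting these together, a GPU Pod might carry annotations like the following, assuming a reservation named `gpu-ci` has already been created in Slurm:

```yaml
metadata:
  annotations:
    sunk.coreweave.com/reservation: "gpu-ci" # assumes this reservation already exists in Slurm
    sunk.coreweave.com/exclusive: "true"     # no other Slurm jobs on the Node
```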
CPU
CPUs are shared between Kubernetes Pods and Slurm jobs, allowing both to be scheduled on the same Node without conflict. However, to ensure Kubernetes Pods can be scheduled alongside Slurm jobs, the CPU resource requests for the `slurmd` container should be lowered.
Memory
Memory is shared between Kubernetes Pods and Slurm jobs. If the `slurmd` container is configured with high memory requests, it may prevent Kubernetes Pods from being scheduled on the same Node. Configuring the `slurmd` container with lower memory requests, such as 50% of available memory, allows Kubernetes Pods to be scheduled alongside Slurm jobs.
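As an illustration only, the `slurmd` container's requests on a 960Gi Node might be reduced to roughly half of the available memory so that Pods can fit alongside Slurm jobs. The exact location of these settings in your SUNK chart values depends on your deployment; the numbers below are hypothetical:

```yaml
# Hypothetical slurmd container resources; adjust to your chart and Node sizes.
resources:
  requests:
    cpu: "16"     # leave most CPUs requestable by Kubernetes Pods
    memory: 480Gi # roughly 50% of a 960Gi Node
  limits:
    memory: 960Gi
```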
Oversubscription
Out-of-memory (OOM) errors can occur if oversubscription is allowed. Oversubscription enables multiple Slurm jobs to run on the same Node, even when their combined memory requirements exceed the Node's memory capacity. To prevent OOM errors and ensure system stability, configure memory as a consumable resource by setting `SelectTypeParameters` to `CR_CPU_Memory`, `CR_Core_Memory`, or `CR_Socket_Memory`.
Setting `SelectTypeParameters` to `CR_CPU`, `CR_Core`, or `CR_Socket` is discouraged because these values allow memory oversubscription.
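In `slurm.conf`, that corresponds to something like the following (shown here with the `cons_tres` select plugin; adapt to your cluster's configuration):

```
# slurm.conf (illustrative): treat memory as a consumable resource
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
```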
Time limits
A Slurm job's time limit affects a Kubernetes Pod scheduled through Slurm differently than it does a normal Slurm job. On the Kubernetes side, the time limit clock starts when the Pod is scheduled onto a Node, and it includes the time spent initializing the Pod before the container is ready and the prolog script starts. Slurm's time limit begins later, after the prolog script completes and the job is fully running. The Slurm time limit can be specified with the `sunk.coreweave.com/timeout` annotation.
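For example, to give the Slurm placeholder job a one-hour limit (remembering that the Kubernetes-side clock starts earlier, at Pod scheduling time):

```yaml
metadata:
  annotations:
    sunk.coreweave.com/timeout: "3600" # seconds; must be an integer
```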