Schedule Kubernetes Pods with Slurm
This conceptual explanation is for customers who want to schedule Kubernetes Pods with the Slurm scheduler and understand the benefits of doing so.
If your Slurm cluster includes a Scheduler component (which is part of SUNK by default), you can manage Kubernetes Pod scheduling within your Slurm cluster nodes, allowing for a unified approach to workload management.
Kubernetes Nodes are capitalized as proper nouns in this documentation, while Slurm nodes are lowercase.
In SUNK, each Slurm node is a Kubernetes Pod running a `slurmd` container. This design enables dynamic provisioning and scaling of Slurm nodes using native Kubernetes mechanisms. Note that Slurm nodes (Pods) are distinct from Kubernetes Nodes, which are the underlying worker machines that host these Pods.
In a typical research cluster, jobs are submitted to and monitored in Slurm with familiar commands like `srun` and `sinfo`. However, some workloads, such as real-time inference, are better suited to run as standalone Kubernetes Pods. Managing Kubernetes and Slurm clusters separately can be cumbersome, and dynamically moving Nodes between them is often impractical. To address this, SUNK includes the SUNK Scheduler, a custom Kubernetes scheduler developed by CoreWeave that lets users schedule both Slurm jobs and Kubernetes Pods in the same cluster, so that training and inference workloads can coexist seamlessly. The SUNK Scheduler is enabled by default.
As a prerequisite for scheduling Kubernetes Pods with Slurm, the Slurm cluster must be deployed with the SUNK Scheduler. When a Pod is marked for scheduling via the SUNK Scheduler, the scheduler creates a placeholder job in Slurm that the Slurm scheduler can manage, then binds the Pod to the Node selected by Slurm.
The SUNK Scheduler operates as a conventional Kubernetes reconciler, watching Pod objects and events from the Slurm Controller. It ensures that Kubernetes Pods are scheduled on the same Kubernetes Nodes as Slurm nodes; that the state of the Slurm cluster is synchronized with Kubernetes; and that events in Kubernetes are propagated into Slurm. It can preempt Kubernetes workloads in favor of Slurm jobs running on the same Node, or vice versa, using the same logic.
Benefits
Kubernetes and Slurm are both powerful workload management and scheduling tools, each with different strengths. Kubernetes excels at container orchestration and microservice management, while Slurm is designed for high-performance computing (HPC) workloads and job scheduling.
By using SUNK to integrate Slurm with Kubernetes, you can:
- Leverage the strengths of both systems. Use Kubernetes for container orchestration and Slurm for HPC job scheduling.
- Gain flexibility. Run Kubernetes and Slurm workloads side by side in the same cluster, such as training with Slurm and serving inference with Kubernetes, while sharing compute resources to maximize efficiency.
- Simplify cluster management. Manage both Kubernetes and Slurm workloads from a single platform.
Limitations
When scheduling Kubernetes Pods on the same Nodes as Slurm jobs, be aware of the following limitations:
- Node sharing: Either a Slurm job or a Kubernetes Pod can run on a given Node, but not both at the same time.
- Memory availability: Avoid setting high memory requests for the `slurmd` container, as this can reduce scheduling flexibility for both Slurm jobs and Kubernetes Pods. When memory requests are too high, fewer workloads can be placed on a Node, leading to underutilization and longer queue or wait times. Configuring more conservative memory requests (e.g., 50% of the Node's memory) allows for more efficient use of resources across the cluster.
Check the scheduler scope
First, ensure the Scheduler component is set up to watch the namespaces where your Pods are created. The Scheduler may be configured to watch one or multiple namespaces, or even the entire Kubernetes cluster. Set this via the `scheduler.scope.type` and `scheduler.scope.namespaces` parameters.
To monitor the entire cluster, set `scheduler.scope.type` to `cluster` and ensure you have access to the scheduler's ClusterRole deployed with the SUNK chart.
To monitor specific namespaces, set `scheduler.scope.type` to `namespace` and set `scheduler.scope.namespaces` to a comma-separated list of namespaces. If `scheduler.scope.namespaces` is left blank, the scheduler defaults to the namespace it's deployed in.
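For example, a values fragment that scopes the Scheduler to two namespaces might look like the following sketch. The namespace names are placeholders, and the exact layout should be checked against your SUNK chart's values schema:

```yaml
scheduler:
  scope:
    type: namespace                  # or "cluster" to watch the entire Kubernetes cluster
    namespaces: "tenant-a,tenant-b"  # comma-separated; leave blank to default to the scheduler's own namespace
```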
Pod configuration
In your Pod's `spec`, assign `spec.schedulerName` to match the name of the Slurm Scheduler. The default name is typically `<namespace>-<releaseName>`, but it can be set manually in the values with `scheduler.name`, so check your deployment details to find the exact name.
You can check the value of the `--scheduler-name` parameter on the Scheduler Pod:
```
$ kubectl get pods -l app.kubernetes.io/name=sunk-scheduler -oyaml | yq '.items[0].spec.containers[] | select(.name == "scheduler").args'
- --health-probe-bind-address=:8081
- --metrics-bind-address=:8080
- --hooks-api-bind-address=:8000
- --zap-log-level=debug
- --components
- scheduler
- --slurm-auth-token=$(SLURM_TOKEN)
- --slurm-api-base=slurm-controller:6817
- --slurm-kill-wait=30s
- --watch-namespace=slurm-staging
- --scheduler-name=tenant-slurm-staging-slurm-scheduler
```
In this example, the scheduler name is `tenant-slurm-staging-slurm-scheduler`.
Your Pods must request specific CPU and memory resources. If these resource requests aren't set or are zero, the Pod will fail to be scheduled. If your workload requires GPUs, define GPU resource requests. You may also set a Node affinity for `gpu.nvidia.com/class` to specify the GPU type. For example:
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: gpu.nvidia.com/class
              operator: In
              values:
                - A100_PCIE_80GB
```
Certain resources, such as CPU and memory, are shared between Kubernetes and Slurm. To avoid errors, configure your workload's resources with both requirements in mind. See Manage Resources for more information.
The Pod's `terminationGracePeriodSeconds` must be less than the configured `--slurm-kill-wait` value passed to the scheduler, minus 5 seconds. For example, if `--slurm-kill-wait=30s`, the maximum `terminationGracePeriodSeconds` is `24`, because 24 < 30 - 5. Pods that fail this check are not scheduled and must be recreated with a valid `terminationGracePeriodSeconds`.
When deploying via the Slurm chart, the value passed to the scheduler is configured to match the value of `KillWait` in the Slurm config. If the Slurm config is modified outside of the chart, the scheduler deployment must also be updated with the new value.
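For example, with `--slurm-kill-wait=30s` as in the output shown earlier, a minimal Pod spec fragment that satisfies both the resource-request and grace-period rules might look like the following sketch (all values are illustrative):

```yaml
spec:
  schedulerName: tenant-slurm-staging-slurm-scheduler
  terminationGracePeriodSeconds: 24   # valid because 24 < 30 - 5
  containers:
    - name: main
      image: ubuntu
      resources:
        requests:
          cpu: "4"      # CPU and memory requests must be set and non-zero
          memory: 16Gi
```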
Scheduling Kubernetes Pods with annotations
Optional annotations can be added to a Pod to configure Slurm-specific settings, such as the account to use, job priority, or the partition.
SUNK Annotations Reference
The following annotations can be used to configure Slurm job parameters when scheduling Kubernetes Pods through SUNK.
Annotation | Type | Description | Example Value |
---|---|---|---|
sunk.coreweave.com/account | string | The Slurm account to use for the job. List available accounts with sacctmgr list account. | "root" |
sunk.coreweave.com/comment | string | A descriptive comment to add to the Slurm job. | "A Kubernetes job" |
sunk.coreweave.com/constraints | string | Node constraints for job placement. See the Slurm sbatch documentation for syntax. | |
For SUNK v5.7.0 and later: sunk.coreweave.com/exclusive | string | Run the job in exclusive mode, preventing other jobs from sharing the same nodes. See the Slurm exclusive documentation. Note: Refer to GPU resources to avoid problems when requesting GPU resources. | none, ok, user, mcs, topo |
For SUNK before v5.7.0: sunk.coreweave.com/exclusive | boolean | Run the job in exclusive mode, preventing other jobs from sharing the same nodes. See the Slurm exclusive documentation. Note: Refer to GPU resources to avoid problems when requesting GPU resources. | "true" |
sunk.coreweave.com/excluded-nodes | string | Comma-separated list of Slurm node names to exclude from job placement. | "a100-092-01,a100-092-02" |
sunk.coreweave.com/required-nodes | string | Comma-separated list of Slurm node names required for job placement. | "a100-092-04,a100-092-06" |
sunk.coreweave.com/partition | string | The Slurm partition to submit the job to. | "default" |
sunk.coreweave.com/prefer | string | Node preferences for job placement. See the Slurm prefer documentation for syntax. | |
sunk.coreweave.com/priority | integer | Job priority value. Must be an integer; non-integer values will cause scheduling to fail. | "100" |
sunk.coreweave.com/qos | string | Quality of Service (QoS) level for the job. | "normal" |
sunk.coreweave.com/reservation | string | Slurm reservation name to use for the job. | "test" |
sunk.coreweave.com/user-id | integer | User ID to run the job as. List users and IDs with id <user>. | "1001" |
sunk.coreweave.com/group-id | integer | Group ID to run the job as. List groups and IDs with id <user>. | "1002" |
sunk.coreweave.com/current-working-dir | string | Working directory path for the job execution. | "/path/to/use" |
sunk.coreweave.com/timeout | integer | Time limit in seconds for the Slurm placeholder job. Must be an integer value. | "123" |
Here's an example manifest for a Pod that runs exclusively on a node, so no other Pods or jobs can be scheduled alongside it:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: slurm-test
  namespace: tenant-slurm-staging
  annotations:
    sunk.coreweave.com/account: root
    sunk.coreweave.com/comment: "A Kubernetes job"
    sunk.coreweave.com/exclusive: "none"
spec:
  schedulerName: tenant-slurm-staging-slurm-scheduler # Formatted as <namespace>-<scheduler-name>
  containers:
    - name: main
      image: ubuntu
      command:
        - /bin/bash
      args:
        - -c
        - sleep infinity
      resources:
        limits:
          memory: 960Gi
          rdma/ib: "1"
          nvidia.com/gpu: "8"
        requests:
          cpu: "110"
          memory: 960Gi
          nvidia.com/gpu: "8"
  terminationGracePeriodSeconds: 5
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu.nvidia.com/model
                operator: In
                values:
                  - H100_NVLINK_80GB
```
- Invalid annotation values cause the Pod to fail scheduling.
- Changes to annotations after a Pod is scheduled do not affect the job.
- Slurm's Scheduler is not aware of Kubernetes factors such as taints or required affinities. If there is a conflict, for example between a Slurm scheduling decision and a taint, the Pod may be bound to a Node and then become stuck when the kubelet rejects it. These errors are visible when describing the Pod.
- The Slurm job defaults to the root user (uid=0, gid=0). If a user-id is specified without a corresponding group-id, the group-id is set to the user-id, as in the fragment below.
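For example, a Pod annotated with only a user-id runs its placeholder job with a matching group-id (the value here is illustrative):

```yaml
metadata:
  annotations:
    sunk.coreweave.com/user-id: "1001" # no group-id specified, so the group-id also defaults to 1001
```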
Manage Resources
When scheduling Kubernetes Pods within your Slurm cluster Nodes, it's important to understand how GPU, CPU, and memory resources are allocated when Kubernetes Pods and Slurm jobs coexist on the same Node.
Shared Resources
Kubernetes Pods and Slurm jobs share CPU and memory resources. This allows you to schedule them alongside each other on the same Node, but can lead to errors if configured incorrectly.
To avoid conflicts, lower the resource requests for the `slurmd` container. The default SUNK NodeSets must be changed to effectively schedule Kubernetes Pods via the SUNK Scheduler. Failing to lower the requests leads to `OutOfMemory` and `OutOfcpu` rejections of Pods by the kubelet.
Add a compute definition to the NodeSets to lower the CPU and memory requests:
```yaml
# Lower the requests so k8s workloads can be scheduled onto the nodes.
# Keep the limits high so slurm jobs can still use the full node.
low-requests:
  resources:
    requests:
      cpu: 10
      memory: 10Gi
```
Always account for the shared nature of CPU and memory resources when configuring the Scheduler.
CPU
`slurmd` requests subtract from the CPU resources available to Pods scheduled via the Slurm Scheduler integration. If a Kubernetes Pod requests more CPU than remains on the Node after accounting for `slurmd` requests, the kubelet rejects it with an `OutOfcpu` error. For example, on a Node with 128 CPUs where `slurmd` requests 10, a Pod requesting more than 118 CPUs (less whatever DaemonSets and other system Pods request) cannot be placed.
Memory
If the `slurmd` container is configured with high memory requests, it may prevent Kubernetes Pods from being scheduled on the same Node. Configure the `slurmd` container with lower memory requests, such as 10Gi, to allow Kubernetes Pods to be scheduled alongside Slurm jobs. `OutOfMemory` kubelet rejections of Pods can still occur if the scheduler packs a Node completely full. `MemSpecLimit` can be set in the Slurm NodeSet configuration to reserve memory for `slurmd` and other system Pods (such as DaemonSets). For an example manifest that changes `slurmd` resources, refer to Configure Compute nodes.
The Slurm scheduler is not aware of any Pods scheduled with the standard Kubernetes scheduler. Scheduling Pods (except for DaemonSets) on Nodes running Slurm with anything other than the SUNK Scheduler is highly discouraged.
Oversubscription
Out-of-memory (OOM) errors can occur if oversubscription is allowed. Oversubscription enables multiple Slurm jobs to run on the same Node, even when their combined memory requirements exceed the Node's memory capacity. To prevent OOM errors and ensure system stability, configure memory as a consumable resource by setting `SelectTypeParameters` to `CR_CPU_Memory`, `CR_Core_Memory`, or `CR_Socket_Memory`.
Setting `SelectTypeParameters` to `CR_CPU`, `CR_Core`, or `CR_Socket` is discouraged because these values allow memory oversubscription.
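In a SUNK deployment the Slurm configuration is typically managed through the chart, but the underlying slurm.conf setting looks like the following sketch. The choice of `select/cons_tres` here is an assumption; use whichever consumable-resource select plugin your cluster already runs:

```
# slurm.conf sketch: treat memory as a consumable resource so jobs cannot oversubscribe it
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
```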
GPU resources
GPUs cannot be shared between Kubernetes Pods and Slurm jobs. The Kubernetes k8s-device-plugin and Slurm's GPU allocation mechanisms are incompatible, leading to conflicts if both are used simultaneously.
When scheduling Kubernetes Pods using GPUs, you have the following options:
- When a Pod uses all GPUs in a node, or it is otherwise acceptable to reserve the full node for the job, use `sunk.coreweave.com/exclusive: "none"`.
- To dedicate a specific set of nodes to Kubernetes Pods, create a reservation in Slurm and schedule Kubernetes Pods into it with the `sunk.coreweave.com/reservation` annotation (see the fragment after this list). This annotation also ensures that the Pod is scheduled on a Node with no other Slurm jobs running.
- Create a dedicated user in your Slurm environment for Kubernetes Pods, set exclusivity to be per user, and supply the `user-id` of that user on all Kubernetes Pods. Only other Kubernetes Pods can then share the same node and allocate GPUs. This is the best option when running Kubernetes Pods that request less than a full node's worth of GPUs.
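For the reservation approach, the Pod annotation might look like this fragment. The reservation name `k8s-pods` is an assumption for illustration; the reservation must already exist in Slurm:

```yaml
metadata:
  annotations:
    sunk.coreweave.com/reservation: "k8s-pods" # name of an existing Slurm reservation set aside for Kubernetes Pods
```

The manifest below shows the dedicated-user approach, setting exclusivity per user and supplying the `user-id` of the Kubernetes user: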
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: slurm-test
  namespace: tenant-slurm-staging
  annotations:
    sunk.coreweave.com/account: root
    sunk.coreweave.com/comment: "A single GPU Pod"
    sunk.coreweave.com/exclusive: "user"
    sunk.coreweave.com/user-id: "1001" # The user id of the special Slurm user created for use by Kubernetes jobs
spec:
  schedulerName: tenant-slurm-staging-slurm-scheduler # Formatted as <namespace>-<scheduler-name>
  containers:
    - name: main
      image: ubuntu
      command:
        - /bin/bash
      args:
        - -c
        - sleep infinity
      resources:
        limits:
          memory: 256Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "8"
          memory: 128Gi
          nvidia.com/gpu: "1"
  terminationGracePeriodSeconds: 5
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu.nvidia.com/model
                operator: In
                values:
                  - H100_NVLINK_80GB
```
Time limits
A Slurm job's time limit impacts a Kubernetes Pod scheduled through Slurm differently than a normal Slurm job. The Kubernetes time limit begins when a Pod is scheduled onto a Node. It includes the time spent initializing the Pod before the container is ready and the prolog script starts. Slurm's time limit begins later, after the prolog script is complete and the job is fully running. The Slurm time limit can be specified using the `sunk.coreweave.com/timeout` annotation.
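For example, to give the placeholder job a one-hour time limit, annotate the Pod as in this fragment (the value is illustrative):

```yaml
metadata:
  annotations:
    sunk.coreweave.com/timeout: "3600" # Slurm time limit for the placeholder job, in seconds
```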