SUNK (Slurm on Kubernetes)
Highly scalable cluster resource management and job scheduling
SUNK (SlUrm oN Kubernetes) is an implementation of Slurm deployed on Kubernetes via a Helm chart.
Slurm is a fault-tolerant, highly scalable cluster resource management and job scheduling system that strives to be portable and interconnect-agnostic. Slurm is the de facto scheduler for large HPC jobs in supercomputing centers, government laboratories, universities, and companies worldwide. It performs workload management for more than half of the ten fastest systems on the TOP500 list.
SUNK Features
- SUNK lets you deploy Slurm on Kubernetes with GitOps and adds a Slurm scheduler plugin to Kubernetes, so Slurm jobs run inside Kubernetes and Slurm and Kubernetes workloads share the same compute resources for maximum efficiency.
- SUNK can use external directory services such as Active Directory or OpenLDAP, and it supports Slurm Accounting backed by a MySQL database.
- SUNK can dynamically scale Slurm nodes to match the workload requirements.
- In SUNK, Slurm images are derived from OCI container images and execute directly on bare metal. CoreWeave publishes base images in the SUNK project for different CUDA versions, including all dependencies for InfiniBand and SHARP.
How is SUNK different?
The traditional means of running Slurm is to run slurmd nodes on either bare metal or virtual machines. Both of these options have challenges:
- Host systems must be provisioned in advance with the required runtime libraries.
- Development and production environments must stay in sync.
- Bare metal infrastructure is difficult to scale cost-effectively.
- Scaling virtual machines is slower than scaling Kubernetes containers.
- Virtual machines incur inherent virtualization overhead.
In contrast, Kubernetes containers are lightweight, start quickly, and can be scaled up and down in seconds.
In SUNK, Slurm nodes run in Kubernetes Pods. These are not the same as Kubernetes Nodes, which are the worker machines that run the Pods. To distinguish between them in this article, Kubernetes Nodes are capitalized.
With SUNK, each Slurm node is a Kubernetes Pod running a slurmd container. SUNK uses declarative node definitions to define and scale a Slurm cluster. These node definitions expose Kubernetes resource requests, limits, node affinities, and replica counts. SUNK manages the required resources by deploying them as needed and scaling down when possible.
Each node definition also specifies the container image to run. This allows for lightweight, purpose-built containers that keep environments consistent and enable rapid scaling up or down.
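As a concrete illustration, a node definition in the chart's values might look something like the sketch below. The key names, image reference, and node label are hypothetical placeholders, not the actual SUNK chart schema; consult the chart itself for the real structure.

```yaml
# Hypothetical values sketch -- key names, image, and labels are
# illustrative placeholders, not the actual SUNK chart schema.
nodes:
  gpu-a100:
    replicas: 4                                # slurmd Pods to run
    image: ghcr.io/example/slurmd-cuda:12.2    # placeholder image reference
    resources:
      requests:
        cpu: "32"
        memory: 256Gi
        nvidia.com/gpu: 8
      limits:
        nvidia.com/gpu: 8
    affinity:                                  # standard Kubernetes node affinity
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: gpu.example.com/model   # placeholder node label
                  operator: In
                  values: ["A100"]
```

Scaling the Slurm cluster then reduces to changing replicas and letting SUNK reconcile the corresponding Pods.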
Configuration and deployment
SUNK packages everything needed to run a Slurm cluster in a configurable Helm chart that deploys Kubernetes resources. Using a standard GitOps workflow, this Helm chart can customize CoreWeave's published base images that preinstall CUDA, NCCL, InfiniBand, environment modules, and Conda. SUNK deploys not only the Slurm components like Accounting and Controller in Kubernetes, but also the topology, Prolog and Epilog scripts, and the Slurm configuration file.
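As one possible GitOps workflow (SUNK does not mandate a particular tool), an Argo CD Application could point at a repository that holds the chart configuration. The repository URL and path below are placeholders.

```yaml
# Hypothetical Argo CD Application -- the repository URL and path
# are placeholders for wherever your SUNK values live.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: sunk
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/infra/slurm-clusters.git
    path: clusters/training
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: slurm
  syncPolicy:
    automated:
      prune: true       # remove resources deleted from Git
      selfHeal: true    # revert out-of-band changes
```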
The Helm chart also declares features like:
- s6 scripts and services
- SSO using any LDAP-compatible IdP, such as OpenLDAP, Okta, Authentik, or Google Secure LDAP
- Linux cgroups for tracking and enforcement
Kubernetes integration
SUNK inherits all the benefits of Kubernetes. Because Kubernetes automatically restarts the Slurm controller if it fails its liveness probe, running two or more Slurm controllers to achieve high availability is no longer necessary.
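As a sketch of how that works, a liveness probe on the controller container could exec scontrol ping, which exits nonzero when slurmctld stops responding. The image and probe values below are illustrative; the actual SUNK chart defines its own probes.

```yaml
# Illustrative probe only -- the SUNK chart ships its own configuration.
containers:
  - name: slurmctld
    image: ghcr.io/example/slurmctld:latest   # placeholder image
    livenessProbe:
      exec:
        command: ["scontrol", "ping"]         # fails when slurmctld is unresponsive
      initialDelaySeconds: 30
      periodSeconds: 10
      failureThreshold: 3
```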
Integration with Kubernetes also allows you to:
- Scale Slurm compute nodes up and down on demand
- Mount Persistent Volumes that share data across Slurm nodes and Kubernetes workloads (see the sketch after this list)
- Run native Kubernetes workloads from the Slurm scheduler to handle serverless and bursty workloads
- Mix different GPU types and CPU-only nodes in the same cluster using requests and limits for resource management
- Assign task affinity for Slurm-driven CPU co-location
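For the shared-data item above, any ReadWriteMany PersistentVolumeClaim works. The claim below is a plain Kubernetes resource with a placeholder storage class name; how it gets mounted into the slurmd Pods is chart configuration.

```yaml
# A standard RWX claim that Slurm node Pods and ordinary Kubernetes
# workloads can both mount; the storage class name is a placeholder.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-scratch
  namespace: slurm
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: shared-filesystem
  resources:
    requests:
      storage: 10Ti
```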
State management
When you deploy a Slurm cluster on top of Kubernetes with SUNK, it comes with a custom Kubernetes Scheduler. With this, everything that happens in the Slurm cluster is tracked in Kubernetes. Likewise, events that happen in Kubernetes are propagated to Slurm.
This allows Kubernetes workloads to be scheduled with the Slurm Scheduler. Within Slurm, Kubernetes jobs appear alongside normal Slurm jobs on the same hardware Nodes. When used this way, the Slurm Scheduler manages the preemption, partitions, and priorities for all workloads.
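Concretely, a Kubernetes workload opts into a non-default scheduler through the Pod's standard schedulerName field. The scheduler name below is a placeholder for whatever name your SUNK deployment registers.

```yaml
# Hypothetical example -- the schedulerName value depends on your deployment.
apiVersion: batch/v1
kind: Job
metadata:
  name: preprocess
  namespace: slurm
spec:
  template:
    spec:
      schedulerName: slurm-scheduler   # placeholder for the SUNK scheduler's name
      restartPolicy: Never
      containers:
        - name: main
          image: python:3.12-slim
          command: ["python", "-c", "print('scheduled by Slurm')"]
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
```

A Job submitted this way is then subject to the same preemption, partition, and priority rules as any sbatch job.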
SUNK also handles:
- Automatic topology generation
- Pyxis container execution
- Auto-identification of GRES (generic resources)
Familiar Slurm environment
Once SUNK is deployed, Kubernetes is abstracted away. Developers connect to a Slurm Login node with SSH and use the familiar Slurm features. SUNK supports Slurm Accounting with either a locally deployed slurmdbd and MySQL or a remote slurmdbd. SUNK also uses Slurm partitions for resource allocation.
Next steps
If you are a current client looking to set up SUNK in your cluster, please reach out to your CoreWeave representative.