SUNK (Slurm on Kubernetes)
Highly scalable cluster resource management and job scheduling
SUNK (SlUrm oN Kubernetes) is an implementation of Slurm that is deployed on Kubernetes with a Helm chart.
Slurm is a fault-tolerant, highly scalable cluster resource management and job scheduling system that strives to be portable and interconnect-agnostic. Slurm is the de facto scheduler for large HPC jobs in supercomputing centers, government laboratories, universities, and companies worldwide. It performs workload management for more than half of the ten fastest systems on the TOP500 list.
Quickstart
Ready to get started? See the SUNK Quickstart guide to learn how to deploy Slurm on SUNK.
SUNK Features
- SUNK allows you to deploy Slurm on Kubernetes with GitOps, and it adds a Slurm scheduler plugin to Kubernetes. This allows Slurm jobs to run inside Kubernetes, so that Kubernetes and Slurm workloads share the same compute resources for maximum efficiency.
- SUNK can use external Directory Services such as Active Directory or OpenLDAP, and it has support for Slurm Accounting backed by a MySQL database.
- SUNK can dynamically scale Slurm nodes to match the workload requirements.
- In SUNK, Slurm images are derived from OCI container images and execute on bare metal. CoreWeave has base images published in the SUNK project for different CUDA versions, including all dependencies for InfiniBand and SHARP.
How is SUNK different?
The traditional means of running Slurm is to run slurmd nodes on either bare metal or virtual machines. Both of these options have challenges:
- Host systems must be provisioned in advance with the required runtime libraries.
- Development and production environments must stay in sync.
- Bare metal infrastructure is difficult to scale cost-effectively.
- Scaling virtual machines is slower than scaling Kubernetes containers.
- Virtual machines are slower due to the inherent virtualization overhead.
In contrast, Kubernetes containers are lightweight, start quickly, and can be scaled up and down in seconds.
In SUNK, Slurm nodes run in Kubernetes Pods. These are not the same as Kubernetes Nodes, which are the worker machines that run the Pods. To distinguish between them in this article, Kubernetes Nodes are capitalized.
With SUNK, each Slurm node is a Kubernetes Pod running a slurmd container. SUNK uses declarative node definitions to define and scale a Slurm cluster. These node definitions expose Kubernetes resource requests, limits, node affinities, and replica counts. SUNK manages the required resources by deploying them as needed and scaling down when possible.
Each Slurm node has a mechanism for specifying the desired container image. This allows for the creation of lightweight, purpose-built containers that ensure consistency between environments and enable rapid scaling up or down.
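To make the idea of a declarative node definition more concrete, the following is a minimal Python sketch of the kind of information such a definition carries. It is not SUNK's actual schema: the NodeDefinition class, its field names, the image path, the affinity label, and the Pod naming scheme are all hypothetical and exist only for illustration. The real definitions are set in the Helm chart values.

```python
from dataclasses import dataclass, field, replace


@dataclass
class NodeDefinition:
    """Hypothetical model of one group of Slurm nodes (not the SUNK schema)."""
    name: str
    image: str       # container image that runs slurmd
    replicas: int    # number of Slurm nodes, one Pod each
    requests: dict = field(default_factory=dict)       # Kubernetes resource requests
    limits: dict = field(default_factory=dict)         # Kubernetes resource limits
    node_affinity: dict = field(default_factory=dict)  # labels a Kubernetes Node must carry

    def pod_names(self) -> list:
        """Illustrative naming only: one Pod per replica, numbered from zero."""
        return [f"{self.name}-{i}" for i in range(self.replicas)]


def scale(defn: NodeDefinition, replicas: int) -> NodeDefinition:
    """Return a copy with a new replica count; every other field is unchanged."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return replace(defn, replicas=replicas)


# Example values are assumptions, not published SUNK defaults.
gpu_nodes = NodeDefinition(
    name="gpu-a",
    image="registry.example.com/slurmd-cuda:latest",
    replicas=2,
    requests={"cpu": "64", "memory": "512Gi", "nvidia.com/gpu": "8"},
    limits={"nvidia.com/gpu": "8"},
    node_affinity={"gpu-type": "example"},
)

print(gpu_nodes.pod_names())
print(scale(gpu_nodes, 4).pod_names())
```

Because the definition is plain data, scaling amounts to changing one number and letting the platform reconcile the difference, which is the property the declarative approach relies on.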
Configuration and deployment
SUNK packages everything needed to run a Slurm cluster in a configurable Helm chart that deploys Kubernetes resources. Using a standard GitOps workflow, this Helm chart can customize CoreWeave's published base images that preinstall CUDA, NCCL, InfiniBand, environment modules, and Conda. SUNK deploys not only the Slurm components like Accounting and Controller in Kubernetes, but also the topology, Prolog and Epilog scripts, and the Slurm configuration file.
The Helm chart also declares features like:
- s6 scripts and services
- SSO using any LDAP-compatible IdP, such as OpenLDAP, Okta, Authentik, or Google Secure LDAP
- Linux cgroups for tracking and enforcement
Kubernetes integration
SUNK inherits all the benefits of Kubernetes. Because Kubernetes automatically restarts the Slurm controller if it fails its liveness probe, running two or more Slurm controllers to achieve high availability is no longer necessary.
Integration with Kubernetes also allows you to:
- Scale Slurm compute nodes up and down on demand
- Mount Persistent Volumes that share data across Slurm nodes and Kubernetes workloads
- Run native Kubernetes workloads from the Slurm scheduler to handle serverless and bursty workloads
- Mix different GPU types and CPU-only nodes in the same cluster using requests and limits for resource management
- Assign task affinity for Slurm-driven CPU co-location
State management
When you deploy a Slurm cluster on top of Kubernetes with SUNK, it comes with a custom Kubernetes Scheduler. Through this scheduler, everything that happens in the Slurm cluster is tracked in Kubernetes. Likewise, events that happen in Kubernetes are propagated to Slurm.
This allows Kubernetes workloads to be scheduled with the Slurm Scheduler. Within Slurm, Kubernetes jobs are visible running alongside normal Slurm jobs on the same hardware Nodes. When used this way, the Slurm Scheduler manages the preemption, partitions, and priorities for all workloads.
SUNK also handles:
- Automatic topology generation
- Pyxis container execution
- GRES auto-identification of resources
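As an illustration of what topology generation involves, the sketch below groups nodes by the switch they attach to and emits lines in the format of Slurm's topology.conf (SwitchName=, Nodes=, and Switches= entries). This is not SUNK's implementation: the build_topology function, the node and switch names, and the assumption of a single root switch are hypothetical, and how SUNK actually discovers the fabric layout is not described here.

```python
from collections import defaultdict


def build_topology(node_to_switch: dict, root: str = "spine") -> str:
    """Render a two-level switch tree as topology.conf-style text.

    node_to_switch maps a Slurm node name to the leaf switch it hangs off.
    The single root switch named by `root` is an assumption of this sketch.
    """
    leaves = defaultdict(list)
    for node, switch in node_to_switch.items():
        leaves[switch].append(node)
    if not leaves:
        return ""

    # One line per leaf switch listing its nodes, in a stable order.
    lines = [
        f"SwitchName={switch} Nodes={','.join(sorted(nodes))}"
        for switch, nodes in sorted(leaves.items())
    ]
    # One line for the root switch that connects all leaf switches.
    lines.append(f"SwitchName={root} Switches={','.join(sorted(leaves))}")
    return "\n".join(lines)


# Hypothetical mapping, for illustration only.
example = {"gpu-0": "leaf1", "gpu-1": "leaf1", "gpu-2": "leaf2"}
print(build_topology(example))
```

The scheduler uses this kind of tree to prefer placing a job's nodes under as few switches as possible, which is why keeping it in sync with the real hardware matters.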
Familiar Slurm environment
Once SUNK is deployed, Kubernetes is abstracted away. Developers connect to a Slurm Login node with SSH and use the familiar Slurm features. SUNK supports Slurm Accounting with either a locally deployed slurmdbd and MySQL, or a remote slurmdbd. SUNK also uses Slurm partitions for resource allocation.
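From the Login node, work is submitted the same way as on any Slurm cluster. The sketch below renders a small batch script using standard #SBATCH directives (--job-name, --partition, --nodes, --gres) and srun. The render_batch_script helper, the partition name, and the training command are hypothetical; the partitions available on a given cluster depend on its configuration.

```python
def render_batch_script(job_name: str, partition: str, nodes: int,
                        gpus_per_node: int, command: str) -> str:
    """Build the text of a Slurm batch script from a few job parameters."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",   # Slurm partition to allocate from
        f"#SBATCH --nodes={nodes}",           # number of Slurm nodes
        f"#SBATCH --gres=gpu:{gpus_per_node}",  # GPUs requested on each node
        f"srun {command}",                    # launch the command on the allocation
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job; "gpu" is an assumed partition name.
script = render_batch_script(
    job_name="train",
    partition="gpu",
    nodes=2,
    gpus_per_node=8,
    command="python train.py",
)
print(script)
```

Saving the output as a file such as train.sh and running sbatch train.sh on the Login node queues the job, and squeue shows its state.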
Next steps
See the SUNK Quickstart guide to learn how to deploy Slurm on SUNK.