About SUNK
Slurm on Kubernetes: highly-scalable cluster resource management and job scheduling for training workloads
To set up SUNK in your CKS cluster, please reach out to your CoreWeave representative.
Slurm is a fault-tolerant, highly scalable cluster resource management and job scheduling system that strives to be portable and interconnect-agnostic. Slurm is the de facto scheduler for large HPC jobs in supercomputing centers, government laboratories, universities, and companies worldwide. It performs workload management for more than half of the 10 fastest systems on the TOP500 list.
SUNK (Slurm on Kubernetes) is an implementation of Slurm deployed on Kubernetes using Helm. It is a robust and adaptable solution designed to streamline the management of Slurm clusters within a scalable Kubernetes environment.
SUNK enables its users to:
- Run training and inference workloads on the same cluster by running Slurm jobs inside Kubernetes
- Share the same compute resources between workloads
- Choose which kinds of workloads run across all of the designated compute
This greater workload fungibility increases resource efficiency and streamlines workload deployment.
How SUNK works with Kubernetes
SUNK integrates Slurm as a Kubernetes scheduler, which allows Slurm jobs to run inside Kubernetes. This architecture enables support for both burst and batch workloads on the same platform, and allows developers to leverage Slurm's resource management on Kubernetes.
Without SUNK, developers typically must choose between Slurm and Kubernetes, or else manage deployments of both independently. Keeping Slurm and Kubernetes separate may make each individual deployment simpler, but it greatly reduces workload flexibility, making it more difficult to maximize utilization of GPU resources.
SUNK also tightly integrates with the advanced hardware lifecycle management and observability system built into CoreWeave Kubernetes Service (CKS). Node operators running in CKS can take action on unhealthy nodes. That information is also propagated into SUNK, and vice-versa. All Slurm metrics and logs are pulled into CoreWeave Observability, providing a single-pane-of-glass experience.
|  | Slurm | Kubernetes | SUNK |
| --- | --- | --- | --- |
| Developer | SchedMD | Google (now open source) | CoreWeave |
| Scheduler | The de facto scheduler for high-performance computing (HPC) workloads, used by researchers, academics, and AI companies across the world | The built-in Kubernetes scheduler lacks some HPC functionality; other Kubernetes-based schedulers, such as Volcano and YuniKorn, aim to provide Slurm-like scheduling capabilities | Brings Kubernetes containerized deployments and GitOps to Slurm, and integrates a Slurm scheduler plugin into Kubernetes |
| Workloads | Designed for batch jobs that have a finite lifespan | Designed for long-running workloads, such as inference workloads | Designed for both burst and batch workloads on one central platform |
How is SUNK different from vanilla Slurm?
Traditional Slurm method
The traditional method of running Slurm is to run `slurmd` nodes on either bare metal or virtual machines. Both options present challenges:
- Host systems must be provisioned in advance with the required runtime libraries.
- Development and production environments must stay in sync.
- Bare metal infrastructure is difficult to scale cost-effectively.
- Scaling virtual machines is slower than scaling Kubernetes containers.
- Virtual machines incur performance overhead from virtualization.
Slurm on Kubernetes
In contrast, Kubernetes containers are lightweight, start quickly, and can be scaled up and down in seconds.
- SUNK allows you to deploy Slurm on Kubernetes with GitOps and adds a Slurm scheduler plugin to Kubernetes, letting Slurm jobs run inside Kubernetes and share the same compute resources with Kubernetes workloads for maximum efficiency.
- SUNK can use external Directory Services, such as Active Directory or OpenLDAP, and it supports Slurm Accounting backed by a MySQL database.
- SUNK can dynamically scale Slurm nodes to match the workload requirements.
- SUNK derives Slurm images from OCI container images and executes them on bare metal. CoreWeave has base images published in the SUNK project for different CUDA versions, including all dependencies for InfiniBand and SHARP.
SUNK also inherits the benefits of Kubernetes itself. For example, because Kubernetes restarts the Slurm controller whenever it fails its liveness or readiness probes, running two or more Slurm controllers is no longer necessary for high availability.
Integration with Kubernetes also allows you to:
- Scale Slurm compute nodes up and down on demand
- Mount Persistent Volumes that share data across Slurm nodes and Kubernetes workloads
- Run native Kubernetes workloads from the Slurm scheduler to handle serverless and bursty workloads
- Mix different GPU types and CPU-only nodes in the same cluster, using requests and limits for resource management (see the example after this list)
- Assign task affinity for Slurm-driven CPU co-location
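For example, heterogeneous hardware is typically exposed as separate Slurm partitions so that jobs can target the right node type. The partition names and resource counts below are illustrative, not SUNK defaults:

```bash
# Submit a training job to a GPU partition, requesting 8 GPUs per node.
# Partition names and resource counts are illustrative, not SUNK defaults.
sbatch --partition=h100 --nodes=2 --gres=gpu:8 train.sh

# Submit a preprocessing job to a CPU-only partition in the same cluster.
sbatch --partition=cpu --cpus-per-task=32 preprocess.sh
```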
In SUNK, Slurm nodes run in Kubernetes Pods. These are not the same as Kubernetes Nodes, which are the worker machines that run the Pods. To distinguish between them in this article, Kubernetes Nodes are capitalized.
Configuration and deployment
With SUNK, each Slurm node is a Kubernetes Pod running a `slurmd` container. SUNK uses declarative node definitions to define and scale a Slurm cluster. These node definitions expose Kubernetes resource requests, limits, node affinities, and replica counts. SUNK manages the required resources, deploying them as needed and scaling down when possible.
Each Slurm node has a mechanism for specifying the desired container image. This allows for the creation of lightweight, purpose-built containers that ensure consistency between environments and enable rapid scaling up or down.
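To make this concrete, here is a hypothetical sketch of a node definition as it might appear in a configuration values file. The key names and labels are illustrative only; the real schema comes from the SUNK chart version you deploy:

```bash
# Hypothetical sketch of a declarative SUNK node definition in a values file.
# Key names and labels are illustrative; consult your SUNK chart for the schema.
cat > values.yaml <<'EOF'
nodes:
  gpu-workers:
    replicas: 4                          # number of slurmd Pods to run
    image: <registry>/<slurmd-image>:<tag>
    resources:
      requests:
        cpu: "96"
        memory: 960Gi
        nvidia.com/gpu: 8
      limits:
        nvidia.com/gpu: 8
    affinity:                            # standard Kubernetes node affinity
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: gpu.nvidia.com/class
                  operator: In
                  values: ["H100"]
EOF
```

Scaling the cluster up or down then amounts to changing `replicas` and re-applying the configuration.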
SUNK packages everything needed to run a Slurm cluster in a configurable Helm chart that deploys Kubernetes resources. Using a standard GitOps workflow, this Helm chart can customize CoreWeave's published base images that preinstall CUDA, NCCL, InfiniBand, environment modules, and Conda. SUNK deploys not only the Slurm components like Accounting and Controller in Kubernetes, but also the topology, Prolog and Epilog scripts, and the Slurm configuration file.
The Helm chart also declares features like:
- s6 scripts and services
- SSO using any LDAP-compatible IdP, such as OpenLDAP, Okta, Authentik, or Google Secure LDAP
- Linux cgroups for tracking and enforcement
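Deploying or updating the cluster then follows the standard Helm workflow. The release name, namespace, and chart reference below are placeholders for your environment:

```bash
# Install or upgrade a SUNK-based Slurm cluster with Helm.
# CHART_REF, the release name, and the namespace are placeholders.
CHART_REF=oci://example.registry/sunk   # placeholder chart reference
helm upgrade --install slurm "$CHART_REF" \
  --namespace slurm \
  --create-namespace \
  --values values.yaml
```

In a GitOps workflow, a tool such as Argo CD or Flux applies the same chart and values from a Git repository instead of a manual `helm` command.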
Once SUNK is deployed, Kubernetes is abstracted away: developers connect to a Slurm login node over SSH and use familiar Slurm features, without needing to know Kubernetes.
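For example, a session looks like any other Slurm cluster (the hostname is a placeholder):

```bash
# Log in to the Slurm login node; the hostname is a placeholder.
ssh researcher@slurm-login.example.com

# Familiar Slurm commands work as usual:
sinfo               # list partitions and node states
squeue --me         # show your own queued and running jobs
sbatch train.sh     # submit a batch job
```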
State management
When you deploy a Slurm cluster on top of Kubernetes with SUNK, it includes a custom Kubernetes scheduler. This scheduler ensures that all events within the Slurm cluster are tracked in Kubernetes, and that all events within Kubernetes are propagated to Slurm.
This allows Kubernetes workloads to be scheduled by the Slurm scheduler. Within Slurm, Kubernetes jobs are visible alongside normal Slurm jobs on the same hardware Nodes. When used this way, the Slurm scheduler manages preemption, partitions, and priorities for all workloads.
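On the Kubernetes side, a workload opts into a custom scheduler through the Pod's `schedulerName` field. The sketch below assumes the SUNK scheduler is registered under the name `slurm-scheduler`, which is hypothetical; use the scheduler name from your deployment:

```bash
# Schedule a regular Kubernetes Pod through a custom scheduler by setting
# spec.schedulerName. "slurm-scheduler" is a hypothetical name; use the
# scheduler name from your SUNK deployment.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: inference-example
spec:
  schedulerName: slurm-scheduler
  containers:
    - name: inference
      image: example.registry/inference:latest  # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
```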
SUNK also handles:
- Automatic topology generation
- Pyxis container execution (see the example after this list)
- Automatic GRES (generic resource) identification
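Pyxis is NVIDIA's SPANK plugin for running Slurm job steps in unprivileged containers. With Pyxis enabled, a step can run directly from an OCI image; the image reference below is just an example:

```bash
# Run a job step inside a container with the Pyxis plugin.
# The image reference is an example; any reachable OCI image works.
srun --container-image=nvcr.io#nvidia/pytorch:24.01-py3 \
  python train.py
```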
Key features
SUNK offers many key features that make it a powerful tool for managing Slurm clusters in a Kubernetes environment. Here are some of the most important features:
- Job Scheduling and Management: SUNK leverages Slurm for job scheduling and management within the Kubernetes environment. This includes features like job accounting, task plugins, and running Slurm jobs in containers.
- Kubernetes-Native Slurm Management: SUNK bridges the gap between Slurm and Kubernetes, enabling users to manage Slurm clusters using Kubernetes-native tools and practices. This includes deploying Slurm clusters as Kubernetes resources, using Helm charts for deployment, and leveraging Kubernetes features like ConfigMaps and Secrets.
- Topology-Aware Scheduling: SUNK leverages cluster topology information to optimize job scheduling and resource allocation. It supports Slurm's tree and block plugins for topology-aware scheduling (an example topology file follows this list).
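For reference, Slurm's `topology/tree` plugin reads a `topology.conf` file that maps nodes to switches. SUNK generates this file automatically from cluster topology information; the switch and node names below are only an illustration of the format:

```bash
# Illustrative topology.conf for Slurm's topology/tree plugin.
# SUNK generates this automatically; switch and node names are hypothetical.
cat > topology.conf <<'EOF'
SwitchName=leaf1 Nodes=gpu-[001-016]
SwitchName=leaf2 Nodes=gpu-[017-032]
SwitchName=spine1 Switches=leaf1,leaf2
EOF
```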
MLOps features
- Task plugins: SUNK supports task plugins, which allow ML Researchers to customize job execution and gather metrics, adapting to the unique needs of their workload.
- Job scheduling and management: ML Researchers can leverage Slurm's powerful job scheduling capabilities within SUNK to manage their training jobs, experiments, and workflows, using the familiar Slurm interface without needing to learn a new system.
- Kubernetes integration: SUNK's integration with Kubernetes provides ML Researchers access to a wide range of tools and services, such as distributed storage, monitoring, and logging.
- Run scripts with s6: ML Researchers can use s6 scripts to install software, configure dependencies, and set up environments on Slurm nodes, ensuring a consistent and reproducible execution environment.
- Run Slurm jobs in containers: ML Researchers can run Slurm jobs inside containers for improved portability, reproducibility, and isolation of execution environments (see the sketch after this list).
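A minimal batch script tying these features together might look like the following; the resource counts, image, and paths are illustrative:

```bash
#!/bin/bash
# Illustrative containerized training job; resource counts, image,
# and paths are examples, not SUNK defaults.
#SBATCH --job-name=llm-train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --output=%x-%j.out

# Launch one task per GPU inside a container via Pyxis.
srun --container-image=nvcr.io#nvidia/pytorch:24.01-py3 \
     --container-mounts=/mnt/data:/data \
     python train.py --data-dir /data
```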
Infrastructure features
- Device plugins: SUNK supports Kubernetes device plugins, enabling Infrastructure engineers to make system hardware resources, such as NVIDIA GPUs, available to Kubernetes.
- Sidecar containers: Infrastructure engineers can add sidecar containers to Slurm login and compute nodes to extend their functionality and enhance job execution.
- Linux cgroups: Infrastructure engineers can use Linux cgroups to manage resource limits, such as CPU, memory, and GPU usage, ensuring fair allocation and job isolation.
- Deployment management with CI and GitOps: Infrastructure engineers can use CI tools and GitOps practices to manage SUNK deployments, ensuring that clusters are always up to date and consistent.
- Prolog and Epilog scripts: Infrastructure engineers can use Prolog and Epilog scripts to prepare the environment before job execution and clean up resources afterward, enhancing job lifecycle management (a minimal example follows this list).
- Lua plugins: Infrastructure engineers can use Lua scripts to customize job submission and execution, tailoring the behavior of the Slurm cluster to their specific requirements.
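As a minimal illustration of the Prolog hook (the script path itself is wired up through the Slurm configuration that SUNK deploys), a Prolog is just an executable that `slurmd` runs on each node before the job starts:

```bash
#!/bin/bash
# Minimal illustrative Slurm Prolog script. slurmd runs it on each node
# before the job starts; SLURM_JOB_ID and SLURM_JOB_USER are set by Slurm.
set -euo pipefail

# Create a per-job scratch directory owned by the submitting user.
SCRATCH="/tmp/job-${SLURM_JOB_ID}"
mkdir -p "${SCRATCH}"
chown "${SLURM_JOB_USER}" "${SCRATCH}"

exit 0
```

A matching Epilog script would remove the directory after the job completes.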
To get started with SUNK, reach out to your CoreWeave representative.