SUNK Concepts Overview
SUNK (Slurm on Kubernetes) is a robust, adaptable solution for managing Slurm clusters within a scalable Kubernetes environment. By bridging these two technologies, SUNK gives ML Researchers and Infrastructure Engineers a comprehensive set of features for AI workload orchestration. This Concepts section explains the core principles and functionality that underpin SUNK so you can apply its capabilities to your own use cases.
Key SUNK features
SUNK offers many features that make it a powerful tool for managing Slurm clusters in a Kubernetes environment. Some of the most important include:
- Job Scheduling and Management: SUNK leverages Slurm for job scheduling and management within the Kubernetes environment. This includes features like job accounting, task plugins, and running Slurm jobs in containers.
- Kubernetes-Native Slurm Management: SUNK bridges the gap between Slurm and Kubernetes, enabling users to manage Slurm clusters with Kubernetes-native tools and practices. This includes deploying Slurm clusters as Kubernetes resources, using Helm charts for deployment, and leveraging Kubernetes features like ConfigMaps and Secrets (see the ConfigMap sketch after this list).
- Topology-Aware Scheduling: SUNK uses cluster topology information to optimize job scheduling and resource allocation, and supports Slurm's tree and block topology plugins (a topology.conf sketch follows this list).
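Because the Slurm configuration lives in standard Kubernetes objects, it can be inspected with ordinary Kubernetes tooling. The sketch below uses the official Kubernetes Python client to read a slurm.conf held in a ConfigMap; the ConfigMap name, namespace, and key are illustrative assumptions, not fixed SUNK conventions.

```python
# Minimal sketch: read a Slurm configuration stored in a Kubernetes ConfigMap.
# Assumes the official `kubernetes` Python client is installed and that a
# ConfigMap named "slurm-config" (hypothetical) holds a "slurm.conf" key.
from kubernetes import client, config

def read_slurm_conf(namespace: str = "slurm", name: str = "slurm-config") -> str:
    """Fetch slurm.conf text from a ConfigMap in the given namespace."""
    config.load_kube_config()  # use config.load_incluster_config() when running inside a pod
    core = client.CoreV1Api()
    cm = core.read_namespaced_config_map(name=name, namespace=namespace)
    return cm.data["slurm.conf"]

if __name__ == "__main__":
    print(read_slurm_conf())
```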
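Slurm's tree topology plugin is driven by a topology.conf file that maps compute nodes to network switches. The sketch below only illustrates that file format from a hypothetical switch-to-node mapping; SUNK derives the actual topology from the cluster itself, so this is not its implementation.

```python
# Rough illustration of the node-to-switch mapping behind Slurm's
# topology/tree plugin. Switch and node names are hypothetical.

def render_topology_conf(switches: dict[str, list[str]], spine: str = "spine01") -> str:
    lines = []
    for switch, nodes in sorted(switches.items()):
        # Each leaf switch lists the compute nodes directly attached to it.
        lines.append(f"SwitchName={switch} Nodes={','.join(nodes)}")
    # A top-level switch ties the leaf switches together.
    lines.append(f"SwitchName={spine} Switches={','.join(sorted(switches))}")
    return "\n".join(lines)

print(render_topology_conf({
    "leaf01": ["gpu-node-[001-016]"],
    "leaf02": ["gpu-node-[017-032]"],
}))
```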
Important features for ML Researchers
- Task plugins: SUNK supports task plugins, which allow ML Researchers to customize job execution and gather metrics, adapting to the unique needs of their workloads.
- Job scheduling and management: ML Researchers can leverage Slurm's powerful job scheduling capabilities within SUNK to manage their training jobs, experiments, and workflows, using the familiar Slurm interface without needing to learn a new system.
- Kubernetes integration: SUNK's integration with Kubernetes provides ML Researchers access to a wide range of tools and services, such as distributed storage, monitoring, and logging.
- Run scripts with s6: ML Researchers can use s6 scripts to install software, configure dependencies, and set up environments on Slurm nodes, ensuring a consistent and reproducible execution environment.
- Run Slurm jobs in containers: ML Researchers can run Slurm jobs inside containers for improved portability, reproducibility, and isolation of execution environments (see the submission sketch after this list).
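To illustrate the familiar Slurm interface, the sketch below composes a batch script and submits it with sbatch, launching the workload in a container via the Pyxis-style srun --container-image option. The partition, GPU count, image, training script, and the availability of Pyxis are assumptions about a particular cluster, not guarantees about every SUNK deployment.

```python
# Minimal sketch: compose and submit a containerized Slurm batch job.
# Assumes `sbatch` is on PATH and that the cluster provides the NVIDIA Pyxis
# plugin, which adds `--container-image` to srun. The partition, GPU count,
# image, and train.py script are hypothetical.
import subprocess
import tempfile

BATCH_SCRIPT = """#!/bin/bash
#SBATCH --job-name=train-demo
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --output=train-demo-%j.out

srun --container-image=nvcr.io/nvidia/pytorch:24.01-py3 \\
     python train.py --epochs 10
"""

def submit() -> str:
    """Write the batch script to a temporary file and submit it with sbatch."""
    with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as f:
        f.write(BATCH_SCRIPT)
        path = f.name
    result = subprocess.run(["sbatch", path], capture_output=True, text=True, check=True)
    return result.stdout.strip()  # e.g. "Submitted batch job 12345"

if __name__ == "__main__":
    print(submit())
```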
Important features for Infrastructure Engineers
- Device plugins: SUNK supports Kubernetes device plugins, enabling Infrastructure Engineers to make system hardware resources, such as NVIDIA GPUs, available to Kubernetes.
- Sidecar containers: Infrastructure Engineers can add sidecar containers to Slurm login and compute nodes to extend their functionality and enhance job execution.
- Linux cgroups: Infrastructure Engineers can use Linux cgroups to enforce resource limits, such as CPU, memory, and GPU usage, ensuring fair allocation and job isolation (a cgroup inspection sketch follows this list).
- Deployment management with CI and GitOps: Infrastructure Engineers can use CI tools and GitOps practices to manage SUNK deployments, keeping clusters up to date and consistent.
- Prolog and epilog scripts: Infrastructure Engineers can use prolog and epilog scripts to prepare the environment before job execution and clean up resources afterward, improving job lifecycle management (see the prolog sketch after this list).
- Lua plugins: Infrastructure Engineers can use Lua scripts to customize job submission and execution, tailoring the behavior of the Slurm cluster to their specific requirements.
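One way to see cgroup enforcement is to read the limits applied to a job step from inside the job itself. The sketch below assumes a cgroup v2 (unified) hierarchy mounted at /sys/fs/cgroup; because the exact cgroup path varies by configuration, it resolves the current process's cgroup from /proc/self/cgroup.

```python
# Sketch: inspect the cgroup v2 limits applied to the current Slurm job step.
# Assumes a unified (cgroup v2) hierarchy mounted at /sys/fs/cgroup; the
# interface files memory.max and cpu.max are standard cgroup v2 names.
from pathlib import Path

def current_cgroup_dir() -> Path:
    # On cgroup v2, /proc/self/cgroup contains a single line like "0::/slurm/...".
    rel = Path("/proc/self/cgroup").read_text().strip().split("::", 1)[1]
    return Path("/sys/fs/cgroup") / rel.lstrip("/")

def read_limit(name: str) -> str:
    path = current_cgroup_dir() / name
    return path.read_text().strip() if path.exists() else "not found"

if __name__ == "__main__":
    print("memory.max:", read_limit("memory.max"))  # bytes, or "max" if unlimited
    print("cpu.max:   ", read_limit("cpu.max"))     # "<quota> <period>", or "max"
```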
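Prolog and epilog scripts are ordinary executables that slurmd runs before and after each job. The following hypothetical Python prolog fails the job start on a node whose GPUs do not respond to nvidia-smi; it only illustrates the mechanism and is not SUNK's own prolog.

```python
#!/usr/bin/env python3
# Hypothetical Slurm prolog: verify GPUs respond before the job starts.
# A non-zero exit from a slurmd prolog drains the node and requeues the job,
# keeping unhealthy nodes out of the researcher's allocation.
import os
import subprocess
import sys

job_id = os.environ.get("SLURM_JOB_ID", "unknown")

try:
    # nvidia-smi exits non-zero if the driver or a GPU is in a bad state.
    subprocess.run(["nvidia-smi"], check=True, capture_output=True, timeout=30)
except (subprocess.CalledProcessError, subprocess.TimeoutExpired, FileNotFoundError) as exc:
    print(f"prolog: GPU health check failed for job {job_id}: {exc}", file=sys.stderr)
    sys.exit(1)

sys.exit(0)
```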