Train on SUNK
Leverage SUNK on CKS to train a model using Slurm
Outline
This long-form tutorial comprises the pages under this section, which are designed to be followed in the order they are numbered.
In this tutorial, you will:
- Set up your first Slurm cluster.
- Submit your first training job, designed to introduce you to SUNK, Slurm, and their supporting utilities.
- Submit a more complex training job, which more closely reflects a real-world scenario.
- Monitor training jobs using CoreWeave Grafana.
Know before you go
Key concepts
Slurm/SUNK
- Job: A compute workload submitted by a user. It can be a small task that needs a single CPU for a few seconds, or a large job that requires thousands of CPUs and GPUs for days or even weeks. (See the example batch script after this list.)
- Login nodes: The entry points for users to access the Slurm cluster. In SUNK they are Kubernetes Pods where users prepare data, submit jobs, and check job status. They are not intended for heavy computation and do not typically have a GPU.
- Compute nodes: The machines that actually execute the jobs submitted by users. They are Kubernetes Pods that run `slurmd` and are mapped to physical Nodes in the cluster.
- Syncer: A Kubernetes Pod that synchronizes the Kubernetes state with the Slurm state.
- Controller: A Kubernetes Pod that runs the Slurm controller, `slurmctld`, and also schedules jobs.
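To make these concepts concrete, here is a minimal sketch of a Slurm batch script that could be submitted from a login node. The job name, partition name (`PARTITION`), and resource values are illustrative placeholders, not part of SUNK itself.

```bash
#!/bin/bash
# hello.sbatch: a minimal Slurm batch job.
#SBATCH --job-name=hello
#SBATCH --partition=PARTITION   # placeholder; list partitions with `sinfo`
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:05:00

# The controller (slurmctld) schedules this job onto a compute node,
# where slurmd runs the command below.
hostname
```

Submit it from the login node with `sbatch hello.sbatch` and check its status with `squeue -u USERNAME`.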
HPC cluster components
- Network: Interconnects nodes for communication and data transfer. A CoreWeave HPC cluster usually has two networks: Ethernet, for normal communications including user sessions and storage, and InfiniBand, used for cases where performance is critical - for example, between nodes in a large training job.
- Observability: Metrics that describe the performance of Slurm jobs and nodes, collected and consolidated for monitoring job and cluster performance using tools such as CoreWeave Grafana.
- Storage: Provides space for data and applications. This can be either traditional file storage, or object storage, such as CoreWeave AI Object Storage.
Preinstalled software
The following tools are preinstalled on SUNK login nodes:
- Miniconda:
  - Initialize Miniconda in your shell using `/opt/conda/bin/conda init bash`.
- Micromamba: A fast, lightweight alternative to conda.
  - Initialize it in your shell with `micromamba shell init; source ~/.bashrc`.
- Java OpenJDK
- `s3cmd` and the `aws` CLI, for interacting with object storage.
  - For large object storage transfers, installing and using `rclone` or `s5cmd` in a container or `conda` environment is recommended (see the sketch after this list).
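As one example of that recommendation, the following sketch creates a conda environment, installs `s5cmd` into it, and copies a dataset from object storage. The environment name, bucket, endpoint URL, and destination path are hypothetical placeholders, and the sketch assumes object storage credentials are already configured in your shell.

```bash
# Initialize conda (once per login node), then create an isolated
# environment for transfer tooling.
/opt/conda/bin/conda init bash && source ~/.bashrc
conda create -y -n transfer -c conda-forge s5cmd
conda activate transfer

# Copy a dataset from object storage. ENDPOINT_URL, BUCKET, and the
# destination path are placeholders for your own values.
s5cmd --endpoint-url ENDPOINT_URL cp "s3://BUCKET/dataset/*" /mnt/data/dataset/
```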
The easiest way to develop on Slurm compute nodes is to create an interactive Slurm session and tunnel in with VS Code.
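A minimal sketch of that workflow, assuming a partition named `PARTITION` and illustrative resource values:

```bash
# Request an interactive shell on a compute node; adjust the resource
# values for your environment.
srun --partition=PARTITION --gpus=1 --cpus-per-task=8 --time=01:00:00 --pty bash

# Once the shell starts on the compute node, develop interactively here,
# or attach VS Code to the session (for example via its remote/tunnel features).
nvidia-smi
```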
Good to know
- With the exception of SSH or `scp` commands, all commands in this tutorial are run on the Slurm login node.
- Variables written in all caps in code examples throughout this tutorial (for example, `USERNAME`) are placeholders. Replace them with your own values when running commands in your own environment (see the example below).
SSH is the preferred method to access Slurm login nodes. However, SSHing directly into Slurm compute nodes to run tasks is strongly discouraged: doing so bypasses Slurm and can interfere with running jobs, cause nodes to drain unintentionally, or lead to a temporary loss of resources. SSH into compute nodes only for debugging purposes.
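For example, the placeholder convention looks like this when connecting to a login node and copying a file to it. `LOGIN_NODE_ADDRESS` and the file name are hypothetical; substitute the values for your own cluster.

```bash
# Run these from your local machine, not the login node.
ssh USERNAME@LOGIN_NODE_ADDRESS                  # open a shell on the login node
scp train.sbatch USERNAME@LOGIN_NODE_ADDRESS:~/  # copy a local script to your home directory
```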
Additional resources
To learn more about Slurm, check out the following resources: