# Train on SUNK

> Multi-part tutorial for training a model on SUNK using Slurm, from cluster setup to job monitoring

This multi-part tutorial walks you through training a model on SUNK end to end, from setting up your first Slurm cluster to monitoring training jobs in Grafana. By the end, you'll have hands-on experience with the workflow most teams use to run training jobs on SUNK, including how to interact with Slurm, submit jobs, and observe their performance.

This tutorial is for users who are new to SUNK or to running Slurm jobs on Kubernetes, and who want a guided introduction before tackling production workloads.

## Outline

This tutorial spans the pages listed under this section. Follow them in numbered order.

In this tutorial, you:

1. [Set up your first Slurm cluster](/products/sunk/tutorials/train-on-sunk/1-set-up-slurm-cluster).
2. [Submit your first training job, designed to introduce you to SUNK, Slurm, and its supporting utilities](/products/sunk/tutorials/train-on-sunk/2-submit-simple-job).
3. [Submit a more complex training job, which more closely reflects a real-world scenario](/products/sunk/tutorials/train-on-sunk/3-submit-a-training-job).
4. [Monitor training jobs using CoreWeave Grafana](/products/sunk/tutorials/train-on-sunk/4-monitor-jobs).

<Card title="What you'll need">
  Before you start, you must have:

  * A working SUNK cluster deployed on CKS with a GPU Node Pool.
  * The following tools installed on your local machine:
    * Git
    * SSH and `scp`
  * Basic familiarity with [Slurm](https://slurm.schedmd.com/overview.html).
</Card>

## Know before you go

Before you start the tutorial, review the key concepts, preinstalled software, and tips described in the following sections.

### Key concepts

The following sections describe the key Slurm, SUNK, and HPC cluster concepts used throughout this tutorial.

#### Slurm and SUNK

* **Job:** A compute workload submitted by a user. It can be a small task, requiring a single CPU for a few seconds, or a large job that requires thousands of CPUs and GPUs for days or even weeks.
* **Login nodes:** The entry points for users to access the Slurm cluster. In SUNK, they are Kubernetes Pods where users prepare data, submit jobs, and check job status. They are not intended for heavy computation and do not typically have a GPU.
* **Compute nodes:** The machines that execute the jobs submitted by users. In SUNK, they are Kubernetes Pods that run `slurmd` and map to physical Nodes in the cluster.
* **Syncer:** A Kubernetes Pod that synchronizes the Kubernetes state with the Slurm state.
* **Controller:** A Kubernetes Pod that runs the Slurm controller daemon, `slurmctld`, which schedules jobs.
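
As a rough illustration of how these pieces interact, the following sketch writes a minimal batch script and submits it from a login node. The file name `hello.sbatch`, the job name, and the resource values are illustrative assumptions, not values taken from this tutorial. `sbatch` hands the job to the controller, which schedules it onto a compute node.

```shell
# Sketch only: hello.sbatch, the job name, and the limits are hypothetical.
# Write a minimal batch script. The #SBATCH lines are read by sbatch.
cat > hello.sbatch <<'EOF'
#!/usr/bin/env bash
#SBATCH --job-name=hello
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --output=hello-%j.out

# Runs on the allocated compute node and prints its hostname.
srun hostname
EOF

if command -v sbatch >/dev/null 2>&1; then
  sinfo                      # list partitions and node states
  sbatch hello.sbatch        # submit the job to the controller
  squeue -u "$(id -un)"      # show your pending and running jobs
else
  echo "Slurm client tools not found; run this on a Slurm login node."
fi
```

When the job finishes, its output is written to `hello-JOBID.out` in the directory you submitted from, where `JOBID` is the ID that `sbatch` printed.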

#### HPC cluster components

* **Network:** Interconnects nodes for communication and data transfer. A CoreWeave HPC cluster usually has two networks. Ethernet handles normal communications including user sessions and storage. InfiniBand handles cases where performance is critical, for example, between nodes in a large training job.
* **Observability:** Metrics that describe the performance of Slurm jobs and nodes, collected and consolidated for monitoring job and cluster performance using tools such as [CoreWeave Grafana](/observability/managed-grafana).
* **Storage:** Provides space for data and applications. This can be either traditional file storage, or object storage, such as [CoreWeave AI Object Storage](/products/storage/object-storage).

### Preinstalled software

The following tools are **preinstalled** on SUNK **login nodes**:

* [Miniconda](https://www.anaconda.com/docs/getting-started/miniconda/main):
  * Initialize Miniconda in your shell using `/opt/conda/bin/conda init bash`.
* [Micromamba](https://mamba.readthedocs.io/en/latest/user_guide/micromamba.html): A fast, lightweight alternative to conda.
  * Initialize Micromamba in your shell using `micromamba shell init; source ~/.bashrc`.
* Java OpenJDK
* `s3cmd` and the AWS CLI (`aws`), for interacting with [object storage](/products/storage/object-storage).
  * For large object storage transfers, install and use `rclone` or `s5cmd` in a container or `conda` environment. When you use `s5cmd` with [AI Object Storage](/products/storage/object-storage), use the [CoreWeave fork of s5cmd](https://github.com/coreweave/s5cmd). See [Migrate data to AI Object Storage](/products/storage/object-storage/migrate-data#migrate-data-with-s5cmd).

<Tip>
  The recommended way to develop on Slurm compute nodes is to create an interactive Slurm session and [tunnel through VS Code](/products/sunk/access_sunk/vs-code-with-slurm).
</Tip>
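
An interactive Slurm session can be sketched as follows. `start_dev_shell` is a hypothetical helper, not part of SUNK, and the default GPU count and time limit are assumptions; whether `--gpus` is accepted depends on how GPUs are configured in your cluster.

```shell
# Sketch only: start_dev_shell is a hypothetical helper with illustrative defaults.
start_dev_shell() {
  gpus="${1:-1}"            # number of GPUs to request
  limit="${2:-01:00:00}"    # wall-clock limit, HH:MM:SS
  if ! command -v srun >/dev/null 2>&1; then
    echo "srun not found; run this on a Slurm login node." >&2
    return 1
  fi
  # --pty attaches your terminal to the shell that Slurm starts on the node.
  srun --nodes=1 --gpus="$gpus" --time="$limit" --pty bash -i
}
```

For example, `start_dev_shell 1 00:30:00` requests one GPU for 30 minutes. Exiting the shell ends the job and releases the resources.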

### Good to know

* Run all commands in this tutorial on the Slurm login node, except for SSH or `scp` commands.
* Variables written in all caps in code examples (for example, `USERNAME`) are placeholders. Replace them with your own values when you run the commands.

<Danger>
  **Do not SSH into compute nodes**

  SSH is the preferred method to access Slurm **login nodes**. However, SSHing directly into Slurm **compute nodes** to run tasks is **strongly discouraged**. Doing so bypasses Slurm and can interfere with running jobs, cause nodes to drain unintentionally, or lead to a temporary loss of resources. SSH into Slurm compute nodes only for debugging purposes.
</Danger>

## Third-party frameworks

This tutorial focuses on a single training workflow, but SUNK supports many ML frameworks. Any framework that runs on Slurm or in Linux containers works on SUNK, including PyTorch, TensorFlow, JAX, DeepSpeed, Megatron-LM, and others.

<Card title="Third-party frameworks" href="/products/cks/clusters/frameworks/introduction" arrow={true}>
  See popular AI and ML frameworks with links to CKS and SUNK guides.
</Card>

## Additional resources

This tutorial assumes basic familiarity with Slurm. To go deeper on Slurm itself, see the following resources:

* [SchedMD's official documentation for Slurm commands](https://slurm.schedmd.com/documentation.html)
* [SchedMD's PDF command cheat sheet](https://slurm.schedmd.com/pdfs/summary.pdf)
