1. Set up a Slurm cluster

Connect to the Slurm login node with SSH

Note

User access to Slurm clusters is managed via customer Identity Providers (IdPs), federated into a cluster-side directory service. Access is granted based on allowed users' public SSH keys. For assistance, please contact CoreWeave support.

Acquire the IP address or DNS record of the login node

The slurm-login Service on your CKS cluster provides information about how to connect to the login node, including the external IP address or DNS record, if there is one.

To get the login service's IP address or DNS record, use kubectl get svc slurm-login.

Obtain the External IP address
$
kubectl get svc slurm-login
Example output
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
slurm-login LoadBalancer 192.0.2.100 203.0.113.100 22/TCP 2d21h
The `EXTERNAL-IP` field in the output is the target IP address for SSH access.
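If you want to capture the address in a shell variable for scripting, you can parse the table output with awk. A sketch, using the saved example output above; in practice, pipe the live `kubectl get svc slurm-login` output into the same awk command:

```shell
# Sketch: capture the EXTERNAL-IP column in a variable for scripting.
# Here we parse saved example output; in practice, pipe the live
# `kubectl get svc slurm-login` output into the same awk command.
svc_output='NAME          TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)   AGE
slurm-login   LoadBalancer   192.0.2.100   203.0.113.100   22/TCP    2d21h'
external_ip=$(printf '%s\n' "$svc_output" | awk 'NR==2 {print $4}')
echo "$external_ip"
```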

SSH in to the login node using either this IP address, or the DNS record created for the login node if there is one.

Log in with SSH
$
ssh -i /path/to/ed25519_private_key USERNAME@EXTERNAL-IP
See an example: SSH

In this example, the username is exampleuser, and the target IP address is 203.0.113.100.

Example: Log in with SSH
$
ssh -i ~/.ssh/ed25519-key [email protected]

On successful login, a welcome message will be presented, along with a command prompt:

Example
Welcome to a CoreWeave Slurm HPC Cluster
USERNAME@slurm-login-0:~$

You are now logged into the Slurm login node, and from here you can start to run Slurm commands.
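As an optional convenience, you can add a host entry to your local ~/.ssh/config so that a short alias suffices. This is a sketch reusing the example values above; the alias name slurm-login is an assumption:

```
Host slurm-login
    HostName 203.0.113.100
    User exampleuser
    IdentityFile ~/.ssh/ed25519-key
```

With this entry in place, `ssh slurm-login` connects without further flags.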

Verify the Slurm cluster

To verify that your Slurm cluster is working as expected, run a simple Slurm job. A simple test is to print the hostname on six nodes using srun:

Example: Find the hostname on 6 nodes
$
srun -N 6 hostname
Example output
slurm-rtx4000-3
slurm-rtx4000-1
slurm-rtx4000-0
slurm-cpu-epyc-0
slurm-cpu-epyc-1
slurm-rtx4000-2
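The same verification can also be submitted as a batch job with sbatch. Below is a minimal sketch of a batch script; the node count matches the srun example above, while the job name and output filename pattern are illustrative assumptions:

```shell
# Write a minimal batch script that runs `hostname` on 6 nodes.
# The job name and output path are illustrative assumptions.
cat > hostname_job.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=hostname-test
#SBATCH --nodes=6
#SBATCH --output=hostname-%j.out
srun hostname
EOF
```

Submit it with `sbatch hostname_job.sbatch`, then check the hostname-<jobid>.out file for the list of nodes.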

Verify Data Access

Before you can run a training job, you'll need to transfer your data into the cluster's persistent storage. POSIX-compliant shared storage on your cluster is persistent Distributed File Storage (DFS). Anything stored in DFS is available across all the nodes of your Slurm cluster, and remains even if nodes are replaced.

Important

Any data placed on a node outside DFS will not persist through node restarts or replacements.

Shared DFS storage is mounted on login and compute nodes. Typically, DFS volumes are mounted under /mnt, in directories such as /mnt/home and /mnt/data.

Check available DFS space

See how much space is available with df -H.

See an example: df -H

In this example, the mount directories are /mnt/data and /mnt/home.

Example: Check available space
$
df -H /mnt/data /mnt/home
Example output
Filesystem Size Used Avail Use% Mounted on
100.121.2.187:/k8s/pvc-7236f9d2-d948-42b1-b909-ff634723fcc2 11T 0 11T 0% /mnt/data
100.121.2.182:/k8s/pvc-f1ada11a-11f3-4b73-84f0-8b01e8c4bfae 1.1T 4.2M 1.1T 1% /mnt/home

Copy files from the local machine to the Slurm cluster

Recommended tools for transferring data are scp or rsync.

Using rsync is typically faster when working with large directories. If a transfer is interrupted, rsync can pick up where it left off when you reissue the command.

See an example: rsync and scp

To copy data from the local machine to the remote machine, use these commands:

Using rsync:

Example
$
rsync -avz -e "ssh -i ~/.ssh/id_ed25519" ./data/ USERNAME@IP_ADDRESS:/remote/path/

Using scp:

Example
$
scp -i /path/to/ed_25519-private-key /path/to/local-file-to-copy USERNAME@IP_ADDRESS:/remote/path

To copy data from the remote machine to the local machine, reverse the commands:

Using rsync:

Example
$
rsync -avz -e "ssh -i ~/.ssh/id_ed25519" USERNAME@IP_ADDRESS:/remote/path/ local-path

Using scp:

Example
$
scp -i /path/to/ed_25519-private-key USERNAME@IP_ADDRESS:remote-file-to-copy local-path

Install software on the Slurm cluster

Important

It is not considered best practice to install software directly onto Slurm compute nodes. Instead, use Pyxis with Enroot, a container-based solution for managing and isolating software environments.

Slurm runs on top of CoreWeave Kubernetes Service (CKS), where the operating system runs in an ephemeral container. Because of this, system software installed on the operating system disk of login and compute pods will not persist through reboots.

There are several methods to install software permanently on the cluster, each of which is best suited to different use cases.

Pyxis/enroot: Recommended for cases where a lot of software is required on compute nodes.
s6-overlay: Used to install software on Slurm login and compute nodes as part of the Slurm on Kubernetes deployment. Recommended for installing development tools, such as text editors and git.
Distributed File System (DFS): Recommended for persistent storage.
Tip

CoreWeave AI Object Storage is a highly recommended, S3-compatible object storage solution, especially well suited to storing and loading model data and checkpoints thanks to its exceptional performance, Node-local caching on GPUs, and cross-Zone support.

Install software as containers using the Pyxis/enroot environment

Running high-performance code on GPUs has its own specific challenges. To make this experience as seamless as possible, CoreWeave leverages Pyxis, a container environment developed by NVIDIA specifically as a Slurm plugin for GPU-accelerated environments.

Pyxis uses enroot to run containers, allowing unprivileged cluster users to run tasks in containers using srun. This provides a safer, superior way to manage software on Slurm nodes by encapsulating software environments, additionally making them easy to reproduce.

Using Pyxis in tandem with Slurm enables interactive development in your container environment. This section includes examples of how to do this from an interactive shell on a compute node, using enroot and Pyxis with Slurm.

Develop containers with Pyxis

Set credentials to pull Pyxis containers

To pull containers from protected repositories, you'll need to set the appropriate credentials. First, create an enroot directory, then create a .credentials file within that directory:

Create the credentials directory
$
mkdir -p ~/.config/enroot
$
touch ~/.config/enroot/.credentials

Add repository credentials to this file using netrc file format.

See examples: Credential formats

NVIDIA NGC: Generate an NGC API key if you do not already have one. Then, open ~/.config/enroot/.credentials and enter the following in the file, replacing <API KEY> with your NGC key:

NGC example
machine nvcr.io login $oauthtoken password <API_KEY>

Docker Hub: Add the following information into your ~/.config/enroot/.credentials file, replacing <LOGIN> with your Docker Hub login and <PASSWORD> with your password.

Docker Hub example
machine auth.docker.io login <LOGIN> password <PASSWORD>
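The steps above can be scripted. A sketch that creates the NGC credentials file with owner-only permissions; `<API_KEY>` is a placeholder that you must replace with your real key:

```shell
# Create the enroot credentials file with restrictive permissions.
# <API_KEY> is a placeholder, not a real key.
mkdir -p ~/.config/enroot
printf 'machine nvcr.io login $oauthtoken password <API_KEY>\n' \
  > ~/.config/enroot/.credentials
chmod 600 ~/.config/enroot/.credentials
```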

If enroot has trouble finding your credentials, export the ENROOT_CONFIG_PATH variable to point to the directory where your credentials are stored, ideally in your .bashrc file, so that it is set persistently:

Example
$
export ENROOT_CONFIG_PATH=${HOME}/.config/enroot/

Pull and modify a container using enroot

Pulling and modifying containers on compute nodes can be useful for debugging or for creating a container image for the first time.

First, create an interactive session on a compute node. In this case, we are requesting an interactive session on an exclusive H100 node in the H100 partition.

Example
$
srun -p h100 --exclusive --pty bash -i

Next, use enroot to import an image from Docker Hub.

Example
$
enroot import docker://ubuntu

The import command saves the image as a squash file (.sqsh). Use the create command to unpack that squash file into a container root filesystem.

Example
$
enroot create ubuntu.sqsh

Finally, run enroot with the start command using the --rw flag. This makes the container root system readable and writable, ensuring any filesystem changes that are made after starting the container are persistent.

Example
$
enroot start --rw ubuntu

You can also mount local files on the container using the -m flag.

Info

For more information, see the official enroot documentation.

Pull and modify a container using Slurm

Tip

Using Slurm to pull and modify containers is recommended when containers are consistently running as expected. If you are experiencing difficulties running containers with Slurm, try using enroot directly instead as described above.

To pull and modify a container using Slurm, run srun directly from the login node.

Pull your container and save it as a squash file. In this example, we pull the latest PyTorch container from CoreWeave and save it as a squash file.

Example
$
srun \
--container-image=ghcr.io#coreweave/ml-containers/nightly-torch:es-actions-8e29075-base-25011003-cuda12.6.3-ubuntu22.04-torch2.7.0a0-vision0.22.0a0-audio2.6.0a0 \
--container-save=/mnt/home/username/nightly-torch.sqsh \
echo "hello world"

Use the --container-save flag to specify where to save the container; the echo command simply gives srun a command to execute.

Mount the container

Using --container-mounts, a local filesystem can be mounted into the container. In the example below, /mnt/home is mounted from the local cluster to /mnt/home in the container, and an interactive job is launched on a GPU node.

Example
$
srun -C gpu \
--container-image=/mnt/home/username/nightly-torch.sqsh \
--container-mounts=/mnt/home:/mnt/home \
--pty bash -i

Executing this command launches an interactive development shell on a GPU node. In this example, a test script (test_script.py) stored in /mnt/home/username contains a test for the CUDA installation.

CUDA installation test
import torch
print(torch.version.cuda)
print(torch.cuda.device_count())
print(torch.cuda.get_device_name())

This is tested interactively by running:

Run the CUDA installation test
$
python /mnt/home/username/test_script.py

Exiting the interactive shell does not delete the container or the squash file, and the test file test_script.py persists in /mnt/home/username. However, no changes to the root filesystem of the container files themselves are saved.

In the example command below, we launch an interactive session with the nightly-torch container, specifying that we will save it (using --container-save) as new-nightly-torch.sqsh. Now, changes can be made to the container itself, and they will persist in the new squash filesystem, new-nightly-torch.sqsh, after exiting the container.

Example
$
srun -C gpu \
--container-image=/mnt/home/username/nightly-torch.sqsh \
--container-mounts=/mnt/home:/mnt/home \
--container-save /mnt/home/username/new-nightly-torch.sqsh \
--pty bash -i

Install software as part of the Slurm on Kubernetes deployment using s6-overlay

Tip

As a best practice, we recommend installing the same packages on compute and login nodes in order to avoid user confusion. We also recommend limiting installed packages to usability applications such as git, text editors, and simple command line tools. This is configured where Slurm application values are configured, usually in a GitOps repository maintained by a Slurm administrator.

s6-overlay can be used to install software on the Slurm login and compute nodes at cluster initialization time, or to create long-running jobs, such as services.

For compute nodes, script details are defined in the compute.s6 section of the Slurm values.yaml file. For login nodes, script details are defined in the login.s6 section of the Slurm values.yaml file.

Each script needs a name, a type, and then the script itself in bash.

See an example: Installing rclone using s6-overlay
Example compute.s6 stanza
compute:
  s6:
    packages:
      type: oneshot
      script: |
        #!/usr/bin/env bash
        apt-get update
        apt-get install -y rclone screen git

Install software in a persistent Distributed File Storage (DFS) mounted directory

Software installed in mounted DFS directories, such as your home directory or /mnt/data, persists on nodes through reboots or replacements.

Important

In general, it is highly recommended to install only software that does not require root access or administrative privileges.

See an example: Conda

Python environments can often be installed without root privileges.

In this example, a conda environment called myenv using Python version 3.11 is created in a persistent DFS mounted directory at /mnt/data.

First, initialize conda to use bash with conda init. Then, source your .bashrc file.

Example
$
/opt/conda/bin/conda init bash
$
source ~/.bashrc

Next, create the conda environment with the desired specifications, where --prefix targets the location in which to store the environment.

Example
$
conda create --prefix /mnt/data/myenv python=3.11

Finally, activate the environment.

Example
$
conda activate /mnt/data/myenv
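To use this persistent environment inside a batch job, the activation steps can go at the top of the job script. A sketch; the conda installation path /opt/conda matches the init example above, while the job name and script contents are assumptions:

```shell
# Write a batch script that activates the persistent conda env.
# /opt/conda is assumed to be the cluster's conda install path.
cat > conda_job.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=conda-test
source /opt/conda/etc/profile.d/conda.sh
conda activate /mnt/data/myenv
python --version
EOF
```

Submit it with `sbatch conda_job.sbatch`; the job runs Python from the environment stored in DFS rather than from the node's ephemeral disk.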