1. Set up a Slurm cluster
Connect to the Slurm login node with SSH
User access to Slurm clusters is managed via customer Identity Providers (IdPs), federated into a cluster-side directory service. Access is granted based on allowed users' public SSH keys. For assistance, please contact CoreWeave support.
Acquire the IP address or DNS record of the login node
The slurm-login Service on your CKS cluster provides information about how to connect to the login node, including the external IP address or DNS record, if there is one.
To get the login Service's IP address or DNS record, use kubectl get svc slurm-login.
$kubectl get svc slurm-login
NAME          TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)   AGE
slurm-login   LoadBalancer   192.0.2.100   203.0.113.100   22/TCP    2d21h
The `EXTERNAL-IP` field in the output is the target IP address for SSH access.
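If you need only the address for scripting, one possible approach is kubectl's JSONPath output. This is a generic sketch; if your load balancer publishes a DNS hostname instead of an IP, query .hostname rather than .ip:
$kubectl get svc slurm-login -o jsonpath='{.status.loadBalancer.ingress[0].ip}'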
SSH in to the login node using either this IP address or the DNS record created for the login node, if there is one.
$ssh -i /path/to/ed25519_private_key USERNAME@EXTERNAL-IP
See an example: SSH
In this example, the username is exampleuser, and the target IP address is 203.0.113.100.
$ssh -i ~/.ssh/ed25519-key [email protected]
On successful login, a welcome message will be presented, along with a command prompt:
Welcome to a CoreWeave Slurm HPC Cluster
USERNAME@slurm-login-0:~$
You are now logged into the Slurm login node, and from here you can start to run Slurm commands.
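Optionally, you can add an entry to your local ~/.ssh/config so future logins don't require the key path and address each time. This is a standard OpenSSH sketch; the slurm alias and the values shown are placeholders:
Host slurm
    HostName 203.0.113.100
    User exampleuser
    IdentityFile ~/.ssh/ed25519-key
With this in place, ssh slurm connects directly.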
Verify the Slurm cluster
To verify that your Slurm cluster is working as expected, run a simple Slurm job. One simple check is to print the hostname on six nodes using srun:
$srun -N 6 hostname
slurm-rtx4000-3
slurm-rtx4000-1
slurm-rtx4000-0
slurm-cpu-epyc-0
slurm-cpu-epyc-1
slurm-rtx4000-2
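You can also inspect node and partition state with sinfo. The output below is illustrative only, assuming partitions named after the node types above:
$sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
rtx4000      up   infinite      4   idle slurm-rtx4000-[0-3]
cpu-epyc     up   infinite      2   idle slurm-cpu-epyc-[0-1]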
Verify Data Access
Before you can run a training job, you'll need to transfer your data into the cluster's persistent storage. POSIX-compliant shared storage on your cluster is persistent Distributed File Storage (DFS). Anything stored in DFS is available across all the nodes of your Slurm cluster, and remains even if nodes are replaced.
Any data placed on a node outside DFS will not persist through node restarts or replacements.
Shared DFS storage is mounted on login and compute nodes. Typically, data are mounted in /mnt, in directories such as /mnt/home and /mnt/data.
Check available DFS space
See how much space is available with df -H.
See an example: df -H
In this example, the mount directories are /mnt/data and /mnt/home.
$df -H /mnt/data /mnt/home
Filesystem                                                    Size  Used Avail Use% Mounted on
100.121.2.187:/k8s/pvc-7236f9d2-d948-42b1-b909-ff634723fcc2   11T     0   11T   0% /mnt/data
100.121.2.182:/k8s/pvc-f1ada11a-11f3-4b73-84f0-8b01e8c4bfae  1.1T  4.2M  1.1T   1% /mnt/home
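To confirm write access before a large transfer, a quick sketch (the path assumes your user has a directory under /mnt/home):
$touch /mnt/home/$USER/.dfs-write-test && echo "write OK"
$rm /mnt/home/$USER/.dfs-write-test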
Copy files from the local machine to the Slurm cluster
Recommended tools for transferring data are scp or rsync.
Using rsync is typically faster when working with large directories. If a transfer is interrupted, reissuing the same rsync command picks up where it left off.
See an example: rsync and scp
To copy data from the local machine to the remote machine, use these commands:
Using rsync:
$rsync -avz -e "ssh -i ~/.ssh/id_ed25519" ./data/ USERNAME@IP_ADDRESS:/remote/path/
Using scp:
$scp -i /path/to/ed25519-private-key /path/to/local-file-to-copy USERNAME@IP_ADDRESS:/remote/path
To copy data from the remote machine to the local machine, reverse the commands:
Using rsync:
$rsync -avz -e "ssh -i ~/.ssh/id_ed25519" USERNAME@IP_ADDRESS:/remote/path/ local-path
Using scp:
$scp -i /path/to/ed25519-private-key USERNAME@IP_ADDRESS:remote-file-to-copy local-path
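For large directory trees, rsync's --partial and --progress flags keep partially transferred files and show per-file progress, which makes interrupted transfers cheaper to resume. A sketch, using the same placeholder paths as above:
$rsync -avz --partial --progress -e "ssh -i ~/.ssh/id_ed25519" ./data/ USERNAME@IP_ADDRESS:/remote/path/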
Install software on the Slurm cluster
It is not considered best practice to install software directly onto Slurm compute nodes. Instead, use Pyxis with Enroot, a container-based solution for managing and isolating software environments.
Slurm runs on top of CoreWeave Kubernetes Service (CKS), where the operating system runs in an ephemeral container. Because of this, system software installed on the operating system disk of login and compute pods will not persist through reboots.
There are several methods to install software permanently on the cluster, each best suited to different use cases.
Pyxis/enroot | s6-overlay | Distributed File System (DFS)
---|---|---
Recommended for cases where a lot of software is required on compute nodes. | Used to install software on Slurm login and compute nodes as part of the Slurm on Kubernetes deployment. Recommended for installing development tools, such as text editors and git. | Recommended for persistent storage.
CoreWeave AI Object Storage is a highly recommended, S3-compatible solution for object storage, especially for storing and loading model data and checkpoints, due to its exceptional performance, node-local caching on GPUs, and cross-Zone support.
Install software as containers using the Pyxis/enroot environment
Running high-performance code on GPUs has its own specific challenges. To make this experience as seamless as possible, CoreWeave leverages Pyxis, a container environment developed by NVIDIA specifically as a plugin for use with Slurm in GPU-accelerated environments.
Pyxis uses enroot to run containers, allowing unprivileged cluster users to run tasks in containers using srun. This provides a safer, more reproducible way to manage software on Slurm nodes by encapsulating software environments.
Using Pyxis in tandem with Slurm enables interactive development in your container environment. This section shows some examples of how to do this using an interactive shell on a compute node with enroot and Pyxis under Slurm.
Develop containers with Pyxis
Set credentials to pull Pyxis containers
To pull containers from protected repositories, you'll need to set the appropriate credentials. First, create an enroot directory, then create a .credentials file within that directory:
$mkdir -p ~/.config/enroot
$touch ~/.config/enroot/.credentials
Add repository credentials to this file using netrc file format.
See examples: Credential formats
NVIDIA NGC: Generate an NGC API key if you do not already have one. Then, open ~/.config/enroot/.credentials and enter the following in the file, replacing <API_KEY> with your NGC key:
machine nvcr.io login $oauthtoken password <API_KEY>
Docker Hub: Add the following information to your ~/.config/enroot/.credentials file, replacing <LOGIN> with your Docker Hub login and <PASSWORD> with your password.
machine auth.docker.io login <LOGIN> password <PASSWORD>
If enroot has trouble finding your credentials, export the ENROOT_CONFIG_PATH variable to point to the directory where your credentials are stored, ideally in your .bashrc file, so that it is set persistently:
$export ENROOT_CONFIG_PATH=${HOME}/.config/enroot/
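To confirm that enroot can authenticate, try importing an image from the protected registry. The tag below is only an example; substitute any image your account can access:
$enroot import docker://nvcr.io#nvidia/cuda:12.2.0-base-ubuntu22.04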
Pull and modify a container using enroot
Pulling and modifying containers on compute nodes can be useful for debugging or for creating a container image for the first time.
First, create an interactive session on a compute node. In this case, we are requesting an interactive session on an exclusive H100 node in the H100 partition.
$srun -p h100 --exclusive --pty bash -i
Next, use enroot to import an image from Docker Hub.
$enroot import docker://ubuntu
The import step saves the image as a squash file (.sqsh). Next, use the create command to create a container from the squash file.
$enroot create ubuntu.sqsh
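To confirm the container was created, list the containers enroot knows about:
$enroot list
ubuntu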
Finally, run enroot with the start command using the --rw flag. This makes the container root filesystem readable and writable, so any filesystem changes made after starting the container persist.
$enroot start --rw ubuntu
You can also mount local files into the container using the -m flag.
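For example, to expose a shared data directory inside the container (the paths here are illustrative):
$enroot start --rw -m /mnt/data:/data ubuntu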
For more information, see the official enroot documentation.
Pull and modify a container using Slurm
Using Slurm to pull and modify containers is recommended when containers are consistently running as expected. If you are experiencing difficulties running containers with Slurm, try using enroot directly instead as described above.
To pull and modify a container using Slurm, first create an interactive bash session with Slurm.
From the login node, pull your container and save it as a squash file. In this example, we pull the latest PyTorch container from CoreWeave:
$srun \
  --container-image=ghcr.io#coreweave/ml-containers/nightly-torch:es-actions-8e29075-base-25011003-cuda12.6.3-ubuntu22.04-torch2.7.0a0-vision0.22.0a0-audio2.6.0a0 \
  --container-save=/mnt/home/username/nightly-torch.sqsh \
  echo "hello world"
Use the --container-save flag to specify where to save the container; the echo command gives srun a command to execute.
Mount the container
Using container-mounts, a local filesystem can be mounted into the container. In the example below, /mnt/home
is mounted from the local cluster to /mnt/home
on the container, and an interactive job is launched on a GPU node.
$srun -C gpu \
  --container-image=/mnt/home/username/nightly-torch.sqsh \
  --container-mounts=/mnt/home:/mnt/home \
  --pty bash -i
Executing this command launches an interactive development shell on a GPU node. In this example, a test script (test_script.py) in /mnt/home/username contains a quick check of the CUDA installation.
import torch
print(torch.version.cuda)
print(torch.cuda.device_count())
print(torch.cuda.get_device_name())
This is tested interactively by running:
$python /mnt/home/username/test_script.py
Exiting the interactive shell does not delete the container or the squash file, and the test file test_script.py persists in /mnt/home/username. However, changes to the root filesystem of the container itself are not saved.
In the example command below, we launch an interactive session with the nightly-torch container, specifying that we will save it (using --container-save) as new-nightly-torch.sqsh. Now changes can be made to the container itself, and they will persist in the new squash filesystem, new-nightly-torch.sqsh, after exiting the container.
$srun -C gpu \
  --container-image=/mnt/home/username/nightly-torch.sqsh \
  --container-mounts=/mnt/home:/mnt/home \
  --container-save /mnt/home/username/new-nightly-torch.sqsh \
  --pty bash -i
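Once the container behaves as expected, the saved squash file can also be used non-interactively. A minimal batch script sketch, assuming the paths above and the gpu node feature used earlier (the job name is a placeholder):
#!/usr/bin/env bash
#SBATCH --job-name=torch-test
#SBATCH -C gpu
#SBATCH --nodes=1

srun --container-image=/mnt/home/username/new-nightly-torch.sqsh \
     --container-mounts=/mnt/home:/mnt/home \
     python /mnt/home/username/test_script.py
Submit it with sbatch from the login node.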
Install software as part of the Slurm on Kubernetes deployment using s6-overlay
As a best practice, we recommend installing the same packages on compute and login nodes to avoid user confusion. We also recommend limiting installed packages to usability applications such as git, text editors, and simple command-line tools. This is configured where the Slurm application values are configured, usually in a GitOps repository maintained by a Slurm administrator.
S6-overlay can be used to install software on the Slurm login and compute nodes at cluster initialization time, or to create long-running jobs, such as services.
For compute nodes, script details are defined in the compute.s6 section of the Slurm values.yaml file. For login nodes, script details are defined in the login.s6 section of the same file.
Each script needs a name, a type, and then the script itself in bash.
See an example: Installing rclone using s6-overlay
compute:
  s6:
    packages:
      type: oneshot
      script: |
        #!/usr/bin/env bash
        apt-get update
        apt -y install rclone screen git
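After the cluster redeploys with these values, you might verify that the packages landed on the compute nodes, for example (the output paths are illustrative):
$srun -N 2 which rclone
/usr/bin/rclone
/usr/bin/rclone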
Install software in a persistent Distributed File Storage (DFS) mounted directory
Software installed in mounted DFS directories, such as your home directory or /mnt/data, persists on nodes through reboots or replacements.
In general, only follow installation instructions that do not require root access or administrative privileges.
See an example: Conda
Python environments can often be installed without root privileges.
In this example, a conda environment called myenv using Python version 3.11 is created in a persistent DFS-mounted directory at /mnt/data.
First, initialize conda to use bash with conda init. Then, source your .bashrc file.
/opt/conda/bin/conda init bash
source ~/.bashrc
Next, create the conda environment with the desired specifications, where --prefix sets the location in which to store the environment.
conda create --prefix /mnt/data/myenv python=3.11
Finally, activate the environment.
conda activate /mnt/data/myenv
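Because the environment lives on DFS, every node can use it. A sketch of running it under Slurm, assuming conda is installed at /opt/conda on the compute image:
srun -N 1 bash -c 'source /opt/conda/etc/profile.d/conda.sh && conda activate /mnt/data/myenv && python --version'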