Prerequisites
Before you begin, make sure you have the following:
- Access to a SUNK cluster running on CoreWeave Kubernetes Service (CKS).
- A user account on the Slurm cluster, with your SSH public key registered through your organization’s identity provider. If you don’t have access, contact CoreWeave support.
- kubectl access to the CKS cluster hosting Slurm, so you can look up the login node’s external address.
- An SSH client on your local machine, along with the matching private key.
Connect to the Slurm login node with SSH
In this section, you locate the address of the Slurm login node and open an SSH session to it. The login node is the entry point for submitting Slurm jobs and managing data on the cluster.
Customer identity providers (IdPs) manage user access to Slurm clusters, federated into a cluster-side directory service. CoreWeave grants access based on allowed users’ public SSH keys. For assistance, contact CoreWeave support.
Acquire the IP address or DNS record of the login node
The slurm-login Service on your CKS cluster provides information about how to connect to the login node, including the external IP address or DNS record, if there is one.
To get the login service’s IP address or DNS record, use kubectl get svc slurm-login.
Obtain the external IP address
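As a minimal sketch, the lookup described above can be wrapped in a small shell helper. It assumes that slurm-login is exposed as a LoadBalancer Service in your current kubectl namespace. The function name login_address is hypothetical and is not part of SUNK.

```shell
# Hypothetical helper (not part of SUNK): print the external address of the
# slurm-login Service. Assumes a LoadBalancer Service in the current namespace.
login_address() {
  # A load balancer reports either an IP or a hostname; print whichever is set.
  kubectl get svc slurm-login \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}{.status.loadBalancer.ingress[0].hostname}'
}

# To see the full table, including the EXTERNAL-IP column, run:
#   kubectl get svc slurm-login
```

If the helper prints nothing, the Service might not have an external address yet, or it might live in a different namespace.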
The EXTERNAL-IP field in the output is the target IP address for SSH access.
Log in with SSH
In this example, the username is exampleuser, and the target IP address is 203.0.113.100.
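With the example values named in this section, the login can be sketched as follows. The variable and function names are hypothetical. The ssh call is wrapped in a function only so that you can load the snippet without opening a connection right away.

```shell
# Example values from this section; replace them with your own.
SLURM_USER="exampleuser"
LOGIN_ADDRESS="203.0.113.100"

# Hypothetical wrapper: open an SSH session on the Slurm login node.
# Extra arguments, such as -i with a private key path, are passed through to ssh.
slurm_login() {
  ssh "$@" "${SLURM_USER}@${LOGIN_ADDRESS}"
}

# Usage: slurm_login
```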
Verify the Slurm cluster
With an SSH session open on the login node, the next step is to confirm that Slurm itself is healthy and that compute nodes are reachable. To verify that your Slurm cluster is working as expected, run a basic Slurm job.
Check for available nodes
First, check how many nodes you have available with sinfo:
List nodes in idle or mix states
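A minimal sketch of the query follows, using the -N and --states= flags that this section describes. The function names are hypothetical. The counting helper also relies on sinfo’s --noheader and --format options to print one node name per line, which goes beyond what this guide shows.

```shell
# List each node in the idle or mix state, one line per node and partition.
list_available_nodes() {
  sinfo -N --states=idle,mix
}

# Count distinct available nodes. A node that belongs to several partitions
# appears once per partition, so duplicates are removed before counting.
count_available_nodes() {
  sinfo -N --noheader --states=idle,mix --format='%N' | sort -u | wc -l
}
```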
The -N flag instructs sinfo to list each node individually. You can add criteria to this query with the --states= flag.
--states=idle,mix limits the output to nodes in the idle and mix states. Nodes in these states are available to run workloads.
sinfo returns a list of nodes that match your criteria. In this example, sinfo lists six available nodes.
Submit an interactive job
srun is the Slurm command that submits an interactive job to the Slurm cluster. Use it to discover the hostname of each available node.
In the following command, replace [AVAILABLE-NODES] with the number of available nodes in your Slurm cluster:
Example: Find the hostname on a specified number of nodes
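The command can be sketched as a small helper that takes the node count as an argument in place of the [AVAILABLE-NODES] placeholder. The function name hostname_on_nodes is hypothetical.

```shell
# Hypothetical helper: run hostname on the requested number of nodes.
# Pass the number of nodes that sinfo reported as available.
hostname_on_nodes() {
  local node_count="$1"
  srun -N "${node_count}" hostname
}

# Example with six available nodes: hostname_on_nodes 6
```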
The -N flag, when used with srun, requests the specified number of Slurm nodes. For example, -N 6 requests six nodes to run a job.
If you request more nodes than are currently available, srun remains in a Pending state until all of the requested nodes become available.
The hostname command runs on each requested node and prints the name of that machine.
Add yourself as a Slurm user
As root, run the preceding command for each account you need to be added to.
After completing this section, you have confirmed that the Slurm controller can see available compute nodes and that you can submit interactive jobs to them.
Verify data access
Now that the cluster is reachable and jobs can run, the next prerequisite for training is making sure your data lives in persistent storage that every node can read. Before you can run a training job, you’ll need to transfer your data into the cluster’s persistent storage.
The POSIX-compliant shared storage on your cluster is persistent Distributed File Storage (DFS). Anything stored in DFS is available across all the nodes of your Slurm cluster, and remains even when nodes are replaced. Shared DFS storage is mounted on login and compute nodes. Data is usually mounted in /mnt, in directories such as /mnt/home and /mnt/data.
Check available DFS space
See how much space is available with df -H.
In this example, the mount directories are /mnt/data and /mnt/home.
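The check can be sketched as a helper that runs df -H against each mount directory and reports any directory that is missing. The function name report_space is hypothetical, and the paths assume the /mnt/data and /mnt/home layout used in this example.

```shell
# Hypothetical helper: print df -H for each directory, or fail if one is missing.
report_space() {
  local dir
  for dir in "$@"; do
    if [ -d "${dir}" ]; then
      df -H "${dir}"
    else
      echo "${dir}: directory not found" >&2
      return 1
    fi
  done
}

# On the login node: report_space /mnt/data /mnt/home
```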
Copy files from the local machine to the Slurm cluster
The recommended tools for transferring data are scp and rsync.
rsync is faster for large directories. If interrupted, pick up where you left off by reissuing the command.
To copy data from the local machine to the remote machine, use rsync or scp. To copy data from the remote machine to the local machine, reverse the source and destination in the same commands.
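Both directions can be sketched with rsync and scp as follows. The function names and the paths in the usage comment are hypothetical, and the remote address reuses the example values from the SSH section. The rsync flags shown are a common choice, not necessarily the ones this guide originally used.

```shell
# Example values from the SSH section; replace them with your own.
REMOTE="exampleuser@203.0.113.100"

# Local to remote. -a preserves file attributes, -v is verbose, and -P keeps
# partially transferred files and shows progress.
upload_rsync() { rsync -avP "$1" "${REMOTE}:$2"; }
upload_scp()   { scp -r "$1" "${REMOTE}:$2"; }

# Remote to local: the same commands with source and destination swapped.
download_rsync() { rsync -avP "${REMOTE}:$1" "$2"; }
download_scp()   { scp -r "${REMOTE}:$1" "$2"; }

# Example (hypothetical paths): upload_rsync ./dataset/ /mnt/data/dataset/
```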
Install software on the Slurm cluster
With data in place, the final preparation step is making sure the software your training job needs is available on compute nodes in a way that survives reboots and node replacements. This section explains the recommended approaches and why they exist.
Slurm runs on top of CoreWeave Kubernetes Service (CKS), where the operating system runs in an ephemeral container. Because of this, system software installed on the operating system disk of login and compute pods doesn’t persist through reboots. Several methods can install software permanently on the cluster, each suited to different use cases.
| Pyxis and enroot | s6-overlay | Distributed File System (DFS) |
|---|---|---|
| Recommended for cases where compute nodes require a lot of software. | Installs software on Slurm login and compute nodes as part of the SUNK deployment. Recommended for installing development tools, such as text editors and git. | Recommended for persistent storage. |
Install software as containers using Pyxis and enroot
Running high-performance code on GPUs presents its own challenges. CoreWeave uses Pyxis, a container environment developed by NVIDIA as a plugin for use with Slurm in GPU-accelerated environments. Pyxis uses enroot to run containers, which lets unprivileged cluster users run tasks in containers with srun. This provides a safer way to manage software on Slurm nodes by encapsulating software environments, and makes them reproducible.
Pyxis with Slurm enables interactive development in your container environment. The following sections describe how to do this with an interactive shell on a compute node using enroot and Pyxis with Slurm.
Set credentials to pull Pyxis containers
To pull containers from protected repositories, set the appropriate credentials. First, create an enroot directory, then create a .credentials file within that directory:
Create the credentials directory
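A minimal sketch of this step follows, assuming the ~/.config/enroot location that this guide uses for the credentials file. The function name is hypothetical, and the chmod line is an added precaution that doesn’t come from this guide.

```shell
# Hypothetical helper: create the enroot configuration directory and an empty
# .credentials file inside it. Defaults to ~/.config/enroot.
create_enroot_credentials() {
  local config_dir="${1:-${HOME}/.config/enroot}"
  mkdir -p "${config_dir}"
  touch "${config_dir}/.credentials"
  # Assumption, not from this guide: the file holds secrets, so limit access.
  chmod 600 "${config_dir}/.credentials"
}

# Usage: create_enroot_credentials
```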
The credentials file uses the netrc file format.
Credential formats
NVIDIA NGC: Generate an NGC API key if you don’t already have one. Then, open ~/.config/enroot/.credentials and add an entry for NGC, replacing [API-KEY] with your NGC key.
Docker Hub: Add an entry for Docker Hub to your ~/.config/enroot/.credentials file, replacing [LOGIN] with your Docker Hub login and [PASSWORD] with your password.
If enroot can’t find your credentials, export the ENROOT_CONFIG_PATH variable to point to the directory where your credentials are stored, ideally in your .bashrc file, so that it’s set persistently.
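The following sketch sets the variable and appends netrc-style entries to the credentials file. The helper name add_enroot_credential is hypothetical. The machine names nvcr.io and auth.docker.io and the literal $oauthtoken login for NGC follow the convention in the enroot documentation as best understood here, so treat them as assumptions and confirm them against that documentation.

```shell
# Point enroot at the directory that holds .credentials.
# Add this line to your .bashrc to set it persistently.
export ENROOT_CONFIG_PATH="${HOME}/.config/enroot"

# Hypothetical helper: append one netrc-style entry to a credentials file.
add_enroot_credential() {
  local machine="$1" login="$2" password="$3"
  local file="${4:-${ENROOT_CONFIG_PATH}/.credentials}"
  printf 'machine %s login %s password %s\n' \
    "${machine}" "${login}" "${password}" >> "${file}"
}

# Assumed entry shapes; replace the bracketed placeholders with real values:
#   add_enroot_credential nvcr.io '$oauthtoken' '[API-KEY]'
#   add_enroot_credential auth.docker.io '[LOGIN]' '[PASSWORD]'
```

The single quotes around $oauthtoken keep the shell from expanding it, because the entry needs that text literally.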
Pull and modify a container using enroot
Pulling and modifying containers on compute nodes can be useful for debugging or for creating a container image for the first time. First, create an interactive session on a compute node. In this case, you request an interactive session on an exclusive H100 node in the H100 partition.
Next, use enroot to import and run an image from Docker Hub. The import step saves the Docker image as a squash file (.sqsh).
Use the create command to unpack that squash file into a container.
Then, run enroot with the start command using the --rw flag. This makes the container root filesystem readable and writable, so any filesystem changes made after you start the container persist.
To mount a directory into the container, use the -m flag.
For more information, see the official enroot documentation.
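The enroot workflow in this section can be sketched as two helpers. The function names, the partition argument, and the ubuntu image in the usage comment are hypothetical. The sketch assumes the import, create, and start subcommands with their --output, --name, and --rw options.

```shell
# Request an interactive shell on one exclusive node in the given partition.
start_interactive_session() {
  local partition="$1"
  srun --partition="${partition}" --exclusive --pty bash -i
}

# Import an image from Docker Hub as a squash file, unpack it into a named
# container, and start that container with a writable root filesystem.
pull_and_start() {
  local image="$1" name="$2"
  enroot import --output "${name}.sqsh" "docker://${image}" &&
    enroot create --name "${name}" "${name}.sqsh" &&
    enroot start --rw "${name}"
}

# Example (hypothetical image and container name), run on the compute node:
#   pull_and_start ubuntu my-ubuntu
```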
Pull and modify a container using Slurm
To pull and modify a container using Slurm, first create an interactive bash session with Slurm. From the login node, pull your container, and save it as a squash file. In this example, you pull the latest PyTorch container from CoreWeave and save it as a squash file.
Use --container-save to specify where to save the container, and pass an echo command so that srun has a command to run.
Mount the container
With --container-mounts, you can mount a local filesystem into the container. In the following example, /mnt/home is mounted from the local cluster to /mnt/home on the container, and an interactive job launches on a GPU node.
A test script (test_script.py) runs from /mnt and tests the CUDA installation.
CUDA installation test
Run the CUDA installation test
test_script.py persists in /mnt/home/username. However, changes to the container’s own root filesystem are not saved.
In the following example command, you launch an interactive session with the nightly-torch container and specify that it saves (using --container-save) as new-nightly-torch.sqsh. Now, you can make changes to the container itself, and they persist in the new squash filesystem, new-nightly-torch.sqsh, after you exit the container.
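The Pyxis steps in this section can be sketched as three helpers built on srun’s --container-image, --container-save, and --container-mounts options. The function names are hypothetical, and the image reference is left as an argument because this guide doesn’t preserve the exact image URI. Options for requesting a GPU node are omitted and depend on your cluster.

```shell
# Pull an image and save it as a squash file. The echo gives srun a command to run.
save_container() {
  local image="$1" squash_file="$2"
  srun --container-image="${image}" --container-save="${squash_file}" \
    echo "container saved"
}

# Open an interactive shell in a saved container with /mnt/home mounted.
# Changes to the container root filesystem are discarded on exit.
open_container_shell() {
  local squash_file="$1"
  srun --container-image="${squash_file}" \
    --container-mounts=/mnt/home:/mnt/home \
    --pty bash -i
}

# Same as above, but write the modified container to a new squash file on exit.
open_container_shell_and_save() {
  local squash_file="$1" new_squash_file="$2"
  srun --container-image="${squash_file}" \
    --container-mounts=/mnt/home:/mnt/home \
    --container-save="${new_squash_file}" \
    --pty bash -i
}

# Example: open_container_shell_and_save nightly-torch.sqsh new-nightly-torch.sqsh
```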
Install software at deployment time using s6-overlay
Use s6-overlay to install software on the Slurm login and compute nodes at cluster initialization time, or to create long-running jobs, such as services. For compute nodes, define script details in the compute.s6 section of the Slurm values.yaml file. For login nodes, define script details in the login.s6 section of the Slurm values.yaml file.
Each script needs a name, a type, and the script itself in bash.
Example compute.s6 stanza: installing rclone using s6-overlay
Install software in a persistent DFS mounted directory
Software installed in mounted DFS directories, such as your home directory or /mnt/data, persists on nodes through reboots or replacements.
Example: Conda
You can often install Python environments without root privileges. In this example, you create a conda environment called myenv using Python version 3.11 in a persistent DFS mounted directory at /mnt/data.
First, initialize conda to use bash with conda init. Then, source your .bashrc file. Next, create the conda environment with the desired specifications, where --prefix targets the location in which to store the environment. Finally, activate the environment.
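The four steps can be sketched as one helper. The function name create_dfs_env is hypothetical, and the optional third argument exists only to make the sketch easy to test. The sketch assumes that conda is already installed and on your PATH.

```shell
# Hypothetical helper: create and activate a conda environment under a DFS path.
# The third argument defaults to ~/.bashrc.
create_dfs_env() {
  local env_path="$1" python_version="$2" rc_file="${3:-${HOME}/.bashrc}"
  conda init bash          # set up conda for bash
  source "${rc_file}"      # reload the shell configuration
  # --prefix stores the environment at the given path instead of the default location.
  conda create --prefix "${env_path}" "python=${python_version}"
  conda activate "${env_path}"   # prefix environments are activated by path
}

# Example from this section: create_dfs_env /mnt/data/myenv 3.11
```

Because the environment is created with --prefix, you activate it by its full path, not by the name myenv.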