Documentation Index
Fetch the complete documentation index at: https://docs.coreweave.com/llms.txt
Use this file to discover all available pages before exploring further.
In this section, we’ll submit a training job to Slurm using the PyTorch torchrun utility.
This section is designed to help you understand how to use Slurm with PyTorch or TensorFlow. This section involves a more complex job than the simple job in the previous section. It runs on multiple nodes with multiple GPUs, which more closely exemplifies a real-world scenario.
Submit a training job using PyTorch
The torchrun launcher is used to schedule a multi-node, multi-GPU PyTorch job on Slurm. The torchrun launcher is a superset of the launch utility from the torch.distributed package, which provides PyTorch support and communication primitives for multi-process parallelism across multiple GPUs running on one or more nodes.
It is easy to confuse the directives for Slurm with the parameters for torchrun. It may be useful to remember this distinction:
- Slurm is responsible for allocating the nodes to be used for PyTorch,
torchrun is responsible for spawning the jobs within a single node.
Example using PyTorch
Here is an annotated example script of a multi-node distributed training job using PyTorch, where the goal is to define the allocation and ensure that the srun command is run once on each node.
#!/bin/bash
#SBATCH --job-name=pytorch_test
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=1
#SBATCH --time=20:00
#SBATCH --exclusive
# Activate the Python virtual environment
source /mnt/home/username/test_env/bin/activate
# Set NCCL debugging and the network interface
export NCCL_DEBUG=INFO
export OMP_NUM_THREADS=1
# Set the RDZV_HOST variable to the hostname
export RDZV_HOST=$(hostname)
# Generate a reasonably unique number for the communication port.
export RDZV_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
echo "running on host $RDZV_HOST using port $RDZV_PORT"
echo "Starting job"
date
hostname
# Start the PyTorch Process
srun python3 \
-m torch.distributed.run \
--rdzv-id=$SLURM_JOB_ID \
--rdzv-backend=c10d \
--rdzv-endpoint="$RDZV_HOST:$RDZV_PORT" \
--nnodes=2 \
--nproc-per-node=8 \
/mnt/home/user/tnet2/tnet2/train_c3_cw.py
echo "Calculation complete"
echo "Ending job"
date
Here is the same script, annotated:
Example script with annotations
#!/bin/bash
#SBATCH --job-name=pytorch_test
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=1 # Critical configuration - sets 1 torchrun call per node.
#SBATCH --time=20:00
#SBATCH --exclusive
# Activate the Python virtual environment
source /mnt/home/username/test_env/bin/activate
# Set NCCL debugging and the network interface
export NCCL_DEBUG=INFO
# Environment variables used by PyTorch.
# These `export` commands are executed by the shell on the primary compute node
# before the `srun` command is executed, so the variables are only set once.
export OMP_NUM_THREADS=1
# Set the RDZV_HOST variable to the hostname
export RDZV_HOST=$(hostname) # RDZV_HOST is the hostname of the primary compute node.
# Generate a reasonably unique number for the communication port.
# RDZV_PORT is created to be a unique port name for PyTorch to use.
# Using the Slurm JobID (`$SLURM_JOBID`) as part of the port number reduces the chance
# of a job clash, should more than one job be running on the same node.
export RDZV_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
echo "running on host $RDZV_HOST using port $RDZV_PORT"
echo "Starting job"
date
hostname
# Start the PyTorch Process
srun python3 \
# Invoking `python -m torch.distributed.run` is equivalent to invoking `torchrun`.
# Torchrun should be invoked using the same arguments on each node that is
# participating in the training run. This srun command is run once on each
# node in the allocation by Slurm. In this example, it is run twice, using
# the same parameters. The host that is named in `--rdzv-endpoint` opens a
# port to listen from the other host.
-m torch.distributed.run \
--rdzv-id=$SLURM_JOB_ID \
--rdzv-backend=c10d \
# The Rendezvous Backend in PyTorch (parameters beginning with `--rdzv-`)
# coordinates how distributed workers find one another before beginning a
# distributed training job.
--rdzv-endpoint="$RDZV_HOST:$RDZV_PORT" \
--nnodes=2 \
--nproc-per-node=8 \
/mnt/home/user/tnet2/tnet2/train_c3_cw.py
echo "Calculation complete"
echo "Ending job"
date
Most of these parameters are similar to those used in the previous section, with a few notable exceptions:
#SBATCH --ntasks-per-node=1: This sbatch parameter is critically important when running multi-GPU PyTorch jobs with Slurm.
- Slurm takes care of nodes, and
torchrun takes care of what runs on them. If this line is omitted, the number of default tasks per node is 1.
- However, if this number were to be set higher, it might cause problems. It is therefore best practice in this configuration to explicitly set the number of tasks to
1, as is done here.
- For most other applications, this number would be set to the total number of desired tasks per node.
export OMP_NUM_THREADS=1: An environment variable, which sets the number of threads that run within an OpenMP job.
- Here, it is set to
1 explicitly.
- If it is not set, PyTorch relies on OpenMP’s default behavior, which may use as many threads as there are physical cores.
- Explicitly setting this parameter avoids accidentally launching more threads than there are processors on a node.
Invoking python -m torch.distributed.run, as is done in the script above, is equivalent to invoking torchrun. Torchrun should be invoked using the same arguments on each node that is participating in the training run, which the script above ensures.
Rendezvous Backend
The Rendezvous Backend in PyTorch coordinates how distributed workers find one another before beginning a distributed training job.
As is specified in the Rendezvous Backend documentation, multi-node training requires the following parameters:
--rdzv-id: A unique job ID, shared by all nodes participating in the job
--rdzv-backend: An implementation of the torch.distributed.elastic.rendezvous.RendezvousHandler interface
--rdzv-endpoint: The endpoint where the rendezvous backend is running, usually in host:port format
These parameters are implemented in the script above like so:
--rdzv-id=$SLURM_JOB_ID \
--rdzv-backend=c10d \
--rdzv-endpoint="$RDZV_HOST:$RDZV_PORT" \
In this case, the --rdzv-backend is the c10d backend, the --rdzv-endpoint is set to the $RDZV_HOST:$RDZV_PORT combination generated earlier in the script, and the --rdzv-id is set to $SLURM_JOB_ID, which is a single unique identifier for the entire job, which is the same on both nodes.
At this time, the following backends are supported without additional setup:
c10d (recommended)
etcd-v2 and etcd (legacy)
To use etcd-v2 or etcd, set up an etcd server with the v2 API enabled using the --enable-v2 parameter.
Launching the job in a container
To launch the same job inside a container, the script is similar, with a few key differences:
- A virtual environment is not used.
- A container already has the right components installed and configured; the script does not need to handle these installations.
- The
srun command specifies the container by targeting its squash file.
Example srun using a container
srun --container-image=/mnt/home/username/nightly-torch.sqsh \
--container-mounts=/mnt/home:/mnt/home \
--no-container-remap-root
This container image could be loaded from a shared filesystem, or pulled from a registry.
TensorFlow launch utilities with Slurm
TensorFlow differs from PyTorch. With TensorFlow, some of the logic from the Slurm script is moved into the Python code.
The following examples are not full Python code for TensorFlow, and will not run as they are presented; they are for example purposes only.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
tf.keras.backend.clear_session()
slurm_resolver = tf.distribute.cluster_resolver.SlurmClusterResolver(port_base=15000)
communication = tf.distribute.experimental.CommunicationImplementation.NCCL
mirrored_strategy = tf.distribute.MultiWorkerMirroredStrategy( \
cluster_resolver=slurm_resolver, communication_options=communication)
with mirrored_strategy.scope():
inputs = keras.Input(shape=(392,), name='img')
x = layers.Dense(32, activation='relu')(inputs)
x = layers.Dense(32, activation='relu')(x)
outputs = layers.Dense(6)(x)
model = keras.Model(inputs=inputs, outputs=outputs, name='name')
model.compile(.......) # model options here
# x_test and y_test hold test data
# Load x_train and y_train
history = model.fit(x_train, y_train,
batch_size=32,
epochs=5,
validation_split=0.2)
test_scores = model.evaluate(x_test, y_test, verbose=2)
print('Test loss:', test_scores[0])
print('Test accuracy:', test_scores[1])
To launch this job, the following Slurm script is used:
#!/bin/bash
#SBATCH --job-name=name # job name
#SBATCH --nodes=2 # number of nodes
#SBATCH --ntasks-per-node=1 # number of MPI task per node
#SBATCH --gres=gpu:8 # number of GPUs per node
#SBATCH --cpus-per-task=32 # experiment to get the best value
#SBATCH --time=00:20:00 # job length
#SBATCH --output=%x_%j.out # std out filename
#SBATCH --error=%x_%j.out # std err filename
#SBATCH --exclusive # use entire machine
# Activate the python virtual environment
source /mnt/home/username/test_env/bin/activate
# Set NCCL debugging and network interface
export NCCL_DEBUG=INFO
export OMP_NUM_THREADS=1
echo "Starting job"
date
hostname
# Start Python Process
srun python example.py
echo "Calculation complete"
echo "Ending job"
date
Once a job is submitted to Slurm, it is visible via Grafana and via the command line.