3. Submit a training job with PyTorch or TensorFlow

In this section, we'll submit a training job to Slurm using the PyTorch torchrun utility.

This section is designed to help you understand how to use Slurm with PyTorch or TensorFlow. It involves a more complex job than the one in the previous section: it runs on multiple nodes with multiple GPUs, which more closely resembles a real-world scenario.

Submit a training job using PyTorch

The torchrun launcher is used to schedule a multi-node, multi-GPU PyTorch job on Slurm. It is a superset of the launch utility from the torch.distributed package, which provides communication primitives for multi-process parallelism across multiple GPUs running on one or more nodes.

Info

It is easy to confuse the directives for Slurm with the parameters for torchrun. It may be useful to remember this distinction:

  • Slurm is responsible for allocating the nodes to be used for PyTorch.
  • torchrun is responsible for spawning the worker processes within each node.

Example using PyTorch

Here is an example script for a multi-node distributed training job using PyTorch, where the goal is to define the allocation and ensure that the srun command is run once on each node.

Example
#!/bin/bash
#SBATCH --job-name=pytorch_test
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=1
#SBATCH --time=20:00
#SBATCH --exclusive
# Activate the Python virtual environment
source /mnt/home/username/test_env/bin/activate
# Set NCCL debugging output
export NCCL_DEBUG=INFO
# Limit OpenMP to one thread per process
export OMP_NUM_THREADS=1
# Set the RDZV_HOST variable to the hostname
export RDZV_HOST=$(hostname)
# Generate a reasonably unique number for the communication port.
export RDZV_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
echo "running on host $RDZV_HOST using port $RDZV_PORT"
echo "Starting job"
date
hostname
# Start the PyTorch Process
srun python3 \
-m torch.distributed.run \
--rdzv-id=$SLURM_JOB_ID \
--rdzv-backend=c10d \
--rdzv-endpoint="$RDZV_HOST:$RDZV_PORT" \
--nnodes=2 \
--nproc-per-node=8 \
/mnt/home/user/tnet2/tnet2/train_c3_cw.py
echo "Calculation complete"
echo "Ending job"
date
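
As in the previous section, the script is submitted with sbatch. A minimal sketch, assuming the script above is saved as pytorch_test.sbatch (a hypothetical filename):

Example job submission
# Submit the batch script; Slurm prints the assigned job ID
sbatch pytorch_test.sbatch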

Here is the same script, annotated:

Example script with annotations
#!/bin/bash
#SBATCH --job-name=pytorch_test
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=1
Critical configuration - sets 1 torchrun call per node.
#SBATCH --time=20:00
#SBATCH --exclusive
# Activate the Python virtual environment
source /mnt/home/username/test_env/bin/activate
# Set NCCL debugging output
export NCCL_DEBUG=INFO
# Limit OpenMP to one thread per process
export OMP_NUM_THREADS=1
Environment variables used by PyTorch. These `export` commands are executed by the shell on the primary compute node before the `srun` command runs, so the variables are evaluated only once and then propagated to every task by srun.
# Set the RDZV_HOST variable to the hostname
export RDZV_HOST=$(hostname)
RDZV_HOST is the hostname of the primary compute node.
# Generate a reasonably unique number for the communication port.
export RDZV_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
RDZV_PORT is created to be a unique port number for PyTorch to use. Using the Slurm JobID (`$SLURM_JOBID`) as part of the port number reduces the chance of a clash, should more than one job be running on the same node. For example, job ID 123456 yields port 13456 (10000 plus the last four characters, 3456).
echo "running on host $RDZV_HOST using port $RDZV_PORT"
echo "Starting job"
date
hostname
# Start the PyTorch Process
srun python3 \
-m torch.distributed.run \
Invoking `python -m torch.distributed.run` is equivalent to invoking `torchrun`. Torchrun should be invoked with the same arguments on each node that is participating in the training run. Slurm runs this srun command once on each node in the allocation; in this example, it therefore runs twice with the same parameters. The host named in `--rdzv-endpoint` opens a port and listens for connections from the other host.
--rdzv-id=$SLURM_JOB_ID \
--rdzv-backend=c10d \
--rdzv-endpoint="$RDZV_HOST:$RDZV_PORT" \
The Rendezvous Backend in PyTorch (parameters beginning with `--rdzv-`) coordinates how distributed workers find one another before beginning a distributed training job.
--nnodes=2 \
--nproc-per-node=8 \
/mnt/home/user/tnet2/tnet2/train_c3_cw.py
echo "Calculation complete"
echo "Ending job"
date

Most of these parameters are similar to those used in the previous section, with a few notable exceptions:

  • #SBATCH --ntasks-per-node=1: This sbatch parameter is critically important when running multi-GPU PyTorch jobs with Slurm.
    • Slurm takes care of nodes, and torchrun takes care of what runs on them. If this line is omitted, the default is already one task per node.
    • However, if this number were set higher, Slurm would launch multiple torchrun instances per node, which would conflict with the worker processes torchrun itself spawns. It is therefore best practice in this configuration to explicitly set the number of tasks to 1, as is done here (a quick way to verify the task layout is sketched after this list).
    • For most other applications, this number would be set to the desired number of tasks per node.
  • export OMP_NUM_THREADS=1: An environment variable that sets the number of threads an OpenMP-enabled process may use.
    • Here, it is set to 1 explicitly.
    • If it is not set, PyTorch relies on OpenMP's default behavior, which may use as many threads as there are physical cores.
    • Explicitly setting this variable avoids accidentally launching more threads than there are processors on a node, since every worker process would otherwise start its own full-size thread pool.
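
To confirm that the allocation really launches one task per node, the same layout can be tested with a short srun command run from a login node. This is a minimal sketch that relies only on standard Slurm environment variables (SLURM_PROCID and SLURM_NTASKS); adjust the options (partition, GPUs, and so on) to match your cluster:

Example task-layout check
# Prints one line per task; with --ntasks-per-node=1 you should see exactly one line per node
srun --nodes=2 --ntasks-per-node=1 bash -c 'echo "$(hostname): task $SLURM_PROCID of $SLURM_NTASKS"'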
Info

Invoking python -m torch.distributed.run, as is done in the script above, is equivalent to invoking torchrun. Torchrun should be invoked using the same arguments on each node that is participating in the training run, which the script above ensures.
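
As an illustration of that equivalence, the two invocations below launch the same workers; here <args> is a placeholder for the rendezvous, --nnodes, and --nproc-per-node arguments used in the script above:

Equivalent launch commands
# Long form, as used in the srun command above
python3 -m torch.distributed.run <args> /mnt/home/user/tnet2/tnet2/train_c3_cw.py
# Short form using the torchrun console script
torchrun <args> /mnt/home/user/tnet2/tnet2/train_c3_cw.py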

Rendezvous Backend

The Rendezvous Backend in PyTorch coordinates how distributed workers find one another before beginning a distributed training job.

As is specified in the Rendezvous Backend documentation, multi-node training requires the following parameters:

  • --rdzv-id: A unique job ID, shared by all nodes participating in the job
  • --rdzv-backend: An implementation of the torch.distributed.elastic.rendezvous.RendezvousHandler interface
  • --rdzv-endpoint: The endpoint where the rendezvous backend is running, usually in host:port format

These parameters are implemented in the script above like so:

Rendezvous parameters
--rdzv-id=$SLURM_JOB_ID \
--rdzv-backend=c10d \
--rdzv-endpoint="$RDZV_HOST:$RDZV_PORT" \

In this case, --rdzv-backend is the c10d backend, --rdzv-endpoint is set to the $RDZV_HOST:$RDZV_PORT combination generated earlier in the script, and --rdzv-id is set to $SLURM_JOB_ID, a single unique identifier for the entire job that is the same on both nodes.

At this time, the following backends are supported without additional setup:

  • c10d (recommended)
  • etcd-v2 and etcd (legacy)
Info

To use etcd-v2 or etcd, set up an etcd server with the v2 API enabled using the --enable-v2 parameter.
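
A minimal sketch of that setup, assuming etcd is installed on a host reachable from all compute nodes (the host name etcd-host and the default client port 2379 are placeholders):

Example etcd rendezvous setup
# Start an etcd server with the v2 API enabled
etcd --enable-v2 \
    --listen-client-urls http://0.0.0.0:2379 \
    --advertise-client-urls http://etcd-host:2379
# torchrun would then be pointed at that server instead of the c10d endpoint:
#   --rdzv-backend=etcd-v2 --rdzv-endpoint=etcd-host:2379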

Launching the job in a container

To launch the same job inside a container, the script is similar, with a few key differences:

  • A virtual environment is not used.
  • A container already has the right components installed and configured; the script does not need to handle these installations.
  • The srun command specifies the container by targeting its squash file.
Example srun using a container
# The trailing torchrun arguments (elided here) are the same as in the example above
srun --container-image=/mnt/home/username/nightly-torch.sqsh \
--container-mounts=/mnt/home:/mnt/home \
--no-container-remap-root \
python3 -m torch.distributed.run ...

This container image could be loaded from a shared filesystem, or pulled from a registry.
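
For the registry case, the --container-image argument can reference the image directly. A minimal sketch, assuming the enroot/Pyxis REGISTRY#IMAGE:TAG reference syntax and a hypothetical NGC PyTorch tag:

Example srun pulling from a registry
srun --container-image=nvcr.io#nvidia/pytorch:24.05-py3 \
--container-mounts=/mnt/home:/mnt/home \
--no-container-remap-root \
python3 -m torch.distributed.run ...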

Info

For more information on developing with containers, see Develop containers with Pyxis in the previous section.

TensorFlow launch utilities with Slurm

TensorFlow differs from PyTorch: with TensorFlow, some of the logic that would otherwise live in the Slurm script moves into the Python code itself.

Warning

The following examples are not full Python code for TensorFlow, and will not run as they are presented; they are for example purposes only.

Example
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

tf.keras.backend.clear_session()

# Build the cluster specification from the Slurm environment
slurm_resolver = tf.distribute.cluster_resolver.SlurmClusterResolver(port_base=15000)
communication = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.NCCL)
mirrored_strategy = tf.distribute.MultiWorkerMirroredStrategy(
    cluster_resolver=slurm_resolver, communication_options=communication)

# Define and compile the model within the distribution strategy's scope
with mirrored_strategy.scope():
    inputs = keras.Input(shape=(392,), name='img')
    x = layers.Dense(32, activation='relu')(inputs)
    x = layers.Dense(32, activation='relu')(x)
    outputs = layers.Dense(6)(x)
    model = keras.Model(inputs=inputs, outputs=outputs, name='name')
    model.compile(.......)  # model options here

# Load x_train and y_train; x_test and y_test hold test data
history = model.fit(x_train, y_train,
                    batch_size=32,
                    epochs=5,
                    validation_split=0.2)

test_scores = model.evaluate(x_test, y_test, verbose=2)
print('Test loss:', test_scores[0])
print('Test accuracy:', test_scores[1])

To launch this job, the following Slurm script is used:

Example
#!/bin/bash
#SBATCH --job-name=name # job name
#SBATCH --nodes=2 # number of nodes
#SBATCH --ntasks-per-node=1 # number of tasks per node
#SBATCH --gres=gpu:8 # number of GPUs per node
#SBATCH --cpus-per-task=32 # experiment to get the best value
#SBATCH --time=00:20:00 # job length
#SBATCH --output=%x_%j.out # std out filename
#SBATCH --error=%x_%j.out # std err filename
#SBATCH --exclusive # use entire machine
# Activate the python virtual environment
source /mnt/home/username/test_env/bin/activate
# Set NCCL debugging output
export NCCL_DEBUG=INFO
# Limit OpenMP to one thread per process
export OMP_NUM_THREADS=1
echo "Starting job"
date
hostname
# Start the Python process (the TensorFlow script above, saved as example.py)
srun python example.py
echo "Calculation complete"
echo "Ending job"
date

Once a job is submitted to Slurm, it is visible via Grafana and via the command line.
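
On the command line, standard Slurm tools show the job's state and history. A minimal sketch (the job ID 123456 is a placeholder):

Example command-line job monitoring
# List your queued and running jobs
squeue -u $USER
# Show accounting details for a specific job while it runs or after it finishes
sacct -j 123456 --format=JobID,JobName,State,Elapsed,NodeList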