3. Submit a training job with PyTorch or TensorFlow
In this section, we'll submit a training job to Slurm using the PyTorch `torchrun` utility.
This section is designed to help you understand how to use Slurm with PyTorch or TensorFlow. The job is more complex than the simple job in the previous section: it runs on multiple nodes with multiple GPUs, which more closely resembles a real-world scenario.
Submit a training job using PyTorch
The `torchrun` launcher is used to schedule a multi-node, multi-GPU PyTorch job on Slurm. `torchrun` is a superset of the launch utility from the `torch.distributed` package, which provides PyTorch support and communication primitives for multi-process parallelism across multiple GPUs running on one or more nodes.
It is easy to confuse the directives for Slurm with the parameters for `torchrun`. It may be useful to remember this distinction:
- Slurm is responsible for allocating the nodes to be used for PyTorch.
- `torchrun` is responsible for spawning the jobs within a single node.
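As a rough sketch of that division of labor (illustrative only; `train.py` is a hypothetical script name, and the node and GPU counts match the example later in this section):

```bash
#!/bin/bash
# Slurm's side: request the resources and decide where the job runs.
#SBATCH --nodes=2            # Slurm allocates two nodes
#SBATCH --gres=gpu:8         # each with eight GPUs
#SBATCH --ntasks-per-node=1  # and runs the command below once per node

# torchrun's side: spawn and supervise the worker processes on each node.
srun torchrun --nnodes=2 --nproc-per-node=8 train.py
```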
Example using PyTorch
Here is an example script for a multi-node distributed training job using PyTorch, where the goal is to define the allocation and ensure that the `srun` command is run once on each node.
```bash
#!/bin/bash
#SBATCH --job-name=pytorch_test
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=1
#SBATCH --time=20:00
#SBATCH --exclusive

# Activate the Python virtual environment
source /mnt/home/username/test_env/bin/activate

# Set NCCL debugging and the network interface
export NCCL_DEBUG=INFO
export OMP_NUM_THREADS=1

# Set the RDZV_HOST variable to the hostname
export RDZV_HOST=$(hostname)

# Generate a reasonably unique number for the communication port.
export RDZV_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))

echo "running on host $RDZV_HOST using port $RDZV_PORT"
echo "Starting job"
date
hostname

# Start the PyTorch Process
srun python3 \
    -m torch.distributed.run \
    --rdzv-id=$SLURM_JOB_ID \
    --rdzv-backend=c10d \
    --rdzv-endpoint="$RDZV_HOST:$RDZV_PORT" \
    --nnodes=2 \
    --nproc-per-node=8 \
    /mnt/home/user/tnet2/tnet2/train_c3_cw.py

echo "Calculation complete"
echo "Ending job"
date
```
Here is the same script, annotated:
```bash
#!/bin/bash
#SBATCH --job-name=pytorch_test
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=1  # Critical configuration - sets 1 torchrun call per node.
#SBATCH --time=20:00
#SBATCH --exclusive

# Activate the Python virtual environment
source /mnt/home/username/test_env/bin/activate

# Set NCCL debugging and the network interface.
# These environment variables are used by PyTorch. The export commands are
# executed by the shell on the primary compute node before the srun command
# is executed, so the variables are only set once.
export NCCL_DEBUG=INFO
export OMP_NUM_THREADS=1

# Set the RDZV_HOST variable to the hostname.
# RDZV_HOST is the hostname of the primary compute node.
export RDZV_HOST=$(hostname)

# Generate a reasonably unique number for the communication port.
# RDZV_PORT is created to be a unique port number for PyTorch to use. Using the
# Slurm JobID ($SLURM_JOBID) as part of the port number reduces the chance of a
# job clash, should more than one job be running on the same node.
export RDZV_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))

echo "running on host $RDZV_HOST using port $RDZV_PORT"
echo "Starting job"
date
hostname

# Start the PyTorch Process.
# Invoking `python -m torch.distributed.run` is equivalent to invoking `torchrun`.
# torchrun should be invoked using the same arguments on each node that is
# participating in the training run. Slurm runs this srun command once on each
# node in the allocation; in this example, it is run twice, using the same
# parameters. The host named in --rdzv-endpoint opens a port to listen for the
# other host.
# The Rendezvous Backend parameters (beginning with --rdzv-) coordinate how
# distributed workers find one another before beginning a distributed training job.
srun python3 \
    -m torch.distributed.run \
    --rdzv-id=$SLURM_JOB_ID \
    --rdzv-backend=c10d \
    --rdzv-endpoint="$RDZV_HOST:$RDZV_PORT" \
    --nnodes=2 \
    --nproc-per-node=8 \
    /mnt/home/user/tnet2/tnet2/train_c3_cw.py

echo "Calculation complete"
echo "Ending job"
date
```
Most of these parameters are similar to those used in the previous section, with a few notable exceptions:
- `#SBATCH --ntasks-per-node=1`: This `sbatch` parameter is critically important when running multi-GPU PyTorch jobs with Slurm.
  - Slurm takes care of nodes, and `torchrun` takes care of what runs on them. If this line is omitted, the default number of tasks per node is `1`.
  - However, if this number were set higher, it might cause problems. It is therefore best practice in this configuration to explicitly set the number of tasks to `1`, as is done here (see the verification sketch after this list).
  - For most other applications, this number would be set to the total number of desired tasks per node.
- `export OMP_NUM_THREADS=1`: An environment variable that sets the number of threads that run within an OpenMP job.
  - Here, it is set to `1` explicitly.
  - If it is not set, PyTorch relies on OpenMP's default behavior, which may use as many threads as there are physical cores.
  - Explicitly setting this parameter avoids accidentally launching more threads than there are processors on a node.
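If you want to confirm the one-task-per-node behavior before launching a long training run, a quick check is to submit a trivial job with the same directives. This is a minimal sketch, not part of the example above; the job name and time limit are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=task_layout_check
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=05:00

# With --ntasks-per-node=1, Slurm runs this command once per node, so the
# output file should contain exactly one hostname per allocated node.
srun hostname
```

If the output contains more hostnames than nodes, the per-node task count is higher than 1, and `torchrun` would be launched more than once on the same node.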
Invoking `python -m torch.distributed.run`, as is done in the script above, is equivalent to invoking `torchrun`. `torchrun` should be invoked using the same arguments on each node that is participating in the training run, which the script above ensures.
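For reference, the two invocations below behave identically; `train.py` here is a hypothetical placeholder for your training script:

```bash
# Long form, as used in the example script above
python3 -m torch.distributed.run --nnodes=2 --nproc-per-node=8 train.py

# torchrun shorthand; same launcher, same arguments
torchrun --nnodes=2 --nproc-per-node=8 train.py
```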
Rendezvous Backend
The Rendezvous Backend in PyTorch coordinates how distributed workers find one another before beginning a distributed training job.
As is specified in the Rendezvous Backend documentation, multi-node training requires the following parameters:
- `--rdzv-id`: A unique job ID, shared by all nodes participating in the job
- `--rdzv-backend`: An implementation of the `torch.distributed.elastic.rendezvous.RendezvousHandler` interface
- `--rdzv-endpoint`: The endpoint where the rendezvous backend is running, usually in `host:port` format
These parameters are implemented in the script above like so:
```bash
--rdzv-id=$SLURM_JOB_ID \
--rdzv-backend=c10d \
--rdzv-endpoint="$RDZV_HOST:$RDZV_PORT" \
```
In this case, the `--rdzv-backend` is the `c10d` backend, the `--rdzv-endpoint` is set to the `$RDZV_HOST:$RDZV_PORT` combination generated earlier in the script, and the `--rdzv-id` is set to `$SLURM_JOB_ID`, a single unique identifier for the entire job that is the same on both nodes.
At this time, the following backends are supported without additional setup:
- `c10d` (recommended)
- `etcd-v2` and `etcd` (legacy)
To use `etcd-v2` or `etcd`, set up an etcd server with the v2 API enabled using the `--enable-v2` parameter.
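As a rough sketch of what that setup might look like (the host name `etcd-host` and the default etcd client port 2379 are placeholders, and the exact flags may vary with your etcd version):

```bash
# On a host reachable from every compute node, start etcd with the v2 API enabled
etcd --enable-v2 \
    --listen-client-urls http://0.0.0.0:2379 \
    --advertise-client-urls http://etcd-host:2379

# Then point torchrun at the etcd server instead of using the c10d backend
torchrun \
    --rdzv-id=$SLURM_JOB_ID \
    --rdzv-backend=etcd-v2 \
    --rdzv-endpoint=etcd-host:2379 \
    --nnodes=2 --nproc-per-node=8 train.py
```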
Launching the job in a container
To launch the same job inside a container, the script is similar, with a few key differences:
- A virtual environment is not used. A container already has the right components installed and configured; the script does not need to handle these installations.
- The `srun` command specifies the container by targeting its squash file.
```bash
srun --container-image=/mnt/home/username/nightly-torch.sqsh \
    --container-mounts=/mnt/home:/mnt/home \
    --no-container-remap-root \
```
This container image could be loaded from a shared filesystem, or pulled from a registry.
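Putting those flags together with the training command from the earlier script, the containerized `srun` invocation might look like the following sketch; the combination shown here is an assumption based on the snippet above and the non-containerized example:

```bash
# Run torchrun inside the container on each allocated node
srun --container-image=/mnt/home/username/nightly-torch.sqsh \
    --container-mounts=/mnt/home:/mnt/home \
    --no-container-remap-root \
    python3 -m torch.distributed.run \
    --rdzv-id=$SLURM_JOB_ID \
    --rdzv-backend=c10d \
    --rdzv-endpoint="$RDZV_HOST:$RDZV_PORT" \
    --nnodes=2 \
    --nproc-per-node=8 \
    /mnt/home/user/tnet2/tnet2/train_c3_cw.py
```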
For more information on developing on containers, see Develop containers with Pyxis in the previous section.
TensorFlow launch utilities with Slurm
TensorFlow differs from PyTorch in that some of the logic from the Slurm script moves into the Python code.
The following examples are not full Python code for TensorFlow, and will not run as they are presented; they are for example purposes only.
```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

tf.keras.backend.clear_session()

# Resolve the cluster configuration from Slurm's environment variables
slurm_resolver = tf.distribute.cluster_resolver.SlurmClusterResolver(port_base=15000)

# Use NCCL for collective communication between GPU workers
communication = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.NCCL)
mirrored_strategy = tf.distribute.MultiWorkerMirroredStrategy(
    cluster_resolver=slurm_resolver, communication_options=communication)

with mirrored_strategy.scope():
    inputs = keras.Input(shape=(392,), name='img')
    x = layers.Dense(32, activation='relu')(inputs)
    x = layers.Dense(32, activation='relu')(x)
    outputs = layers.Dense(6)(x)
    model = keras.Model(inputs=inputs, outputs=outputs, name='name')
    model.compile(.......)  # model options here

# x_test and y_test hold test data
# Load x_train and y_train
history = model.fit(x_train, y_train,
                    batch_size=32,
                    epochs=5,
                    validation_split=0.2)

test_scores = model.evaluate(x_test, y_test, verbose=2)
print('Test loss:', test_scores[0])
print('Test accuracy:', test_scores[1])
```
To launch this job, the following Slurm script is used:
```bash
#!/bin/bash
#SBATCH --job-name=name          # job name
#SBATCH --nodes=2                # number of nodes
#SBATCH --ntasks-per-node=1      # number of MPI tasks per node
#SBATCH --gres=gpu:8             # number of GPUs per node
#SBATCH --cpus-per-task=32       # experiment to get the best value
#SBATCH --time=00:20:00          # job length
#SBATCH --output=%x_%j.out       # std out filename
#SBATCH --error=%x_%j.out        # std err filename
#SBATCH --exclusive              # use entire machine

# Activate the python virtual environment
source /mnt/home/username/test_env/bin/activate

# Set NCCL debugging and network interface
export NCCL_DEBUG=INFO
export OMP_NUM_THREADS=1

echo "Starting job"
date
hostname

# Start Python Process
srun python example.py

echo "Calculation complete"
echo "Ending job"
date
```
Once a job is submitted to Slurm, it is visible via Grafana and via the command line.
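For example, assuming the standard Slurm client tools are available on the login node, you can check on a submitted job from the command line (`<jobid>` is a placeholder, and output columns vary by cluster configuration):

```bash
# List your queued and running jobs
squeue -u $USER

# After the job finishes, show its state, elapsed time, and exit code
sacct -j <jobid> --format=JobID,JobName,State,Elapsed,ExitCode
```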