> ## Documentation Index
> Fetch the complete documentation index at: https://docs.coreweave.com/llms.txt
> Use this file to discover all available pages before exploring further.

# 3. Submit a training job with PyTorch or TensorFlow

> Submit multi-node, multi-GPU distributed training jobs using PyTorch torchrun on Slurm

In this section, you submit a training job to Slurm using the PyTorch `torchrun` utility. This step builds on [the previous section](/products/sunk/tutorials/train-on-sunk/2-submit-simple-job) by introducing a multi-node, multi-GPU job that more closely exemplifies a real-world distributed training scenario. By the end, you've submitted a PyTorch training job across multiple nodes and GPUs, and you understand how the equivalent workflow differs for TensorFlow.

This section helps you understand how to use Slurm with PyTorch or TensorFlow.

## Submit a training job using PyTorch

Use the [`torchrun` launcher](https://docs.pytorch.org/docs/stable/elastic/run.html#launcher-api) to schedule a multi-node, multi-GPU [PyTorch](https://pytorch.org) job on Slurm. The `torchrun` launcher is a superset of the launch utility from the [`torch.distributed`](https://docs.pytorch.org/docs/stable/distributed.html#) package, which provides PyTorch support and communication primitives for multi-process parallelism across multiple GPUs running on one or more nodes.

<Info>
  The directives for Slurm and the parameters for `torchrun` can be confused with one another. To keep them straight, remember this distinction:

  * Slurm **allocates the nodes** to be used for PyTorch.
  * `torchrun` **spawns the jobs** within a single node.
</Info>

## Example using PyTorch

The following annotated example script shows a multi-node distributed training job using PyTorch. The goal is to define the allocation and ensure that the `srun` command runs once on each node. Replace `[USERNAME]` with your username.

```bash theme={"system"}
#!/bin/bash

#SBATCH --job-name=pytorch_test
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=1
#SBATCH --time=20:00
#SBATCH --exclusive

# Activate the Python virtual environment
source /mnt/home/[USERNAME]/test_env/bin/activate

# Set NCCL debugging and the network interface
export NCCL_DEBUG=INFO
export OMP_NUM_THREADS=1

# Set the RDZV_HOST variable to the hostname
export RDZV_HOST=$(hostname)

# Generate a reasonably unique number for the communication port.
export RDZV_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))

echo "running on host $RDZV_HOST using port $RDZV_PORT"

echo "Starting job"
date
hostname

# Start the PyTorch Process
srun python3 \
    -m torch.distributed.run \
    --rdzv-id=$SLURM_JOB_ID \
    --rdzv-backend=c10d \
    --rdzv-endpoint="$RDZV_HOST:$RDZV_PORT" \
    --nnodes=2 \
    --nproc-per-node=8 \
    /mnt/home/[USERNAME]/tnet2/tnet2/train_c3_cw.py

echo "Calculation complete"

echo "Ending job"
date
```

The following annotated version of the script shows the same code:

```bash title="Example script with annotations" theme={"system"}
#!/bin/bash

#SBATCH --job-name=pytorch_test
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=1 # Critical configuration. Sets 1 torchrun call per node.
#SBATCH --time=20:00
#SBATCH --exclusive

# Activate the Python virtual environment
source /mnt/home/[USERNAME]/test_env/bin/activate

# Set NCCL debugging and the network interface
export NCCL_DEBUG=INFO
# Environment variables used by PyTorch.
# These `export` commands are executed by the shell on the primary compute node
# before the `srun` command is executed, so the variables are only set once.
export OMP_NUM_THREADS=1

# Set the RDZV_HOST variable to the hostname
export RDZV_HOST=$(hostname) # RDZV_HOST is the hostname of the primary compute node.

# Generate a reasonably unique number for the communication port.
# RDZV_PORT is created to be a unique port name for PyTorch to use.
# Using the Slurm JobID (`$SLURM_JOBID`) as part of the port number reduces the chance
# of a job clash, should more than one job be running on the same node.
export RDZV_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))

echo "running on host $RDZV_HOST using port $RDZV_PORT"

echo "Starting job"
date
hostname

# Start the PyTorch Process
srun python3 \
# Invoking `python -m torch.distributed.run` is equivalent to invoking `torchrun`.
# Torchrun should be invoked using the same arguments on each node that is
# participating in the training run. This srun command is run once on each
# node in the allocation by Slurm. In this example, it is run twice, using
# the same parameters. The host that is named in `--rdzv-endpoint` opens a
# port to listen from the other host.
    -m torch.distributed.run \
    --rdzv-id=$SLURM_JOB_ID \
    --rdzv-backend=c10d \
    # The Rendezvous Backend in PyTorch (parameters beginning with `--rdzv-`)
    # coordinates how distributed workers find one another before beginning a
    # distributed training job.
    --rdzv-endpoint="$RDZV_HOST:$RDZV_PORT" \
    --nnodes=2 \
    --nproc-per-node=8 \
    /mnt/home/[USERNAME]/tnet2/tnet2/train_c3_cw.py

echo "Calculation complete"

echo "Ending job"
date
```

Most of these parameters are similar to those used in [the previous section](/products/sunk/tutorials/train-on-sunk/2-submit-simple-job), with a few notable exceptions:

* `#SBATCH --ntasks-per-node=1`: This `sbatch` parameter is **important** when running multi-GPU PyTorch jobs with Slurm.
  * Slurm manages nodes, and `torchrun` manages what runs on them. If you omit this line, the default number of tasks per node is `1`.
  * However, setting this number higher might cause problems. As a best practice in this configuration, explicitly set the number of tasks to `1`, as shown here.
  * For most other applications, set this number to the total number of desired tasks per node.
* `export OMP_NUM_THREADS=1`: An environment variable that sets the number of threads that run within an OpenMP job.
  * Here, it's set to `1` explicitly.
  * If you don't set it, PyTorch relies on OpenMP's default behavior, which may use as many threads as there are physical cores.
  * Explicitly setting this parameter avoids accidentally launching more threads than there are processors on a node.

<Info>
  Invoking `python -m torch.distributed.run`, as the preceding script does, is equivalent to invoking `torchrun`. Invoke `torchrun` using the same arguments on each node that participates in the training run, which the preceding script ensures.
</Info>

### Rendezvous backend

The [Rendezvous Backend in PyTorch](https://docs.pytorch.org/docs/stable/elastic/run.html#launcher-api) coordinates how distributed workers find one another before beginning a distributed training job.

As specified in [the Rendezvous Backend documentation](https://docs.pytorch.org/docs/stable/elastic/run.html#note-on-rendezvous-backend), multi-node training requires the following parameters:

* `--rdzv-id`: A unique job ID, shared by all nodes participating in the job.
* `--rdzv-backend`: An implementation of the [`torch.distributed.elastic.rendezvous.RendezvousHandler`](https://docs.pytorch.org/docs/stable/elastic/rendezvous.html#torch.distributed.elastic.rendezvous.RendezvousHandler) interface.
* `--rdzv-endpoint`: The endpoint where the rendezvous backend runs, usually in `host:port` format.

The preceding script implements these parameters as follows:

```text title="Rendezvous parameters" theme={"system"}
    --rdzv-id=$SLURM_JOB_ID \
    --rdzv-backend=c10d \
    --rdzv-endpoint="$RDZV_HOST:$RDZV_PORT" \
```

In this case, the `--rdzv-backend` is the `c10d` backend, the `--rdzv-endpoint` is set to the `$RDZV_HOST:$RDZV_PORT` combination generated earlier in the script, and the `--rdzv-id` is set to `$SLURM_JOB_ID`, which is a single unique identifier for the entire job and is the same on both nodes.

The following backends are supported without additional setup:

* `c10d` (recommended)
* `etcd-v2` and `etcd` (legacy)

<Info>
  To use `etcd-v2` or `etcd`, set up an etcd server with the `v2` API enabled using the `--enable-v2` parameter.
</Info>

### Launch the job in a container

To launch the same job inside a container, use a similar script with a few key differences:

* Don't use a virtual environment.
* A container already has the right components installed and configured, so the script doesn't need to handle these installations.
* The `srun` command specifies the container by targeting its squash file.

Replace `[USERNAME]` with your username.

```bash title="Example srun using a container" theme={"system"}
srun --container-image=/mnt/home/[USERNAME]/nightly-torch.sqsh \
    --container-mounts=/mnt/home:/mnt/home \
    --no-container-remap-root
```

You can load this container image from a shared filesystem or pull it from a registry.

<Info>
  For more information about developing containers, see [Develop containers with Pyxis](/products/sunk/tutorials/train-on-sunk/1-set-up-slurm-cluster#develop-containers-with-pyxis) in the previous section.
</Info>

## TensorFlow launch utilities with Slurm

If you use TensorFlow instead of PyTorch, the launch workflow differs. With TensorFlow, some of the logic from the Slurm script moves into the Python code.

<Warning>
  The following examples aren't full Python code for TensorFlow and don't run as presented. They are for example purposes only.
</Warning>

```python theme={"system"}
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

tf.keras.backend.clear_session()
slurm_resolver = tf.distribute.cluster_resolver.SlurmClusterResolver(port_base=15000)
communication = tf.distribute.experimental.CommunicationImplementation.NCCL
mirrored_strategy = tf.distribute.MultiWorkerMirroredStrategy( \
    cluster_resolver=slurm_resolver, communication_options=communication)
with mirrored_strategy.scope():
    inputs = keras.Input(shape=(392,), name='img')
    x = layers.Dense(32, activation='relu')(inputs)
    x = layers.Dense(32, activation='relu')(x)
    outputs = layers.Dense(6)(x)

    model = keras.Model(inputs=inputs, outputs=outputs, name='name')
    model.compile(.......) # model options here

# x_test and y_test hold test data
# Load x_train and y_train

history = model.fit(x_train, y_train,
                    batch_size=32,
                    epochs=5,
                    validation_split=0.2)
test_scores = model.evaluate(x_test, y_test, verbose=2)
print('Test loss:', test_scores[0])
print('Test accuracy:', test_scores[1])
```

To launch this job, use the following Slurm script. Replace `[USERNAME]` with your username.

```bash theme={"system"}
#!/bin/bash
#SBATCH --job-name=name     # job name
#SBATCH --nodes=2           # number of nodes
#SBATCH --ntasks-per-node=1 # number of MPI task per node
#SBATCH --gres=gpu:8        # number of GPUs per node
#SBATCH --cpus-per-task=32  # experiment to get the best value
#SBATCH --time=00:20:00     # job length
#SBATCH --output=%x_%j.out  # std out filename
#SBATCH --error=%x_%j.out   # std err filename
#SBATCH --exclusive         # use entire machine

# Activate the python virtual environment
source /mnt/home/[USERNAME]/test_env/bin/activate

# Set NCCL debugging and network interface

export NCCL_DEBUG=INFO
export OMP_NUM_THREADS=1

echo "Starting job"
date
hostname

# Start Python Process
srun python example.py

echo "Calculation complete"

echo "Ending job"
date
```

After you submit a job to Slurm, it's visible [through Grafana](/products/sunk/tutorials/train-on-sunk/4-monitor-jobs) and from the command line. In the next section, you monitor running jobs using these methods.
