Train ResNet-50 with ImageNet

Use Kubeflow training operators to perform distributing training of the popular ResNet-50 model.

Originally published in 2015, the ResNet model architecture achieved state-of-the-art results on image classification datasets such as ImageNet. Since then, both ResNet and ImageNet have been used in numerous papers to test the performance of large scale training techniques, such as the Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour paper.

In this example, we'll use the torchvision library along with multiple Kubeflow training operators to perform distributed training of ResNet-50 on ImageNet.

Source code

Throughout the rest of this document, referenced files may be found in CoreWeave's kubernetes-cloud repo in the kubeflow/training-operator/resnet50 folder.

In this folder, there are two Python files, two Docker files, and one folder named k8s/.

The Python files implement the training scripts for the PyTorch and MPI Operators. The Dockerfiles create the respective images used by the Training Operators. Finally, the k8s/ folder contains all of the YAML files that define the Kubernetes resources needed to run this example on CoreWeave Cloud.

Util script

Since both training operators are using PyTorch as the framework, the common functions are reused and located in util.py. This common functionality includes the training and test loops, along with loading the ImageNet dataset.

PyTorch Operator Training script

The training script that will be used by the PyTorch Operator is in resnet50_pytorch.py.

The model definition, train loop, and test loop are all standard for PyTorch. The important pieces that set up distributed training are found in the main function.

First, the distributed backend is initialized. This is where the distribution strategy is told which backend to use:

if should_distribute():
    print('Using distributed PyTorch with {} backend'.format(args.backend))
    dist.init_process_group(backend=args.backend)
    print(f"Current rank: {dist.get_rank()}\tlocal rank: {LOCAL_RANK}\ttotal world size:  {dist.get_world_size()}")

Note

This script is run by each device used in distributed training. A process's "rank" is a unique integer ranging from 0 to the world size. The world size is the total number of processes that are part of the distributed training. The local rank maps a process to the GPU on the device it's running on.

Next, the model is wrapped with a Distributed Data Parallel object. This allows you to treat the model the same way as you would for non-distributed training within the training and test loops. Once the model is wrapped, it will distribute actions across all instances of the model like computing an output on a batch of data.

if is_distributed():
    model = nn.parallel.DistributedDataParallel(model)

Following the linear scaling rule, the learning rate is also scaled by the number of processes:

lr_scaler = dist.get_world_size() if is_distributed() else 1
optimizer = optim.SGD(model.parameters(), lr=args.lr * lr_scaler, momentum=args.momentum)

When the training and test loops are completed, the model checkpoint needs to be saved. Since the script is being run synchronously across multiple devices, the saving of a checkpoint requires special logic. This prevents each process from trying to save a checkpoint into the same file. Saving the checkpoint is reserved for the "root" process, which will always have a rank of 0.

if args.model_dir and dist.get_rank() == 0:
    args.model_dir.mkdir(parents=True, exist_ok=True)
    torch.save(model.state_dict(), args.model_dir / "mnist_cnn.pt")

MPI Operator Training script

The training script that will be used by the MPI Operator is in resnet50_horovod.py.

This script also uses PyTorch as its framework, but utilizes Horovod to handle distribution.

Additional Resources

Refer to the Horovod with PyTorch documentation for a general overview of adding Horovod to PyTorch scripts.

Like the PyTorch script, the model definition, train loop, and test loop are standard PyTorch implementations. The set up with Horovod is very similar to setting up PyTorch's torch.distributed.

First, the distribution strategy is initialized:

hvd.init()

Next, the optimizer is created with its learning rate scaled by the world size.

Note

Horovod and this training script support using AdaSum, a novel algorithm for gradient reduction that eliminates the need to follow the Linear Scaling Rule.

After the model and optimizer are created, it must be broadcast to all the different processes. This is required to ensure that all of the workers have consistent initialization at the beginning of training.

hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

The last thing that must be created before starting the training loop is Horovod's DistributedOptimizer.

optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters(),
                                     compression=compression,
                                     op=hvd.Adasum if args.use_adasum else hvd.Average,
                                     gradient_predivide_factor=args.gradient_predivide_factor)

Dockerfiles

In order to run the training scripts from the Kubeflow Training Operators, the Operators must be containerized, for which we will use Docker. The Dockerfiles for the PyTorch and MPI Operators are found in the Dockerfile.pytorch and Dockerfile.mpi files, respectively.

Both Dockerfiles use NVIDIA's PyTorch Docker image. This Docker image comes with many of the necessary libraries preinstalled, including NVIDIA NCCL, CUDA, and OpenMPI.

PyTorch Operator YAML

The PyTorchJob Kubernetes resource is defined in k8s/imagenet-pytorchjob.yaml.

Within this resource definition, there are two important specs: the master and the worker. The container defined by the master spec will be the process with a rank of 0, also referred to as the root.

The PyTorch Job will also set up all the environment variables that are needed by torchrun and dist to set up distributed training. MASTER_ADDR and MASTER_PORT will point at the pod defined by the master spec.

In this example, the master and worker containers all run the same script using the same arguments. Both containers take the resources of an entire node, which includes 8 GPUs. The torchrun command is used to spawn one process per GPU on that container.

A shared PVC is also attached to all the containers, so that the dataset can be accessed by all the workers.

Note

If you want to scale up the number of GPUs used for distributed training, all you would need to do is increase the number of worker replicas.

MPI Operator YAML

The MPIJob Kubernetes resource is defined in k8s/imagenet-mpijob.yaml.

Similarly to the PyTorchJob, the MPIJob defines a Launcher and a Worker spec. However, there is one main difference between the specs found in MPIJob and those found in PyTorchJob.

MPIJob uses an mpirun command in the launcher, but no commands in the worker containers. This is because the launcher container is used to orchestrate the running of the workers, and will connect to the worker containers to run the training script.

Note

The launcher container has low resource requests with no GPU, since it will not be running any of the actual training.

Like the PyTorchJob, scaling up the number of GPUs used with the MPIJob can be done by increasing the number of workers. When doing this, the value for the total number of processes will always need to be updated in the mpirun command used on the launcher container. This value is passed using the -np flag.

Important

The names of MPIJobs are unique. An old job must be deleted before a new one can be created with the same name.

Setup

Note

This guide assumes that you have already followed the process to set up the CoreWeave Kubernetes environment. If you have not done so already, follow our Getting Started guide before proceeding with this guide.

Deploy the PVC

A PVC storage volume will be used to store the dataset and model checkpoints. The PVC is defined in k8s/pvc.yaml. Use kubectl to deploy it:

kubectl apply -f k8s/pvc.yaml

FileBrowser (Optional)

This application allows you to share out and access your PVC using an easy application that lets you upload and download files and folders. You can find and deploy the Filebrowser application from the application Catalog on CoreWeave Cloud.

It is recommended that the name you give the Filebrowser application be very short, or you will run into SSL CNAME issues. We recommend using the name kubeflow.

When configuring the application instance, select the kubeflow-mnist PVC that you created earlier. Make sure that you actually add your PVC to the filebrowser list of mounts!

Build and push the Docker images

Each of the training operators require their respective Docker images to be built and pushed. From within the kubernetes-cloud/kubeflow/training-operator/resnet50 directory, build and push the Docker images.

Note

This example assumes a public Docker registry. To use a private registry, an imagePullSecret must be defined.

Log in to Docker Hub and set DOCKER_USER to your Docker Hub username. When running the following commands, be sure to replace the example username with your Docker Hub username.

docker login
export DOCKER_USER=<username>

Build and tag both images.

Important

The default Docker tag is latest. Using this tag is strongly discouraged, as containers are cached on the nodes and in other parts of the CoreWeave stack. Always use a unique tag, and never push to the same tag twice. Once you have pushed to a tag, do not push to that tag again.

To avoid using the latest tag, a simple versioning scheme is used beginning with the tag 1 for the first iteration of the image.

docker build -t $DOCKER_USER/pytorch_dist_resnet50:1 -f Dockerfile.pytorch .
docker build -t $DOCKER_USER/pytorch_mpi_resnet50:1 -f Dockerfile.mpi .

Push both images to Docker Hub.

docker push $DOCKER_USER/pytorch_dist_resnet50:1
docker push $DOCKER_USER/pytorch_mpi_resnet50:1

Create the secrets

Create a Kaggle secret

The ImageNet dataset is publicly available via a Kaggle Object Localization Challenge. To download the dataset using the Kaggle CLI, first create a Kaggle account.

After you have signed in to your new account, navigate to the Kaggle competition and accept the competition rules. When all of that is done, you should be be able to see a sample of the data in your browser:

Once your Kaggle account has access to the ImageNet dataset, create an API token by navigating to your profile page (https://www.kaggle.com/<username>/account). Click "Create API Token." This will trigger a download of a file named kaggle.json.

Next, retrieve the key value in kaggle.json, encode it using base64, then copy the encoded value into k8s/kaggle-secret.yaml.

$ cat kaggle.json
{"username":"navarreprattcw","key":"example-key-1234"}

$ echo -n "example-key-1234" | base64
ZXhhbXBsZS1rZXktMTIzNA==

When complete, your k8s/kaggle-secret.yaml should look similar to the following.

apiVersion: v1
data:
  token: ZXhhbXBsZS1rZXktMTIzNA==
kind: Secret
metadata:
  name: kaggle-token-secret
type: Opaque

Apply the changed manifest to the cluster using kubectl:

kubectl apply -f k8s/kaggle-secret.yaml

(Optional) Create a Weights and Biases secret

Note

This is optional for this tutorial.

Weights and Biases is a popular MLOps platform that helps track and visualize machine learning experiments.

If you would like to log metrics using Weights and Biases during this tutorial, then you will need to create a secret containing your wandb account token. To find your token, log in to your Weights and Biases account, then navigate to https://wandb.ai/authorize.

Encode your token using Base64, copy it into k8s/wandb-secret.yaml , and apply it to the cluster.

$ echo -n "example-wanbd-token" | base64
ZXhhbXBsZS13YW5kYi10b2tlbg==

When complete, your k8s/kaggle-secret.yaml should look similar to the following.

apiVersion: v1
data:
  token: ZXhhbXBsZS13YW5kYi10b2tlbg==
kind: Secret
metadata:
  name: wandb-token-secret
type: Opaque

Apply the changed manifest to the cluster using kubectl:

kubectl apply -f k8s/wandb-secret.yaml

Download the dataset

Downloading all of the required files for the ImageNet dataset is done by a Kubernetes job defined in k8s/imagenet-download-job.yaml.

This job uses the Kaggle secret to download the dataset via the Kaggle CLI directly into the PVC that was just created.

Important

The Kaggle token will be used from the secret, but the username needs to be updated in k8s/imagenet-download-job.yamlon line 29.

Once the Kaggle username is updated, start the job using kubectl:

kubectl apply -f k8s/imagenet-download-job.yaml

Run distributed training

Before running the Training Operators, replace the Docker image names in the YAML configuration files with the images that were just built and pushed.

You may either manually edit the files, or do so using sed by running the following commands:

sed -ri "s/^(\s*)(image\s*:\s*navarrepratt\/pytorch_dist_resnet50:1\s*$)/\1image: $DOCKER_USER\/pytorch_dist_resnet50:1/" k8s/imagenet-pytorchjob.yaml

sed -ri "s/^(\s*)(image\s*:\s*navarrepratt\/pytorch_mpi_resnet50:1\s*$)/\1image: $DOCKER_USER\/pytorch_mpi_resnet50:1/" k8s/imagenet-mpijob.yaml

Note

If you are not using Weights and Biases, you must remove the two Weights and Biases flags in the command for all of the containers.

PyTorch Operator

Deploy the PyTorchJob using kubectl:

kubectl apply -f k8s/imagenet-pytorchjob.yaml

Once it is created, you can view information about it using kubectl get:

$ kubectl get pytorchjob

NAME                      STATE     AGE
pytorch-dist-mnist-nccl   Created   52s

You can use kubectl get pods to watch the Pods start up and run:

$ kubectl get pods -w -l job-name=imagenet-pytorch

NAME                        READY   STATUS      RESTARTS   AGE
imagenet-pytorch-master-0   0/1     Completed   0          2m55s
imagenet-pytorch-worker-0   0/1     Completed   0          2m55s

Use kubectl logs to view logs for any of the Pods:

kubectl logs -f imagenet-pytorch-master-0

The model checkpoint will be saved to the PVC at the path kubeflow-resnet50/pytorch/checkpoints/resnet50_imagenet.pt.

MPI Operator

Use kubectl to deploy the MPIJob resource:

kubectl apply -f k8s/imagenet-mpijob.yaml

Once it is created, you can view information about it using kubectl get:

$ kubectl get mpijob
                               
NAME                    AGE
imagenet-16gpu-mpijob   3s

You can use kubectl get pods to watch the Pods start up and run:

$ kubectl get pods -w -l training.kubeflow.org/job-name=imagenet-16gpu-mpijob

NAME                                   READY   STATUS              RESTARTS   AGE
imagenet-16gpu-mpijob-launcher-6cd4n   0/1     ContainerCreating   0          2m4s
imagenet-16gpu-mpijob-worker-0         1/1     Running             0          2m4s

Note

The launcher Pod may fail a couple of times while the worker Pod is still starting up. This is an expected race condition, which often happens if the Docker image is already cached on the launcher machine, causing it to start up much more quickly than the worker Pod. Once the worker Pod is fully created, the launcher will be able to communicate with it via SSH.

To follow the logs, run the following command:

$ kubectl logs -l job-name=imagenet-16gpu-mpijob-launcher -f

The model checkpoint will be saved to the PVC at the path kubeflow-mnist/mpi/checkpoints/resnet50_imagenet.pt.

Performance analysis

Using the chart's created on Weights and Biases we can analyze the scaling efficiencies of both of the training operators.

Note

The hyperparameters used haven't been properly tuned to produce a "state of the art" version of ResNet-50. TorchVision publishes "training recipes" for their pretrained weights. You can see all the hyperparameters they used for ResNet-50 here.

The data in the chart below shows samples per second numbers throughout 3 epochs of training on each GPU. This means that the total samples per second is the value shown in the chart times the number of GPUs used. Each line represents a different combination of PytorchJob and MPIJob and half-full A40 nodes (8 and 16 GPUs).

As you can see, the per-GPU throughput hardly drops when moving to two nodes. This means the total throughput is almost doubled when using twice as many GPUs. You can expect the scaling efficiency to decrease as you increase the model size and total number of GPUs.

Horovod's own benchmarks report a 90% scaling efficiency when scaling up to 512 total GPUs when training ResNet-101.

Note

Training with 8 GPUs has twice as many steps in the above chart because during each step only 8 * 256 images are processed. When training with 16 GPUs, two times as many images are processed in each batch.

Clean up

Once you are finished with everything, you can delete all resources using the kubectl delete command:

kubectl delete -R -f k8s/

Note

If you created the optional Filebrowser application, you will need to delete it via the CoreWeave Cloud Applications page before the PVC can be deleted.

Last updated