Kubeflow Training Operators
Easily run distributed training using PyTorch and MPI Operators with Kubeflow on CoreWeave Cloud
The Kubeflow project is dedicated to making deployments of Machine Learning (ML) workflows on Kubernetes simple, portable, and scalable. The goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow.
Below you will find a general introduction to distributed training and Kubeflow training operators. Links to numerous open source, end-to-end examples built to be run on CoreWeave can be found under Examples.
CoreWeave Cloud supports running Kubeflow Training Operators to easily train your Machine Learning models across a variety of frameworks and backends. The diagram below shows some of the Training Operators that Kubeflow supports - the full list can be found in the Kubeflow official documentation as well as the source code.
Kubeflow Training Operators
It can be confusing at first, so it is important to understand the distinction between the different categories in this chart, and how it impacts the code.
The first operator used in this example is the PyTorch operator. As the name suggests, it uses PyTorch as the framework, which includes packages like torch.nn.parallel. For a more in-depth explanation of distributed training with PyTorch, see the PyTorch Distributed Overview documentation page.
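As a minimal illustration of the kind of code the PyTorch operator runs, the sketch below wraps a model in torch.nn.parallel.DistributedDataParallel. It is a single-process, CPU-only toy using the "gloo" backend so it can run anywhere; a real PyTorchJob would launch many workers with the "nccl" backend and a world size greater than one, with the rank, world size, and master address injected by the operator.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process sketch: rank 0 of a world of 1, on CPU with "gloo".
# Under the PyTorch operator these values come from the environment
# of each launched worker Pod, and the backend would be "nccl".
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(10, 1)
ddp_model = DDP(model)  # gradients are all-reduced across workers in backward()
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

inputs = torch.randn(4, 10)
loss = ddp_model(inputs).sum()
loss.backward()   # the all-reduce of gradients happens here
optimizer.step()

dist.destroy_process_group()
```

With more than one worker, each process would see a different shard of the data while DistributedDataParallel keeps every replica's weights in sync.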
Unlike the PyTorch operator, the MPI Operator is decoupled from any underlying framework. It achieves this by leveraging a distributed deep learning training framework called Horovod, which supports many frameworks by implementing simple wrappers like its DistributedOptimizer. For a more in-depth explanation, see the Introduction to Kubeflow MPI Operator and Industry Adoption blog post, published when the MPI Operator was originally released.
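The core idea behind a DistributedOptimizer-style wrapper can be shown without Horovod itself. The toy, framework-free sketch below (the class and function names are illustrative, not part of any real API) intercepts gradients, averages them across workers as an all-reduce would, and only then hands them to the underlying optimizer.

```python
def allreduce_average(grads_per_worker):
    """Average each parameter's gradient across all workers.
    This is what a real all-reduce computes over the network."""
    n = len(grads_per_worker)
    return [sum(g) / n for g in zip(*grads_per_worker)]

class ToyDistributedOptimizer:
    """Toy stand-in for a DistributedOptimizer-style wrapper: it averages
    gradients across workers before delegating to the wrapped update rule."""
    def __init__(self, apply_update):
        self.apply_update = apply_update  # the underlying optimizer's step

    def step(self, grads_per_worker):
        averaged = allreduce_average(grads_per_worker)
        self.apply_update(averaged)
        return averaged

# Two workers, each holding gradients for two parameters:
opt = ToyDistributedOptimizer(apply_update=lambda grads: None)
averaged = opt.step([[1.0, 2.0], [3.0, 4.0]])  # -> [2.0, 3.0]
```

Because the wrapper only needs gradients in and gradients out, the same pattern works regardless of which framework produced them, which is what lets the MPI Operator stay framework-agnostic.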
A "distribution strategy" is the package used in your code that handles distribution. This is the main distinction between the three operators in the chart above. The distribution strategy is often implemented as a wrapper around different parts of your framework, such as the model and the optimizer.
The "backend" is the library used to facilitate the communication between devices that distributed training requires. You will rarely need to deal with these backends directly, as the distribution strategy uses them for you. The NVIDIA Collective Communication Library (NCCL) is optimized for the NVIDIA GPUs and networking in CoreWeave Cloud, so that is the backend used in this example.
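In PyTorch, the backend is just a name passed when the process group is initialized. A common pattern, sketched below, is to select NCCL when GPUs are available and fall back to Gloo for CPU-only debugging:

```python
import torch

# NCCL is the optimized backend for NVIDIA GPUs and networking on
# CoreWeave Cloud; Gloo is a CPU fallback useful for local debugging.
backend = "nccl" if torch.cuda.is_available() else "gloo"

# The distribution strategy is then initialized with this backend, e.g.:
# torch.distributed.init_process_group(backend=backend, ...)
```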
One common parallelism technique is called "data parallelism": an instance of the model runs on every GPU. During training, each batch is split into "micro-batches," where every GPU receives a single micro-batch. The gradient is averaged across all micro-batches, and every instance of the model is updated identically.
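The mechanics of a data-parallel step can be sketched in plain Python. The function names below are illustrative, and a scalar stands in for the real per-parameter gradients; the point is the split-compute-average pattern.

```python
def split_into_microbatches(batch, num_gpus):
    """Split a batch evenly into one micro-batch per GPU
    (assumes the batch size divides evenly)."""
    size = len(batch) // num_gpus
    return [batch[i * size:(i + 1) * size] for i in range(num_gpus)]

def data_parallel_step(batch, num_gpus, grad_fn):
    # Each "GPU" computes a gradient on its own micro-batch...
    microbatches = split_into_microbatches(batch, num_gpus)
    grads = [grad_fn(mb) for mb in microbatches]
    # ...then the gradients are averaged, and every model replica
    # applies the same update, keeping all replicas in sync.
    return sum(grads) / num_gpus

# Toy "gradient": the mean of the micro-batch values.
grad = data_parallel_step([1, 2, 3, 4, 5, 6, 7, 8], num_gpus=4,
                          grad_fn=lambda mb: sum(mb) / len(mb))
```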
When scaling up distributed training using data parallelism, it is considered best practice to follow the "Linear Scaling Rule," as presented in the Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour paper:
When the minibatch size is multiplied by k, multiply the learning rate by k.
In the language of the rule, "minibatch" refers to the entire batch that then gets split into micro-batches.
This means that if the total number of processes is doubled, doubling the effective batch size, then the learning rate should also be doubled. Following this rule prevents accuracy from degrading as the number of processes grows, and it keeps the training curves matched during scale-up, producing similar convergence rates in the same number of epochs.
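The arithmetic of the rule is simple enough to state as a one-line helper (the function name and the example base values are illustrative, not from the paper):

```python
def scaled_lr(base_lr, base_batch, new_batch):
    """Linear Scaling Rule: scale the learning rate by
    k = new_batch / base_batch."""
    k = new_batch / base_batch
    return base_lr * k

# E.g. doubling the effective batch size doubles the learning rate:
lr = scaled_lr(base_lr=0.1, base_batch=256, new_batch=512)  # -> 0.2
```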