
Kubeflow Training Operators

Easily run distributed training using PyTorch and MPI Operators with Kubeflow on CoreWeave Cloud

The Kubeflow project is dedicated to making deployments of Machine Learning (ML) workflows on Kubernetes simple, portable, and scalable. The goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow.

Below you will find a general introduction to distributed training and Kubeflow training operators. Links to numerous open-source, end-to-end examples built to run on CoreWeave can be found under Examples.

Kubeflow Training Operators

CoreWeave Cloud supports running Kubeflow Training Operators to easily train your Machine Learning models across a variety of frameworks and backends. The diagram below shows some of the Training Operators that Kubeflow supports; the full list can be found in the official Kubeflow documentation as well as in the source code.

[Diagram: Training Operators supported by Kubeflow]

The chart can be confusing at first, so it is important to understand the distinction between its categories and how each impacts your code.

PyTorch Operator

The first operator used in this example is the PyTorch Operator. As the name suggests, it uses PyTorch as the framework, which includes packages like torch.distributed and torch.nn.parallel. For a more in-depth explanation of distributed training with PyTorch, see the PyTorch Distributed Overview documentation page.
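The sketch below illustrates how torch.distributed and torch.nn.parallel typically appear in a script launched by the PyTorch Operator. The model, data, and hyperparameters are placeholders, and the LOCAL_RANK handling is an assumption about how the job is launched; it is a minimal sketch, not a complete training example.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # The PyTorch Operator sets RANK, WORLD_SIZE, and MASTER_ADDR/MASTER_PORT
    # for every worker Pod, so the default "env://" initialization works.
    dist.init_process_group(backend="nccl")

    # LOCAL_RANK is an assumption about the launcher; default to GPU 0.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Placeholder model; replace with your own architecture.
    model = torch.nn.Linear(1024, 10).cuda(local_rank)

    # DistributedDataParallel wraps the model and averages gradients
    # across all workers on every backward pass.
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):
        inputs = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        targets = torch.randint(0, 10, (32,), device=f"cuda:{local_rank}")
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()   # gradients are all-reduced here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```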

MPI Operator

Unlike the PyTorch Operator, the MPI Operator is decoupled from the underlying frameworks. It achieves this by leveraging a distributed deep learning training framework called Horovod. Horovod supports many frameworks by implementing simple wrappers, like its DistributedOptimizer. For a more in-depth explanation, see the Introduction to Kubeflow MPI Operator and Industry Adoption blog post that was published when the MPI Operator was originally released.
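As a rough sketch, with a placeholder model and hyperparameters, a Horovod training script launched through the MPI Operator wraps the framework's optimizer like this:

```python
import torch
import horovod.torch as hvd

# Horovod is initialized once per process; the MPI Operator launches
# one process per GPU via mpirun.
hvd.init()
torch.cuda.set_device(hvd.local_rank())

# Placeholder model; replace with your own architecture.
model = torch.nn.Linear(1024, 10).cuda()

# Scale the learning rate by the number of workers
# (see the Linear Scaling Rule discussed below).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# DistributedOptimizer wraps the framework's optimizer and averages
# gradients across workers with allreduce.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)

# Make sure every worker starts from the same weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```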

Frameworks

A "framework" is what is used to define your machine learning model. The most popular frameworks are TensorFlow and PyTorch, but Kubeflow also supports other frameworks such as MXNet and XGBoost.

Distribution strategy

A "distribution strategy" is the package that used in code that handles distribution. This is the main distinction between the three operators in the chart above. The distribution strategy is often implemented as a wrapper for different parts of your framework, such as the model and optimizer.

Backend

The "Backend" is the library that is used to facilitate the communication between devices that is required by distributed training. You will often not need to deal with these backends directly as the distribution strategy implements the use of them for you. The NVIDIA Collective Communication Library (NCCL) is optimized for the NVIDIA GPUs and Networking in CoreWeave Cloud so that is what we will be using in this example.

Data parallelism

The provided examples use models that are trained on very large datasets including ImageNet and The Pile. Utilizing parallelism techniques can greatly reduce the total training time.

One of these parallelism techniques is called "data parallelism." With data parallelism, an instance of the model runs on every GPU. During training, each batch is split into "micro-batches," and every GPU gets a single micro-batch. The gradients are averaged across all micro-batches, and every instance of the model is updated.
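To illustrate the mechanics, the sketch below hand-rolls the gradient-averaging step that DistributedDataParallel and Horovod normally perform for you after the backward pass; the function name is illustrative and not part of either library.

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Average each parameter's gradient across all workers.

    Data-parallel wrappers do this automatically after loss.backward();
    it is shown here only to illustrate how micro-batch gradients are combined.
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum the gradients from every worker's micro-batch, then
            # divide by the number of workers to get the average.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```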

When scaling up distributed training using data parallelism, it is considered best practice to follow the "Linear Scaling Rule," as presented in the Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour paper:

When the minibatch size is multiplied by k, multiply the learning rate by k.

Note: In the language of the rule, "minibatch" refers to the entire batch, which is then split into micro-batches.

This means that if the total number of processes is doubled (doubling the effective batch size), the learning rate should also be doubled. This prevents accuracy from decreasing as the number of processes is scaled up, while also keeping the training curves aligned during scale-up, which produces similar convergence rates in the same number of epochs.
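As a quick, hypothetical example of applying the rule, assume a single-GPU baseline minibatch size of 256 with a learning rate of 0.1, and the same per-GPU micro-batch size when scaling out:

```python
# Hypothetical numbers to illustrate the Linear Scaling Rule.
base_batch_size = 256      # minibatch size of the single-GPU baseline
base_lr = 0.1              # learning rate tuned for that minibatch size

world_size = 8             # number of data-parallel workers (GPUs)
per_gpu_batch_size = 256   # micro-batch processed by each GPU

effective_batch_size = world_size * per_gpu_batch_size  # 2048
k = effective_batch_size / base_batch_size              # 8.0
scaled_lr = base_lr * k                                 # 0.8

print(f"effective batch size: {effective_batch_size}, learning rate: {scaled_lr}")
```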

How-to guides and tutorials

For examples using Kubeflow Training Operator on CoreWeave Cloud, see Kubeflow Training Operator Guides.