Fine-Tune GPT-NeoX-20B with Argo Workflows

Use Argo Workflows to run a full pipeline which ends with distributed fine-tuning of GPT-NeoX-20B

The Fine-tuning Machine Learning Models tutorial provides a similar example of using Argo Workflows to fine-tune a smaller model (GPT-J) on a smaller dataset. If you are new to fine-tuning and Argo Workflows, that tutorial is a great place to start.

This example uses two A100 nodes (16 GPUs in total) connected with NVIDIA's NVLink and InfiniBand technologies for high-performance distributed training.

Optional A40 configuration

If A100 resources aren't available in the selected region, A40 nodes can be substituted. See the A40 option later in this tutorial and check the Debugging section for more tips.

Source code

Throughout the rest of this document, referenced files may be found in CoreWeave's kubernetes-cloud repo in the kubeflow/training-operator/gpt-neox folder.

In this folder, you will find numbered YAML files:

Example

01-pvc.yaml
02-finetune-role.yaml
03-wandb-secret.yaml
04-finetune-workflow.yaml

These should be deployed in the order in which they are numbered. The first three YAML files (01-pvc.yaml, 02-finetune-role.yaml, and 03-wandb-secret.yaml) deploy the Kubernetes resources that are required by the Argo workflow, which is deployed using the fourth file (04-finetune-workflow.yaml).

Understanding the Fine-tune Workflow

The Argo workflow for fine-tuning is defined in the 04-finetune-workflow.yaml file. This file consists of three important sections:

workflow parameters,
the directed-acyclic graph (DAG) definition,
and the step definitions.

Parameters

At the top of the 04-finetune-workflow.yaml file, all of the Workflow parameters and their default values are defined. If none of these are changed, the Workflow will download and tokenize the Hackernews subset of The Pile dataset before fine-tuning GPT-NeoX-20B on it using two A100_NVLINK_80GB nodes in the LAS1 region.

If you have already downloaded the model weights, dataset, and/or tokenized the dataset, you can skip those steps by changing the respective parameters to false:

download_checkpoint
download_dataset
tokenize_dataset

Note

The screenshot above shows these steps being skipped.
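As a minimal sketch, the three skip parameters can be passed with the -p flag of the submit command shown in the Run the Workflow section. The snippet below only builds and prints that command so it can be reviewed first. The run_name value and service account are copied from that later example, and it assumes the preparation stages have already been run once against the same PVCs.

```shell
# Build an `argo submit` command that skips the three preparation stages.
# The parameter names come from the list above; the rest mirrors the submit
# command used later in this tutorial.
skip_flags=""
for p in download_checkpoint download_dataset tokenize_dataset; do
  skip_flags="$skip_flags -p $p=false"
done

submit_cmd="argo submit 04-finetune-workflow.yaml -p run_name=finetune_gpt_neox$skip_flags --serviceaccount finetune"

# Print the command for review instead of running it.
echo "$submit_cmd"
```

Once the printed command looks right, it can be pasted into a shell that has the Argo CLI configured.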

The fine-tuning stage uses Weights and Biases for extra logging. This is controlled by the three wandb parameters. To avoid using Weights and Biases, you can set the values of these parameters to an empty string.

DAG definition

The DAG is defined in the Workflow file under the main template. The graph is organized based on the dependencies of all the stages. For example, the two download stages don't have any dependencies, so they both run in parallel right away.

Note

It is not demonstrated in this example, but it is possible to have more advanced logic around dependencies, such as running certain stages based on the completion status of another stage.
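The dependency ordering described in this section can be illustrated with the standard tsort utility. This is only a sketch: the stage names are hypothetical, and the edges are assumptions (tokenization waits for the dataset download, and fine-tuning waits for both the checkpoint and the tokenized data). The actual template names and dependencies are defined in 04-finetune-workflow.yaml.

```shell
# Each line reads "<stage> <stage that depends on it>".
# Stage names are illustrative, not the names used in the Workflow file.
dag_edges='download-dataset tokenize-dataset
download-checkpoint finetune
tokenize-dataset finetune'

# tsort prints one valid sequential order. Stages with no dependency between
# them, such as the two downloads, could run in parallel.
stage_order=$(printf '%s\n' "$dag_edges" | tsort)
echo "$stage_order"
```

Whatever order tsort picks for the independent stages, the fine-tuning stage is always printed last because every other stage precedes it.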

Source code

The code that performs the dataset tokenization and fine-tuning comes directly from EleutherAI's gpt-neox repo. This code is included in the Docker image that is built from CoreWeave's ml-containers repo.

Training config

The config file containing all of the training-related parameters is stored within a Kubernetes ConfigMap. If you would like to adjust these parameters, the ConfigMap can be edited within the 04-finetune-workflow.yaml file in the create-configmap stage definition.

Fine-tuning resources

By default, this Workflow uses two A100 nodes with NVLINK and RDMA over InfiniBand for extremely fast distributed training. These resources are defined in the Worker spec of the MPIJob at the bottom of the Workflow's YAML manifest:

Example

resources:
  requests:
    cpu: 48
    memory: 256Gi
    nvidia.com/gpu: 8
    rdma/ib: 1
  limits:
    nvidia.com/gpu: 8
    rdma/ib: 1
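To see what the job asks for in total, the per-worker requests can be multiplied by the number of workers. The sketch below assumes one worker Pod per A100 node, so two workers, which matches the 16 GPUs mentioned at the start of this tutorial.

```shell
# Per-worker requests copied from the Worker spec above.
workers=2              # assumption: one worker Pod per A100 node
gpus_per_worker=8
cpus_per_worker=48
mem_gi_per_worker=256

total_gpus=$((workers * gpus_per_worker))
total_cpus=$((workers * cpus_per_worker))
total_mem_gi=$((workers * mem_gi_per_worker))

echo "GPUs: ${total_gpus}, CPU cores: ${total_cpus}, memory: ${total_mem_gi}Gi"
```

These totals are useful when checking whether enough capacity is available before submitting the Workflow.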

RDMA over InfiniBand

As shown in the resources block above, requesting the rdma/ib resource enables Remote Direct Memory Access (RDMA) over InfiniBand. RDMA allows one computer to read from and write to the memory of another directly, without involving the operating system of either machine. In this setup, that traffic is carried over the InfiniBand fabric.

Requesting this resource offers a big boost to distributed training performance. However, it is currently only available for A100 and H100 GPU node types on CoreWeave Cloud. Learn more about InfiniBand on CoreWeave Cloud.

Setup

Before running the Workflow, a few things need to be created in your namespace.

Note

This guide assumes that you have already followed the process to set up the CoreWeave Kubernetes environment. If you have not done so already, follow our Getting Started guide before proceeding with this guide.

Argo Workflows

To run an Argo workflow, first deploy the Argo Workflows application in your namespace via CoreWeave's Applications Catalog.

Additional Information

For more information on Argo Workflows, see Workflows.

PVC

The Argo workflow uses Persistent Volume Claims (PVCs) to store the dataset and model checkpoints. The PVCs are defined in 01-pvc.yaml and can be deployed with kubectl:

Example

$ kubectl apply -f 01-pvc.yaml

Optional

You can deploy a FileBrowser application attaching the newly created PVCs to be able to inspect their contents in your browser.

Fine-tune role

This Argo Workflow creates new Kubernetes resources in your environment: a ConfigMap and an MPIJob. In order to create these, the Workflow needs to run as a Service Account with the necessary permissions granted to it.

The finetune Service Account and the permissions that the Workflow will later use are defined in 02-finetune-role.yaml. Apply it to your namespace with kubectl:

Example

$ kubectl apply -f 02-finetune-role.yaml

Weights and Biases secret

If you would like to take advantage of Weights and Biases logging during fine-tuning, create a Secret that contains your WandB account key. To do this, first acquire your key from Weights and Biases and encode it using base64.

Example

$ echo -n "example-wanbd-key" | base64

ZXhhbXBsZS13YW5iZC1rZXk=

Then, copy the encoded value into line 3 of 03-wandb-secret.yaml.

When complete, the file should look like this:

Example

apiVersion: v1
data:
  token: ZXhhbXBsZS13YW5iZC1rZXk=
kind: Secret
metadata:
  name: wandb-token-secret
type: Opaque

Once the file is updated with your encoded account key, apply it to your namespace with kubectl:

Example

$ kubectl apply -f 03-wandb-secret.yaml
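To double-check an encoded value, it can be decoded again and compared with the original key. The sketch below uses the placeholder key from the example above, not a real key. It uses printf '%s' for the same reason the example uses echo -n: a trailing newline would otherwise end up inside the encoded token.

```shell
# Round-trip check: encode the placeholder key, then decode it again.
key="example-wanbd-key"   # placeholder from the example above, not a real key
encoded=$(printf '%s' "$key" | base64)
decoded=$(printf '%s' "$encoded" | base64 -d)

echo "$encoded"
echo "$decoded"
```

If the decoded value differs from the original key, re-encode it before applying the Secret again.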

Run the Workflow

Once all of the necessary resources are created, submit the Workflow using the Argo CLI. If it is not already installed, follow Argo's installation instructions to install the CLI tool.

To submit the Workflow to the Argo server created earlier, use the argo submit command. The -p flag can be used to set the value of any of the parameters in the YAML file inline.

Example

$ argo submit 04-finetune-workflow.yaml \
    -p run_name=finetune_gpt_neox \
    --serviceaccount finetune

Once the Workflow is submitted, its progress may be monitored from the Argo Workflows Web UI, which can be accessed via the URL provided in the application's deployment page. Retrieve this page by navigating to the Applications page on CoreWeave Cloud, then clicking on the Argo application.

Pod logs may be acquired via CLI using kubectl logs <pod name>, or by clicking on the relevant stage in the Argo Workflows Web UI.

The logs from the fine-tuning training script are available from the launcher Pod. They can be accessed via kubectl:

Example

$ kubectl logs finetune-gpt-neox-n6mnd-mpijob-launcher-xz98s

Note

Once complete, the fine-tuned model checkpoint will be saved in the neox-checkpoints PVC. The path is defined as a workflow parameter and defaults to pvc://neox-checkpoints/20B_finetuned_checkpoint.

Clean up

Once the Workflow has finished, the Pods used for all Workflow stages will move to a Completed state in order to keep the logs available for viewing. At this point, they are no longer using any compute resources, so they will not incur any cost.

The easiest way to clean all of these Pods up is by deleting the Workflow run from the Argo Workflows Web UI. You may also delete them manually using kubectl delete pod.

Deleting the Argo Workflow alone won't remove all of the resources. The MPIJob resource that was used for fine-tuning and the ConfigMap resource will still exist. To delete them, target them specifically using kubectl delete:

Example

$ kubectl delete configmap neox-training
$ kubectl delete mpijob <mpijob-name-created-by-argo>

Note

The unique name of the mpijob can be acquired using kubectl get mpijob.

Finally, remove the resources that were created prior to running the Workflow. You can delete the Argo Workflows deployment through CoreWeave's Applications UI. The PVCs, fine-tune Service Account role, and Weights and Biases secret can be deleted by targeting the files that created them with kubectl delete:

Example

$ kubectl delete -f 03-wandb-secret.yaml
$ kubectl delete -f 02-finetune-role.yaml
$ kubectl delete -f 01-pvc.yaml

Debugging Race Conditions

The GPT-NeoX code can hang indefinitely if both worker Pods are not up and running when the main container in the launcher begins to run.

This tutorial uses an init container in the launcher that sleeps for 60 seconds to prevent this problem. In most cases, this delay is sufficient, but the code can hang indefinitely if the worker Pods take longer than 60 seconds to spin up or remain in a Pending state due to a lack of resource availability. For example, there may not be enough A100s available in the selected region. If so, consider using the A40 option below.

To detect if the launcher is properly connected to both of the worker Pods, check the worker logs with kubectl logs.

If the most recent line is Accepted publickey for root..., then the launcher is connected.
On the other hand, if the most recent log is Disconnected..., then the launcher isn't currently connected.

Here is an example of a launcher properly connected to the worker:

Example

$ kubectl logs finetune-gpt-neox-n6mnd-mpijob-worker-1

Server listening on 0.0.0.0 port 22.
Server listening on :: port 22.

Accepted publickey for root from 10.145.231.160 port 38432 ssh2: ECDSA SHA256:mq7qxkWCmx7Srl2iavbJ0Dk7KsBriu1UvYnUCcruAts
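The check described above can be scripted. The following is a minimal sketch of a hypothetical helper, launcher_state, that reads worker logs on standard input and classifies the most recent non-empty line using the two rules from this section. It assumes the log lines begin exactly as in the example above.

```shell
# Classify the launcher connection state from worker logs read on stdin.
# Prints "connected", "disconnected", or "unknown" based on the most recent
# non-empty log line.
launcher_state() {
  last_line=$(grep -v '^[[:space:]]*$' | tail -n 1)
  case "$last_line" in
    "Accepted publickey for root"*) echo "connected" ;;
    "Disconnected"*)                echo "disconnected" ;;
    *)                              echo "unknown" ;;
  esac
}

# Demo using the sample worker log from the example above.
sample_log='Server listening on 0.0.0.0 port 22.
Server listening on :: port 22.

Accepted publickey for root from 10.145.231.160 port 38432 ssh2: ECDSA SHA256:mq7qxkWCmx7Srl2iavbJ0Dk7KsBriu1UvYnUCcruAts'
printf '%s\n' "$sample_log" | launcher_state
```

Against a live cluster, the same helper could be fed real logs by piping the output of kubectl logs for a worker Pod into launcher_state.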

A40 option

This tutorial uses two Nodes with A100 80GB cards and InfiniBand, for a total of 16 GPUs. This delivers very high performance, but you may find that you aren't able to schedule that many A100s on-demand, leaving the worker Pods stuck in the Pending state. If you require A100 GPUs that you can't schedule on-demand, please reach out to CoreWeave support and ask about Reserved Instances.

Alternatively, you can use A40s by changing the default parameters when submitting the workflow to Argo, as shown.

Example

$ argo submit 04-finetune-workflow.yaml \
    -p run_name=finetune_gpt_neox \
    -p trainer_gpu=A40 \
    -p use_ib=false \
    -p micro_batch_size=4 \
    -p gradient_accumulation_steps=192 \
    --serviceaccount finetune

The main differences in this workflow are:

The GPU affinity of the fine-tune job is changed to A40.
The resource request/limit of rdma/ib: 1 is changed to rdma/ib: 0.
The GPT-NeoX config reflects the smaller memory of the A40.
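To make the batch settings in the A40 command concrete, the arithmetic below works out how many samples one data-parallel replica processes per optimizer step. It assumes the usual DeepSpeed-style meaning of these two parameters, where the global batch size is the micro batch size times the gradient accumulation steps times the number of data-parallel replicas. The replica count depends on the parallelism settings in the training config, so it is left out here.

```shell
# Values taken from the A40 submit command above.
micro_batch_size=4
gradient_accumulation_steps=192

# Samples consumed by one data-parallel replica per optimizer step.
samples_per_replica=$((micro_batch_size * gradient_accumulation_steps))
echo "$samples_per_replica"
```

A small micro batch keeps per-GPU memory use down, while the larger number of accumulation steps keeps the number of samples per optimizer step high.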

More information

See these resources for more information:
