Fine-tune GPT-NeoX-20B with Argo Workflows

Use Argo Workflows to run a full pipeline which ends with distributed fine-tuning of GPT-NeoX-20B.
The Fine-tuning Machine Learning Models tutorial provides a similar example of using Argo Workflows to fine-tune a smaller model (GPT-J) on a smaller dataset. If you are new to fine-tuning and Argo Workflows, that tutorial is a great place to start.
This example uses two A100 nodes (16 GPUs in total) connected with NVIDIA's NVLink and InfiniBand technologies for highly performant distributed training.
Optional A40 configuration
If A100 resources aren't available in the selected region, A40 nodes can be substituted. See the A40 option later in this tutorial and check the Debugging section for more tips.

Source code

Throughout the rest of this document, referenced files may be found in CoreWeave's kubernetes-cloud repo in the kubeflow/training-operator/gpt-neox folder.
In this folder, you will find four numbered YAML files:
  • 01-pvc.yaml
  • 02-finetune-role.yaml
  • 03-wandb-secret.yaml
  • 04-finetune-workflow.yaml
These should be deployed in the order in which they are numbered. The first three files deploy the Kubernetes resources required by the Argo workflow, which is itself deployed using the fourth file.

Understanding the Fine-tune Workflow

The Argo workflow for fine-tuning is defined in the 04-finetune-workflow.yaml file. This file consists of three important sections:
Visualization of the DAG defined by the Argo workflow


At the top of the 04-finetune-workflow.yaml file, all of the Workflow parameters and their default values are defined. If none of these are changed, the Workflow will download and tokenize the Hackernews subset of The Pile dataset before fine-tuning GPT-NeoX-20B on it using two A100_NVLINK_80GB nodes in the LAS1 region.
If you have already downloaded the model weights, dataset, and/or tokenized the dataset, you can skip those steps by changing the respective parameters to false:
  • download_checkpoint
  • download_dataset
  • tokenize_dataset
The screenshot above shows these steps being skipped.
The fine-tuning stage uses Weights and Biases for extra logging. This is controlled by the three wandb parameters. To avoid using Weights and Biases, you can set the values of these parameters to an empty string.
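As a rough sketch of how these switches could be passed at submission time, the script below assembles an argo submit command that skips all three preparation steps and prints it for review instead of running it. The parameter names, the -p flag, and the finetune service account are taken from this tutorial; the build_submit_cmd helper and its print-only behavior are illustrative assumptions and are not part of the repo.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: builds the submit command with the three
# preparation steps disabled. Parameter names come from the tutorial.
build_submit_cmd() {
  local run_name="$1"
  local cmd=(argo submit 04-finetune-workflow.yaml
    -p "run_name=${run_name}"
    -p download_checkpoint=false
    -p download_dataset=false
    -p tokenize_dataset=false
    --serviceaccount finetune)
  # Print the command on one line so it can be reviewed before running.
  printf '%s\n' "${cmd[*]}"
}

build_submit_cmd finetune_gpt_neox
```

Printing rather than executing keeps the example safe to try; once the output looks right, the same command can be pasted into a terminal where the Argo CLI is configured.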

DAG definition

The DAG is defined in the Workflow file under the main template. The graph is organized based on the dependencies of all the stages. For example, the two download stages don't have any dependencies, so they both run in parallel right away.
It is not demonstrated in this example, but it is possible to have more advanced logic around dependencies, such as running certain stages based on the completion status of another stage.

Source code

The code that performs the dataset tokenization and fine-tuning is taken directly from EleutherAI's gpt-neox repo. It is included in the Docker image built from CoreWeave's ml-containers repo.

Training config

The config file containing all of the training-related parameters is stored within a Kubernetes ConfigMap. If you would like to adjust these parameters, the ConfigMap can be edited within the 04-finetune-workflow.yaml file in the create-configmap stage definition.

Fine-tuning resources

By default, this Workflow uses two A100 nodes with NVLINK and RDMA over InfiniBand for extremely fast distributed training. These resources are defined in the Worker spec of the MPIJob at the bottom of the Workflow's YAML manifest:
requests:
  cpu: 48
  memory: 256Gi
  nvidia.com/gpu: 8
  rdma/ib: 1
limits:
  nvidia.com/gpu: 8
  rdma/ib: 1

RDMA over InfiniBand

As shown in this example, the rdma/ib resource requests that Remote Direct Memory Access (RDMA) be performed using InfiniBand. RDMA allows one computer to access the memory of another directly, without involving the operating system of either machine. In this setup, that traffic is carried over InfiniBand.
Requesting this resource offers a big boost to distributed training performance. However, it is currently only available for A100 and H100 GPU node types on CoreWeave Cloud. Learn more about InfiniBand on CoreWeave Cloud.


Before running the Workflow, a few things need to be created in your namespace.
This guide assumes that you have already followed the process to set up the CoreWeave Kubernetes environment. If you have not done so already, follow our Getting Started guide before proceeding with this guide.

Argo Workflows

To run an Argo workflow, first deploy the Argo Workflows application in your namespace via CoreWeave's Applications Catalog.
Additional Information
For more information on Argo Workflows, see Workflows.


The Argo workflow uses Persistent Volume Claims (PVCs) to store the dataset and model checkpoints. The PVCs are defined in 01-pvc.yaml and can be deployed with kubectl:
kubectl apply -f 01-pvc.yaml
To inspect the contents of the newly created PVCs in your browser, you can deploy a FileBrowser application and attach the PVCs to it.

Fine-tune role

This Argo Workflow creates new Kubernetes resources in your environment: a ConfigMap and an MPIJob. To create these, the Workflow needs to run as a Service Account with the necessary permissions granted to it.
The finetune Service Account and the permissions that the Workflow will later use are defined in 02-finetune-role.yaml. Apply it to your namespace with kubectl:
kubectl apply -f 02-finetune-role.yaml

Weights and Biases secret

If you would like to take advantage of Weights and Biases logging during fine-tuning, create a Secret that contains your WandB account key. To do this, first acquire your key from Weights and Biases and encode it using base64.
$ echo -n "example-wanbd-key" | base64
Then, copy the encoded value into line 3 of 03-wandb-secret.yaml.
When complete, the file should look like this:
apiVersion: v1
data:
  token: ZXhhbXBsZS13YW5iZC1rZXk=
kind: Secret
metadata:
  name: wandb-token-secret
type: Opaque
Once the file is updated with your encoded account key, apply it to your namespace with kubectl:
kubectl apply -f 03-wandb-secret.yaml
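Because a stray newline or a copy-paste slip in the encoded value is easy to miss, it can help to decode the string again and compare it with the original key. The sketch below does that round trip for the placeholder key used in this section; the variable names are illustrative, and it assumes a base64 tool that accepts --decode.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder key from the example above; substitute your real key.
key="example-wanbd-key"

# printf avoids the trailing newline that a plain echo would add.
encoded="$(printf '%s' "$key" | base64)"
decoded="$(printf '%s' "$encoded" | base64 --decode)"

echo "encoded: ${encoded}"
if [ "$decoded" = "$key" ]; then
  echo "round trip OK"
else
  echo "round trip FAILED" >&2
  exit 1
fi
```

For the placeholder key, the encoded value matches the token shown in the Secret manifest above.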

Run the Workflow

Once all of the necessary resources are created, submit the Workflow using the Argo CLI. If it is not already installed, follow Argo's installation instructions to install the CLI tool.
To submit the Workflow to the Argo server created earlier, use the argo submit command. The -p flag can be used to set the value of any parameter defined in the YAML file inline.
argo submit 04-finetune-workflow.yaml \
-p run_name=finetune_gpt_neox \
--serviceaccount finetune
Once the Workflow is submitted, its progress may be monitored from the Argo Workflows Web UI, which can be accessed via the URL provided in the application's deployment page. Retrieve this page by navigating to the Applications page on CoreWeave Cloud, then clicking on the Argo application.
Pod logs may be acquired via CLI using kubectl logs <pod name>, or by clicking on the relevant stage in the Argo Workflows Web UI.
Argo Workflow right after submission
The logs from the fine-tuning training script are available from the launcher Pod. They can be accessed via kubectl:
kubectl logs finetune-gpt-neox-n6mnd-mpijob-launcher-xz98s
Once complete, the fine-tuned model checkpoint will be saved in the neox-checkpoints PVC. The path is defined as a workflow parameter and defaults to pvc://neox-checkpoints/20B_finetuned_checkpoint.

Clean up

Once the Workflow has finished, the Pods used for all Workflow stages will move to a Completed state in order to keep the logs available for viewing. At this point, they are no longer using any compute resources, so they will not incur any cost.
The easiest way to clean all of these Pods up is by deleting the Workflow run from the Argo Workflows Web UI. You may also delete them manually using kubectl delete pod.
Deleting the Argo Workflow alone won't remove all of the resources. The mpijob resource that was used for fine-tuning, and the configmap resource will still exist. To delete them, target them specifically using kubectl delete:
kubectl delete configmap neox-training
kubectl delete mpijob <mpijob-name-created-by-argo>
The unique name of the mpijob can be acquired using kubectl get mpijob.
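When several MPIJobs exist in the namespace, picking out the one created by this Workflow can be scripted. The following is a minimal sketch: filter_mpijobs is a hypothetical helper that reads name-style output (one resource/name entry per line, as kubectl get mpijob -o name is expected to print) and keeps only names beginning with a given prefix. The sample names, including the assumption that the MPIJob name begins with the same finetune-gpt-neox prefix seen in the launcher Pod name earlier, are illustrative.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: reads `kubectl get mpijob -o name` style output on
# stdin and prints the bare names that start with the given prefix.
filter_mpijobs() {
  local prefix="$1"
  # Strip the "<resource>/" part, then keep names with the prefix.
  sed 's|^.*/||' | grep -- "^${prefix}" || true
}

# Demo with sample input; the resource names here are illustrative.
filter_mpijobs finetune-gpt-neox <<'EOF'
mpijob.kubeflow.org/finetune-gpt-neox-n6mnd-mpijob
mpijob.kubeflow.org/some-other-job
EOF
```

Against a live cluster, the same helper could be fed with kubectl get mpijob -o name | filter_mpijobs finetune-gpt-neox, and each resulting name passed to kubectl delete mpijob after review.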
Finally, remove the resources that were created prior to running the Workflow. You can delete the Argo Workflows deployment through CoreWeave's Applications UI. The PVCs, the fine-tune Service Account role, and the Weights and Biases secret can be deleted by targeting the files that created them with kubectl delete:
kubectl delete -f 03-wandb-secret.yaml
kubectl delete -f 02-finetune-role.yaml
kubectl delete -f 01-pvc.yaml

Debugging Race Conditions

The GPT-NeoX code can hang indefinitely if both worker Pods are not up and running when the main container in the launcher begins to run.
This tutorial uses an init container in the launcher that sleeps for 60 seconds to prevent this problem. In most cases, this delay is sufficient, but the code can hang indefinitely if the worker Pods take longer than 60 seconds to spin up or remain in a Pending state due to a lack of resource availability. For example, there may not be enough A100s available in the selected region. If so, consider using the A40 option below.
To detect if the launcher is properly connected to both of the worker Pods, check the worker logs with kubectl logs.
  • If the most recent line is Accepted publickey for root..., then the launcher is connected.
  • On the other hand, if the most recent log is Disconnected..., then the launcher isn't currently connected.
Here is an example of a launcher properly connected to the worker:
$ kubectl logs finetune-gpt-neox-n6mnd-mpijob-worker-1
Server listening on port 22.
Server listening on :: port 22.
Accepted publickey for root from port 38432 ssh2: ECDSA SHA256:mq7qxkWCmx7Srl2iavbJ0Dk7KsBriu1UvYnUCcruAts
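The check described above can be sketched as a small shell function. worker_connection_state is a hypothetical helper, not part of the repo: it reads worker logs on standard input and classifies the most recent non-empty line. It assumes log lines begin exactly as in the sample above, with no timestamp or other prefix.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: reads worker logs on stdin and reports the
# connection state based on the most recent non-empty line.
worker_connection_state() {
  local last
  last="$(grep -v '^[[:space:]]*$' | tail -n 1 || true)"
  case "$last" in
    "Accepted publickey for root"*) echo "connected" ;;
    "Disconnected"*)                echo "disconnected" ;;
    *)                              echo "unknown" ;;
  esac
}

# Demo using the sample log shown above.
worker_connection_state <<'EOF'
Server listening on port 22.
Server listening on :: port 22.
Accepted publickey for root from port 38432 ssh2: ECDSA SHA256:mq7qxkWCmx7Srl2iavbJ0Dk7KsBriu1UvYnUCcruAts
EOF
```

With a running job, the logs can be piped in directly, for example kubectl logs finetune-gpt-neox-n6mnd-mpijob-worker-1 | worker_connection_state. An "unknown" result simply means the last line matched neither pattern and the logs should be read manually.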

A40 option

This tutorial uses two Nodes with A100 80GB cards and InfiniBand, for a total of 16 GPUs. This delivers very high performance, but you may find that you aren't able to schedule that many A100s on-demand, leaving the worker Pods stuck in the Pending state. If you require A100 GPUs that you can't schedule on-demand, please reach out to CoreWeave support and ask about Reserved Instances.
Alternatively, you can use A40s by changing the default parameters when submitting the workflow to Argo, as shown:
argo submit 04-finetune-workflow.yaml \
-p run_name=finetune_gpt_neox \
-p trainer_gpu=A40 \
-p use_ib=false \
-p micro_batch_size=4 \
-p gradient_accumulation_steps=192 \
--serviceaccount finetune
The main differences in this workflow are:
  • GPU affinity of the fine-tune job has been changed to A40
  • the resource request/limit of rdma/ib: 1 is changed to rdma/ib: 0
  • the GPT-NeoX config reflects the smaller memory of the A40
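To see how the two batch-related parameters interact, recall that in DeepSpeed-style training the number of samples per optimizer step is the micro batch size multiplied by the gradient accumulation steps and by the number of data-parallel replicas. The sketch below computes that product for the values in the command above. The global_batch_size helper is illustrative, and the replica count of 1 is an assumption made only for the example, since the actual value depends on the parallelism settings in the training config.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Samples per optimizer step = micro batch x accumulation steps x
# number of data-parallel replicas.
global_batch_size() {
  local micro_batch="$1" grad_accum="$2" dp_replicas="$3"
  echo $(( micro_batch * grad_accum * dp_replicas ))
}

# Values from the A40 command above; a single data-parallel replica
# is an assumption made only for this illustration.
global_batch_size 4 192 1
```

A smaller micro batch lowers per-GPU memory use, while a larger accumulation count keeps the number of samples per optimizer step from shrinking along with it.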

More information

See these resources for more information: