Fine-tune GPT-NeoX-20B with Argo Workflows
Use Argo Workflows to run a full pipeline which ends with distributed fine-tuning of GPT-NeoX-20B.
The Fine-tuning Machine Learning Models tutorial provides a similar example of using Argo Workflows to fine-tune a smaller model (GPT-J) on a smaller dataset. If you are new to fine-tuning and Argo Workflows, that tutorial is a great place to start.
This example uses two A100 nodes (16 GPUs in total) with NVIDIA NVLink and InfiniBand interconnects for highly performant distributed training.
Optional A40 configuration
If A100 resources aren't available in the selected region, A40 nodes can be substituted. See the A40 option later in this tutorial and check the Debugging section for more tips.
Throughout the rest of this document, the referenced files may be found in CoreWeave's kubernetes-cloud repo, in the kubeflow/training-operator/gpt-neox folder.
In this folder, you will find numbered YAML files:
01-pvc.yaml
02-finetune-role.yaml
03-wandb-secret.yaml
04-finetune-workflow.yaml
These should be deployed in the order in which they are numbered. The first three YAML files (01-pvc.yaml, 02-finetune-role.yaml, and 03-wandb-secret.yaml) deploy the Kubernetes resources required by the Argo workflow, which is itself deployed using the fourth file (04-finetune-workflow.yaml).

The Argo workflow for fine-tuning is defined in the 04-finetune-workflow.yaml file. This file consists of three important sections: the workflow parameters, the DAG definition, and the step definitions.
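To give a sense of how these three sections fit together, here is a heavily simplified sketch of an Argo Workflow with the same overall shape. It is illustrative only; the names, images, and values below are placeholders, not the actual contents of 04-finetune-workflow.yaml.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: finetune-gpt-neox-     # placeholder name
spec:
  entrypoint: main
  arguments:
    parameters:                        # workflow parameters with default values
      - name: download_dataset
        value: "true"
      - name: region
        value: LAS1
  templates:
    - name: main                       # the DAG that wires the stages together
      dag:
        tasks:
          - name: download-dataset
            template: download-dataset
          - name: tokenize-dataset
            template: tokenize-dataset
            dependencies: [download-dataset]
    - name: download-dataset           # step definitions referenced by the DAG
      container:
        image: busybox                 # placeholder image
        command: ["echo", "download"]
    - name: tokenize-dataset
      container:
        image: busybox
        command: ["echo", "tokenize"]
```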

Visualization of the DAG defined by the Argo workflow
At the top of the 04-finetune-workflow.yaml file, all of the Workflow parameters and their default values are defined. If none of these are changed, the Workflow will download and tokenize the Hackernews subset of The Pile dataset before fine-tuning GPT-NeoX-20B on it using two A100_NVLINK_80GB nodes in the LAS1 region.

If you have already downloaded the model weights, downloaded the dataset, and/or tokenized the dataset, you can skip those steps by changing the respective parameters to false:
- download_checkpoint
- download_dataset
- tokenize_dataset
Note
The screenshot above shows these steps being skipped.
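For example, if a previous run has already populated the PVCs, all three stages can be skipped at submission time by overriding those parameters with the -p flag. This sketch simply combines the parameters above with the submission command shown later in this guide:

```bash
argo submit 04-finetune-workflow.yaml \
    -p run_name=finetune_gpt_neox \
    -p download_checkpoint=false \
    -p download_dataset=false \
    -p tokenize_dataset=false \
    --serviceaccount finetune
```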
The fine-tuning stage uses Weights and Biases for extra logging. This is controlled by the three wandb parameters. To avoid using Weights and Biases, you can set the values of these parameters to an empty string.

The DAG is defined in the Workflow file under the main template. The graph is organized based on the dependencies of the stages. For example, the two download stages don't have any dependencies, so they both run in parallel right away.
Note
Although it is not demonstrated in this example, it is possible to implement more advanced logic around dependencies, such as running certain stages based on the completion status of another stage.
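One way to express that kind of logic in Argo is the enhanced depends field, which can key off a task's completion status. A hypothetical sketch, with made-up task names, is shown below; this Workflow does not use it.

```yaml
dag:
  tasks:
    - name: download-dataset
      template: download-dataset
    - name: tokenize-dataset
      template: tokenize-dataset
      # run only if the download task succeeded
      depends: "download-dataset.Succeeded"
    - name: cleanup
      template: cleanup
      # run whether the tokenize task succeeded or failed
      depends: "tokenize-dataset.Succeeded || tokenize-dataset.Failed"
```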
The code that performs the dataset tokenization and fine-tuning comes straight from EleutherAI's gpt-neox repo. This code is built into a Docker image built from CoreWeave's ml-containers repo.

The config file containing all of the training-related parameters is stored within a Kubernetes ConfigMap. If you would like to adjust these parameters, the ConfigMap can be edited within the 04-finetune-workflow.yaml file in the create-configmap stage definition.

By default, this Workflow uses two A100 nodes with NVLink and RDMA over InfiniBand for extremely fast distributed training. These resources are defined in the Worker spec of the MPIJob at the bottom of the Workflow's YAML manifest:

```yaml
resources:
  requests:
    cpu: 48
    memory: 256Gi
    nvidia.com/gpu: 8
    rdma/ib: 1
  limits:
    nvidia.com/gpu: 8
    rdma/ib: 1
```
As shown in this example, the rdma/ib resource requests that Remote Direct Memory Access (RDMA) be performed using InfiniBand. RDMA allows direct memory access from the memory of one computer into that of another without involving the operating system of either machine, which is accomplished using InfiniBand packets over Ethernet.

Requesting this resource offers a big boost to distributed training performance; however, it is currently only available for A100 and H100 GPU node types on CoreWeave Cloud. Learn more about InfiniBand on CoreWeave Cloud.
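Relatedly, the GPU type itself is selected by the Worker Pods' node affinity rather than by the resource requests. On CoreWeave, this is typically done by matching the gpu.nvidia.com/class node label; the following is a rough sketch of what such an affinity block looks like, where the value should correspond to the Workflow's trainer_gpu parameter (A100_NVLINK_80GB by default):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: gpu.nvidia.com/class   # CoreWeave node label for GPU class
              operator: In
              values:
                - A100_NVLINK_80GB
```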
Before running the Workflow, a few things need to be created in your namespace.
Note
This guide assumes that you have already followed the process to set up the CoreWeave Kubernetes environment. If you have not done so already, follow our Getting Started guide before proceeding with this guide.
To run an Argo workflow, first deploy the Argo Workflows application in your namespace via CoreWeave's applications Catalog.
The Argo workflow uses Persistent Volume Claims (PVCs) to store the dataset and model checkpoints. The PVCs are defined in 01-pvc.yaml and can be deployed with kubectl:

```bash
kubectl apply -f 01-pvc.yaml
```
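If you're curious what such a PVC looks like, here is a minimal sketch of a shared-filesystem PVC. The storage class and size shown are assumptions for illustration; refer to 01-pvc.yaml for the actual definitions.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: neox-checkpoints              # one of the PVCs this Workflow expects
spec:
  accessModes:
    - ReadWriteMany                   # shared storage so all Pods can mount it
  storageClassName: shared-nvme-las1  # assumed CoreWeave shared storage class
  resources:
    requests:
      storage: 1Ti                    # placeholder size
```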
Optional
You can deploy a FileBrowser application and attach the newly created PVCs in order to inspect their contents in your browser.
This Argo Workflow involves creating new Kubernetes resources in your environment: a ConfigMap and an MPIJob. In order to create these, the Workflow needs to run as a Service Account with the necessary permissions granted to it.
The finetune Service Account and its permissions, which will later be used by the workflow, are defined in 02-finetune-role.yaml. Apply it to your namespace with kubectl:

```bash
kubectl apply -f 02-finetune-role.yaml
```
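Conceptually, the file bundles a ServiceAccount with a Role and RoleBinding that allow creating the ConfigMap and MPIJob. The sketch below shows the general shape of such a manifest; it is not the literal contents of 02-finetune-role.yaml, and the rule lists and names may differ.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: finetune
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: finetune-role
rules:
  # the Workflow creates a ConfigMap and an MPIJob, so the Service Account
  # needs write access to those resource types
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create", "get", "list", "delete"]
  - apiGroups: ["kubeflow.org"]
    resources: ["mpijobs"]
    verbs: ["create", "get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: finetune-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: finetune-role
subjects:
  - kind: ServiceAccount
    name: finetune
```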
If you would like to take advantage of Weights and Biases logging during fine-tuning, create a Secret that contains your WandB account key. To do this, first acquire your key from Weights and Biases and encode it using base64:

```bash
$ echo -n "example-wandb-key" | base64
ZXhhbXBsZS13YW5kYi1rZXk=
```

Then, copy the encoded value into line 3 of 03-wandb-secret.yaml. When complete, the file should look like this:
```yaml
apiVersion: v1
data:
  token: ZXhhbXBsZS13YW5kYi1rZXk=
kind: Secret
metadata:
  name: wandb-token-secret
type: Opaque
```
Once the file is updated with your encoded account key, apply it to your namespace with kubectl:

```bash
kubectl apply -f 03-wandb-secret.yaml
```
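As an alternative to editing the file and encoding the key by hand, a Secret with the same name and key can be created directly with kubectl, which performs the base64 encoding for you:

```bash
# creates a Secret named wandb-token-secret with a single "token" key
kubectl create secret generic wandb-token-secret \
    --from-literal=token="example-wandb-key"
```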
Once all of the necessary resources are created, submit the Workflow using the Argo CLI. If it is not already installed, follow Argo's installation instructions to install the CLI tool.
To submit the Workflow to the Argo server created earlier, use the argo submit command. The -p flag can be used to override the value of any of the parameters defined in the YAML file inline:

```bash
argo submit 04-finetune-workflow.yaml \
    -p run_name=finetune_gpt_neox \
    --serviceaccount finetune
```
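The Argo CLI can also be used to keep an eye on the run from the terminal, for example:

```bash
# list Workflow runs in the namespace
argo list

# follow the most recently submitted Workflow until it completes
# (the @latest shorthand is supported by recent versions of the Argo CLI)
argo watch @latest
```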
Once the Workflow is submitted, its progress may be monitored from the Argo Workflows Web UI, which can be accessed via the URL provided in the application's deployment page. Retrieve this page by navigating to the Applications page on CoreWeave Cloud, then clicking on the Argo application.
Pod logs may be acquired via the CLI using kubectl logs <pod name>, or by clicking on the relevant stage in the Argo Workflows Web UI.
Argo Workflow right after submission
The logs from the fine-tuning training script are available from the launcher Pod. They can be accessed via kubectl:

```bash
kubectl logs finetune-gpt-neox-n6mnd-mpijob-launcher-xz98s
```
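The exact Pod name is generated for each run, so one way to find it and stream the logs is:

```bash
# find the launcher Pod created by the MPIJob for this run
kubectl get pods | grep launcher

# stream the training logs as they are produced
kubectl logs -f <launcher-pod-name>
```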
Note
Once complete, the fine-tuned model checkpoint will be saved in the neox-checkpoints PVC. The path is defined as a workflow parameter and defaults to pvc://neox-checkpoints/20B_finetuned_checkpoint.

Once the Workflow has finished, the Pods used for all Workflow stages will move to a Completed state in order to keep the logs available for viewing. At this point, they are no longer using any compute resources, so they will not incur any cost. The easiest way to clean up all of these Pods is by deleting the Workflow run from the Argo Workflows Web UI. You may also delete them manually using kubectl delete pod.

Deleting the Argo Workflow alone won't remove all of the resources. The mpijob resource that was used for fine-tuning and the configmap resource will still exist. To delete them, target them specifically using kubectl delete:

```bash
kubectl delete configmap neox-training
kubectl delete mpijob <mpijob-name-created-by-argo>
```

Note
The unique name of the mpijob can be acquired using kubectl get mpijob.

Finally, remove the resources that were created prior to running the Workflow. You can delete the Argo Workflows deployment through CoreWeave's Applications UI. The PVCs, the finetune Service Account role, and the Weights and Biases Secret can be deleted by targeting the files that created them with kubectl delete:

```bash
kubectl delete -f 03-wandb-secret.yaml
kubectl delete -f 02-finetune-role.yaml
kubectl delete -f 01-pvc.yaml
```
The GPT-NeoX code can hang indefinitely if both worker Pods are not up and running when the main container in the launcher begins to run.
This tutorial uses an init container in the launcher that sleeps for 60 seconds to prevent this problem. In most cases, this delay is sufficient, but the code can hang indefinitely if the worker Pods take longer than 60 seconds to spin up or remain in a Pending state due to a lack of resource availability. For example, there may not be enough A100s available in the selected region. If so, consider using the A40 option below.

To detect whether the launcher is properly connected to both of the worker Pods, check the worker logs with kubectl logs.
- If the most recent line is Accepted publickey for root..., then the launcher is connected.
- On the other hand, if the most recent log is Disconnected..., then the launcher isn't currently connected.
Here is an example of a launcher properly connected to the worker:
```bash
$ kubectl logs finetune-gpt-neox-n6mnd-mpijob-worker-1
Server listening on 0.0.0.0 port 22.
Server listening on :: port 22.
Accepted publickey for root from 10.145.231.160 port 38432 ssh2: ECDSA SHA256:mq7qxkWCmx7Srl2iavbJ0Dk7KsBriu1UvYnUCcruAts
```
This tutorial uses two Nodes with A100 80GB cards and InfiniBand, for a total of 16 GPUs. This delivers very high performance, but you may find that you aren't able to schedule that many A100s on-demand, leaving the worker Pods stuck in the Pending state. If you require A100 GPUs that you can't schedule on-demand, please reach out to CoreWeave support and ask about Reserved Instances. Alternatively, you can use A40s by changing the default parameters when submitting the workflow to argo, as shown:

```bash
argo submit 04-finetune-workflow.yaml \
    -p run_name=finetune_gpt_neox \
    -p trainer_gpu=A40 \
    -p use_ib=false \
    -p micro_batch_size=4 \
    -p gradient_accumulation_steps=192 \
    --serviceaccount finetune
```
The main differences in this workflow are:
- The GPU affinity of the fine-tune job is changed to A40
- The resource request/limit of rdma/ib: 1 is changed to rdma/ib: 0
- The GPT-NeoX config reflects the smaller memory of the A40 (see the note on batch size below)
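As a rule of thumb, GPT-NeoX's effective global batch size is micro_batch_size × gradient_accumulation_steps × the number of data-parallel replicas. Lowering the micro-batch size to fit in the A40's smaller memory while raising gradient accumulation keeps that effective batch size roughly comparable to the A100 configuration; the exact numbers depend on the parallelism settings in the config.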
See these resources for more information: