Finetuning GPT-NeoX 20B using DeterminedAI
This page details the setup and process to train or finetune a GPT-NeoX 20B parameter model on CoreWeave Cloud.

Introduction

GPT-NeoX is a 20B parameter autoregressive model trained on the Pile dataset.
It generates text based on context or unconditionally, for use cases such as story generation, chatbots, summarization, and so on.
The model was trained on CoreWeave infrastructure, and the weights are made available under a permissive license. Depending on your requirements and use case, the base model may already produce high-quality text; however, many customers have seen drastically improved results by finetuning the model with data specific to their use case.
This guide will use the DeterminedAI MLOps platform to run distributed finetuning jobs on the model.

Setup

Note
This guide makes several assumptions:
• You have set up the CoreWeave Kubernetes environment.
• You have some experience launching and using DeterminedAI on CoreWeave Cloud. (If you have not done so already, it is recommended to deploy DeterminedAI via the application Catalog to familiarize yourself with it.)
• You have git installed in your terminal.
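As a quick sanity check, you can confirm these prerequisites from your local terminal. This is an optional sketch that assumes kubectl is already configured with your CoreWeave kubeconfig and that the det CLI has been installed (for example, with pip install determined):

# Confirm kubectl is configured with your CoreWeave kubeconfig
kubectl config current-context
kubectl get pvc   # should succeed, even if the list is empty

# Confirm git is available locally
git --version

# Confirm the Determined CLI is installed (e.g. pip install determined)
det -h | head -n 5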

Create a Shared Filesystem storage volume

First, create a Shared Filesystem storage volume from the Storage menu on the CoreWeave Cloud UI. This volume will be used to store the model weights as well as training data for finetuning. Shared storage volumes can be accessed by many nodes at once in CoreWeave, allowing for massive amounts of compute power to access the same dataset.
You can use the values shown and described below for this tutorial.
Create a New Volume on the Storage menu from the Cloud UI
The values used for this demo are as follows:
Volume Name: finetune-gpt-neox
Region: LAS1
Disk Class: HDD
Storage Type: Shared Filesystem
Size (Gi): 1,000
Note
If needed, it is easy to increase the size of a storage volume later.
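If you prefer the command line, you can confirm that the volume was provisioned as a shared filesystem. This optional check assumes kubectl is configured for the namespace in which the volume was created; the ReadWriteMany (RWX) access mode is what allows many nodes to mount the volume at once:

# List the PVC created in the Cloud UI and confirm its access mode
kubectl get pvc finetune-gpt-neox

# A Shared Filesystem volume should report the ReadWriteMany access mode
kubectl get pvc finetune-gpt-neox -o jsonpath='{.spec.accessModes}'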

Install the Filebrowser application

The filebrowser application, available through the application Catalog, gives you access to your storage volumes via a Web interface from which you can upload and download files and folders.
It is recommended that the name you give this filebrowser application be very short, or you will run into SSL CNAME issues. We recommend finetune.
Simply select the finetune-gpt-neox PVC that you created earlier. Make sure that you actually add your PVC to the filebrowser list of mounts!
The filebrowser application in the Cloud UI application Catalog
Note
Installing the filebrowser application is very helpful for this process. As an alternative, you may prefer to launch a Virtual Server or Kubernetes Pod that mounts your PVC, and interact with it via SSH or another mechanism.
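As a sketch of that alternative, the following illustrative Pod manifest mounts the finetune-gpt-neox PVC so that you can copy files into it with kubectl cp or an interactive shell. The Pod name and image are assumptions for illustration only:

# Launch a small utility Pod that mounts the shared volume (illustrative example)
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: finetune-data-shell
spec:
  containers:
    - name: shell
      image: ubuntu:22.04
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: finetune-data
          mountPath: /mnt/finetune-gpt-neox
  volumes:
    - name: finetune-data
      persistentVolumeClaim:
        claimName: finetune-gpt-neox
EOF

# Open a shell inside the Pod to inspect or copy data
kubectl exec -it finetune-data-shell -- bash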

Install the Determined application

From the application Catalog, search for determined. This will bring up the Determined.ai (determined) application, which you can then deploy into your cluster.
The DeterminedAI application in the Cloud UI application Catalog
The installation values should look similar to the ones shown and described below.
First, create an object storage bucket, which will be used to store checkpoints. You will then have access to <YOUR_ACCESS_KEY> and <YOUR_SECRET_KEY>.
Note
Object storage is currently in beta. Please contact support for access.
The values used for this demonstration are as follows:

Region: LAS1 (Las Vegas)

Default Resources

Default resources: 8 vCPUs
Memory request: 256Gi
GPU Type: A40

Object Storage Configuration

Bucket Name: model-checkpoints
Access Key: <YOUR_ACCESS_KEY> (replace with your actual Object Storage access key)
Secret Key: <YOUR_SECRET_KEY> (replace with your actual Object Storage secret key)
Note
You will acquire the ACCESS_KEY and SECRET_KEY once object storage has been configured for you. Contact support for more information.
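Once you have the keys, you can optionally verify access to the bucket from your terminal before wiring it into Determined. This sketch assumes the AWS CLI is installed; the LAS1 endpoint URL shown is an assumption, so confirm the correct endpoint with support:

# Store your CoreWeave Object Storage credentials in a named profile (profile name is illustrative)
aws configure set aws_access_key_id <YOUR_ACCESS_KEY> --profile coreweave
aws configure set aws_secret_access_key <YOUR_SECRET_KEY> --profile coreweave

# List the checkpoint bucket (endpoint URL is an assumed LAS1 endpoint)
aws s3 ls s3://model-checkpoints --profile coreweave --endpoint-url https://object.las1.coreweave.com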

Attaching the volume

Click + to attach the finetune-gpt-neox volume.
The attachment configuration screen for the DeterminedAI application
As shown above, for this tutorial we are attaching the finetune-gpt-neox volume on the mount path /mnt/finetune-gpt-neox.

Determined Web UI

After deploying the DeterminedAI application, a URL to the Web UI will be provided to you. You can use this UI to monitor experiments and check logs.
The DeterminedAI Web UI
As an example, here is what a live experiment looks like when viewed from the Web UI:
A live experiment running in the DeterminedAI Web UI
Navigating to the Logs tab will give you a full output of the experiment's logs:
Log output from the DeterminedAI Web UI
Navigating to the Overview tab gives you access to a metrics visualization for the experiment and a reference to its checkpoints.
Metrics visualization in the DeterminedAI Web UI

Training

Configure your dataset

Important
Run the export DET_MASTER=...ord1.ingress.coreweave.cloud:80 command, found in the post-installation notes of the DeterminedAI deployment, before running the next command.
Clone the GPT-NeoX repository to your CoreWeave Cloud Storage in a terminal:
det cmd run 'git clone https://github.com/EleutherAI/gpt-neox.git /mnt/finetune-gpt-neox/gpt-neox'
Then, download the Vocab and Merge files to your CoreWeave Cloud Storage in a terminal:
det cmd run 'wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -O /mnt/finetune-gpt-neox/gpt2-vocab.json'
det cmd run 'wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -O /mnt/finetune-gpt-neox/gpt2-merges.txt'
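To confirm that the repository and tokenizer files landed on the shared volume, you can run a quick listing through Determined, using the same det cmd run pattern as above (optional check):

det cmd run 'ls -lh /mnt/finetune-gpt-neox/gpt-neox /mnt/finetune-gpt-neox/gpt2-vocab.json /mnt/finetune-gpt-neox/gpt2-merges.txt'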

Dataset Format

Datasets for GPT-NeoX should be provided as one large file in JSONL format. To prepare your own dataset for training with custom data, format it as a single JSONL file in which each line is a JSON dictionary representing a separate document.
The document text should be stored under a single JSON key, i.e. "text".

Example

{"text": "We have received the water well survey for the N. Crawar site. The Crane \nCounty Water District owns 11 water wells within a one mile radius of the \nsite. The nearest well is located about 1000 feet east of the site. It \nappears that all of the wells are completed into the uppermost aquifer and, \nin general, are screened from 50' bgs to total depth at about 140' bgs. The \nshallow water table at the site and in the nearby wells is at about 50' bgs. \nThe groundwater flow direction at the site has consistently been toward due \nsouth. There are no water supply wells within one mile due south of the site. \nThere are two monitor wells at the east side of the site which have always \nproduced clean samples. The remediation system for this site should be \noperational by April 1, 2001. We will also have more current groundwater \nsampling information for the site within the next few weeks.", "meta": {}}
{"text": "Roger-\nWe will require off-site access for the installation of 10 remediation wells \nat the North Crawar facility (formerly owned by TW but now owned by Duke). I \nwill drop off in your office a site diagram that indicates the location of \nthe wells relative to the facility and the existing wells. I believe that the \nadjacent property owner had been contacted prior to well installations in \nAugust 1997, but I am not familiar with the details of the access agreement \nor even who within ET&S made the arrangements. We are shooting for \nearly-October for the well installations. We may also want to address \ncontinued access to the wells in an agreement with the landowner (a 5-10 year \nterm should be sufficient). Give me a call at x67327 if you have any \nquestions.\nThanks,\nGeorge", "meta": {}}
{"text": "Larry, the attached file contains a scanned image of a story that was \npublished in The Monahans News, a weekly paper, on Thursday, April 20, 2000. \nI've shown the story to Bill, and he suggested that you let Rich Jolly know \nabout the story.\n\nThanks, George\n\n\n---------------------- Forwarded by George Robinson/OTS/Enron on 04/24/2000 \n03:29 PM ---------------------------\n\n04/24/2000 03:21 PM\nMichelle Muniz\nMichelle Muniz\nMichelle Muniz\n04/24/2000 03:21 PM\n04/24/2000 03:21 PM\nTo: George Robinson/OTS/[email protected]\ncc: \n\nSubject: Newspaper\n\nI hope this works. MM", "meta": {}}
There are several standard datasets that you can leverage for testing.
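Before uploading your data to the shared volume in the next step, it can be useful to sanity-check that every line parses as JSON and contains the "text" key. A minimal sketch, assuming the file is named data.jsonl and Python 3 is available locally:

# Validate data.jsonl locally before uploading it with filebrowser
python3 - <<'EOF'
import json

path = "data.jsonl"  # adjust to the location of your local dataset

with open(path) as f:
    for i, line in enumerate(f, 1):
        doc = json.loads(line)                 # raises if the line is not valid JSON
        if "text" not in doc:
            raise SystemExit(f"line {i} is missing the 'text' key")

print("data.jsonl looks valid")
EOF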

Pre-processing your dataset

Upload your data as a single JSONL file called data.jsonl to the filebrowser under finetune-gpt-neox.
Using the filebrowser app, create a new folder called gpt_finetune under the finetune-gpt-neox folder.
Creating the gpt_finetune directory in filebrowser
You can now pre-tokenize your data using tools/preprocess_data.py. The arguments for this utility are listed below.
preprocess_data.py Arguments
The command to tokenize your data and output it to gpt_finetune is below:
python tools/preprocess_data.py \
--input /mnt/finetune-gpt-neox/data.jsonl \
--output-prefix ./gpt_finetune/mydataset \
--vocab /mnt/finetune-gpt-neox/gpt2-vocab.json \
--merge-file /mnt/finetune-gpt-neox/gpt2-merges.txt \
--dataset-impl mmap \
--tokenizer-type GPT2BPETokenizer \
--append-eod
Run this command to pre-process and tokenize your data:
det cmd run 'apt-get -y install libopenmpi-dev; pip install -r /mnt/finetune-gpt-neox/gpt-neox/requirements/requirements.txt; pip install protobuf==3.20; python /mnt/finetune-gpt-neox/gpt-neox/tools/preprocess_data.py --input /mnt/finetune-gpt-neox/data.jsonl --output-prefix /mnt/finetune-gpt-neox/gpt_finetune/mydataset --vocab /mnt/finetune-gpt-neox/gpt2-vocab.json --merge-file /mnt/finetune-gpt-neox/gpt2-merges.txt --dataset-impl mmap --tokenizer-type GPT2BPETokenizer --append-eod'
Important
Tokenized data will be saved out to two files:
<data-dir>/<dataset-name>/<dataset-name>_text_document.bin and <data-dir>/<dataset-name>/<dataset-name>_text_document.idx.
You will need to add the prefix that both these files share to your training configuration file under the data-path field.
After the command completes, you should see the tokenized data files in the gpt_finetune folder.
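You can confirm this by listing the output directory through Determined; the two files produced by the --output-prefix used above should be present (file sizes will depend on your dataset):

det cmd run 'ls -lh /mnt/finetune-gpt-neox/gpt_finetune'
# Expected files, given the --output-prefix above:
#   mydataset_text_document.bin
#   mydataset_text_document.idx
# The prefix these files share, and therefore the data-path value for training, is:
#   /mnt/finetune-gpt-neox/gpt_finetune/mydataset_text_document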

Finetuning

Important
Run the export DET_MASTER=...ord1.ingress.coreweave.cloud:80 command, found in the post-installation notes of the DeterminedAI deployment, before running the next command.
Download the "Slim" weights by running the following commands in a terminal:
det cmd run 'wget --cut-dirs=5 -nH -r --no-parent --reject "index.html*" https://the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/ -P /mnt/finetune-gpt-neox/20B_checkpoints'
Important
Ensure that the above command completes before proceeding. Depending on your network bandwidth, downloading the weights (about 39GB of data) can take an hour or two. You can monitor the logs of the above command using the logs command: det task logs -f <TASK_NAME_FROM_ABOVE>
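As a rough completeness check once the download finishes, you can compare the size of the checkpoint directory against the expected ~39GB (an optional sketch using the same det cmd run pattern):

det cmd run 'du -sh /mnt/finetune-gpt-neox/20B_checkpoints; ls /mnt/finetune-gpt-neox/20B_checkpoints | head'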

Download the training examples

DeterminedAI provides training examples on GitHub. Clone the DeterminedAI source code from their repository into an accessible path in a terminal:
$ git clone https://github.com/determined-ai/determined.git
The deployment configurations for the experiments and the source code to run the finetuning job are located in the GPT-NeoX example directory under examples/deepspeed/gpt_neox.
Additional Resources
EleutherAI provides a lot of useful information about their available datasets, which may be helpful when configuring datasets and training parameters for tensor and pipeline parallelism when finetuning GPT-NeoX.
Navigate from the root of the determined.ai source code you cloned previously to the examples/deepspeed/gpt_neox directory.
Review and replace the contents of the original determined-cluster.yml file at examples/deepspeed/gpt_neox/gpt_neox_config/determined-cluster.yml with the content below, which configures the cluster for 96 GPUs. You may change any of the optimizer values or training configurations to fit your needs; it is recommended to use the NeoX source code as a reference when doing so.
Click to expand - determined-cluster.yml
(Optional) You can also customize the deployment for 8 GPUs using the configuration below. If you do, ensure that you set slots_per_trial: 8 in the experiment configuration file finetune-gpt-neox.yml described in the next section:
Click to expand - determined-cluster.yml

Create the experiment

Note
You will need to be in the examples/deepspeed/gpt_neox directory.
Copy the configuration below into a file called finetune-gpt-neox.yml.
Click to expand - finetune-gpt-neox.yml
Note
Many of the parameters in the above configuration can be changed, such as batches and slots_per_trial. We use default values of 100 batches to finetune on, with 50 batches before validation or early stopping, and 96 A40 GPUs.
Run the following command to launch the experiment:
det experiment create finetune-gpt-neox.yml .
The experiment is now launched! You can see the status of your experiment and monitor its logs using the Web UI.
The experiment should show an "Active" status, and the Logs tab lets you visualize and monitor the log output as training proceeds.
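You can also follow the run from the command line with the Determined CLI. The IDs below are hypothetical; use the experiment ID printed by det experiment create or shown by det experiment list:

# List experiments and note the ID of the finetuning run
det experiment list

# Show details for the experiment (hypothetical ID 1)
det experiment describe 1

# Follow the logs of one of its trials (hypothetical trial ID 1)
det trial logs -f 1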
Once training is completed, you will have access to the checkpoint in your S3 bucket for downstream tasks such as inference, transfer learning or model ensembles.
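To pull a finished checkpoint out of the bucket for downstream use, you can list and copy it with the same S3-compatible tooling used earlier. The endpoint URL and object layout below are assumptions; the checkpoint UUID can be found on the experiment's checkpoint view in the Web UI:

# List checkpoints written by Determined to the bucket (assumed LAS1 endpoint)
aws s3 ls s3://model-checkpoints/ --recursive --profile coreweave --endpoint-url https://object.las1.coreweave.com | head

# Copy one checkpoint locally, replacing <CHECKPOINT_UUID> with a real checkpoint ID
aws s3 cp s3://model-checkpoints/<CHECKPOINT_UUID>/ ./checkpoint/ --recursive --profile coreweave --endpoint-url https://object.las1.coreweave.com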

(Optional) Wandb.ai visualization of training graphs

Weights & Biases (wandb.ai) can be installed and configured to visualize training graphs.
Pass the <WANDB_GROUP> and <WANDB_TEAM> variables into your configuration file, as in the example below:
name: gpt-neox-zero1-3d-parallel
debug: false
profiling:
  enabled: true
  begin_on_batch: 1
  end_after_batch: 100
  sync_timings: true
hyperparameters:
  search_world_size: false
  conf_dir: /gpt-neox/configs
  conf_file:
    - determined_cluster.yml
  wandb_group: <WANDB_GROUP>
  wandb_team: <WANDB_TEAM>
environment:
  environment_variables:
    - NCCL_DEBUG=INFO
    - NCCL_SOCKET_IFNAME=ens,eth,ib
  force_pull_image: true
  image:
    gpu: liamdetermined/gpt-neox
resources:
  slots_per_trial: 96 # Utilize 96 GPUs for the finetune
searcher:
  name: single
  metric: lm_loss
  smaller_is_better: false
  max_length:
    batches: 1000
min_validation_period:
  batches: 500
max_restarts: 0
entrypoint:
  - python3
  - -m
  - determined.launch.deepspeed
  - --trial
  - gpt2_trial:GPT2Trial
Additional Resources