Fine-tune GPT-NeoX 20B with Determined AI

Learn how to fine-tune a GPT-NeoX 20B parameter model on CoreWeave Cloud

GPT-NeoX is a 20B parameter autoregressive model trained on the Pile dataset.

It generates text based on context or unconditionally for use cases such as story generation, chat bots, summarization, and so on.

This model was trained on CoreWeave infrastructure, and the weights are available under a permissive license. Depending on your requirements and use case, the model is capable of high-quality text generation. Many customers have seen drastically improved results by fine-tuning the model with data specific to their use case.

This guide uses the Determined AI MLOps platform to run distributed fine-tuning jobs on the model.

Prerequisites

This guide assumes that the following are completed in advance.

The values used for this demo's shared filesystem volume are as follows:

Field name      Demo value
Volume Name     finetune-gpt-neox
Region          LAS1
Disk Class      HDD
Storage Type    Shared Filesystem
Size (Gi)       1000

Note

If needed, it is easy to increase the size of a storage volume later.

Attach the filesystem volume

When installing Determined AI, ensure that the newly-created filesystem volume for this demo is attached. From the bottom of the application configuration screen, click + to attach the finetune-gpt-neox volume.

For this tutorial, the finetune-gpt-neox volume is attached at the mount path /mnt/finetune-gpt-neox.

Determined Web UI

After you deploy the Determined AI application, a URL to the Web UI is provided. Navigate to this URL to open the Determined AI Web UI, where you can monitor experiments and check logs.

As an example, here is what a live experiment looks like when viewed from the Web UI.

Navigating to the Logs tab will give you a full output of the experiment's logs:

Navigating to Overview will give you access to a metrics visualization of the experiment and checkpoint reference.

Training

Configure your dataset

Important

Run the export DET_MASTER=...ord1.ingress.coreweave.cloud:80 command, found in the post-installation notes from the DeterminedAI deployment, prior to running the next command.

Clone the GPT-NeoX repository to your CoreWeave Cloud Storage in a terminal:

det cmd run 'git clone https://github.com/EleutherAI/gpt-neox.git /mnt/finetune-gpt-neox/gpt-neox'

Then, download the Vocab and Merge files to your CoreWeave Cloud Storage in a terminal:

det cmd run 'wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -O /mnt/finetune-gpt-neox/gpt2-vocab.json'

det cmd run 'wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -O /mnt/finetune-gpt-neox/gpt2-merges.txt'

det cmd run 'wget https://the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/20B_tokenizer.json -O /mnt/finetune-gpt-neox/20B_tokenizer.json'

Dataset Format

Datasets for GPT-NeoX should be a single large file in JSONL format. To prepare your own dataset for training with custom data, format it so that each line is one JSON object representing a separate document.

The document text should be stored under a single JSON key, i.e., "text".

Example

{"text": "We have received the water well survey for the N. Crawar site. The Crane \nCounty Water District owns 11 water wells within a one mile radius of the \nsite. The nearest well is located about 1000 feet east of the site. It \nappears that all of the wells are completed into the uppermost aquifer and, \nin general, are screened from 50' bgs to total depth at about 140' bgs. The \nshallow water table at the site and in the nearby wells is at about 50' bgs. \nThe groundwater flow direction at the site has consistently been toward due \nsouth. There are no water supply wells within one mile due south of the site. \nThere are two monitor wells at the east side of the site which have always \nproduced clean samples. The remediation system for this site should be \noperational by April 1, 2001. We will also have more current groundwater \nsampling information for the site within the next few weeks.", "meta": {}}
{"text": "Roger-\nWe will require off-site access for the installation of 10 remediation wells \nat the North Crawar facility (formerly owned by TW but now owned by Duke). I \nwill drop off in your office a site diagram that indicates the location of \nthe wells relative to the facility and the existing wells. I believe that the \nadjacent property owner had been contacted prior to well installations in \nAugust 1997, but I am not familiar with the details of the access agreement \nor even who within ET&S made the arrangements. We are shooting for \nearly-October for the well installations. We may also want to address \ncontinued access to the wells in an agreement with the landowner (a 5-10 year \nterm should be sufficient). Give me a call at x67327 if you have any \nquestions.\nThanks,\nGeorge", "meta": {}}
{"text": "Larry, the attached file contains a scanned image of a story that was \npublished in The Monahans News, a weekly paper, on Thursday, April 20, 2000. \nI've shown the story to Bill, and he suggested that you let Rich Jolly know \nabout the story.\n\nThanks, George\n\n\n---------------------- Forwarded by George Robinson/OTS/Enron on 04/24/2000 \n03:29 PM ---------------------------\n\n04/24/2000 03:21 PM\nMichelle Muniz\nMichelle Muniz\nMichelle Muniz\n04/24/2000 03:21 PM\n04/24/2000 03:21 PM\nTo: George Robinson/OTS/Enron@ENRON\ncc:  \n\nSubject: Newspaper\n\nI hope this works.  MM", "meta": {}}

There are several standard datasets that you can leverage for testing.
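As a minimal sketch of the format described above, the following Python writes a list of documents as JSONL and checks that every line parses and carries a string under the "text" key. The helper names write_jsonl and validate_jsonl are hypothetical and are not part of GPT-NeoX; the empty "meta" object simply mirrors the example records.

```python
import json
from pathlib import Path


def write_jsonl(documents, path, key="text"):
    """Write one JSON object per line, with the document text under `key`."""
    with Path(path).open("w", encoding="utf-8") as f:
        for doc in documents:
            # json.dumps escapes embedded newlines as \n, so each
            # document stays on a single physical line.
            f.write(json.dumps({key: doc, "meta": {}}) + "\n")


def validate_jsonl(path, key="text"):
    """Return the number of documents; raise ValueError on a malformed line."""
    count = 0
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"line {line_no}: invalid JSON") from err
            if not isinstance(record, dict) or not isinstance(record.get(key), str):
                raise ValueError(f"line {line_no}: missing string field {key!r}")
            count += 1
    return count
```

Running validate_jsonl on data.jsonl before uploading it can catch formatting problems earlier than the tokenization step would.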

Pre-processing your dataset

Upload your data as a single JSONL file called data.jsonl to FileBrowser under finetune-gpt-neox.

Using the FileBrowser app, create a new folder called gpt_finetune under the finetune-gpt-neox folder.

You can now pre-tokenize your data using tools/preprocess_data.py. The arguments for this utility are listed below.

preprocess_data.py Arguments
usage: preprocess_data.py [-h] --input INPUT [--jsonl-keys JSONL_KEYS [JSONL_KEYS ...]] [--num-docs NUM_DOCS] --tokenizer-type {HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer} [--vocab-file VOCAB_FILE] [--merge-file MERGE_FILE] [--append-eod] [--ftfy] --output-prefix OUTPUT_PREFIX
                          [--dataset-impl {lazy,cached,mmap}] [--workers WORKERS] [--log-interval LOG_INTERVAL]

optional arguments:
  -h, --help            show this help message and exit

input data:
  --input INPUT         Path to input jsonl files or lmd archive(s) - if using multiple archives, put them in a comma separated list
  --jsonl-keys JSONL_KEYS [JSONL_KEYS ...]
                        space separate listed of keys to extract from jsonl. Defa
  --num-docs NUM_DOCS   Optional: Number of documents in the input data (if known) for an accurate progress bar.

tokenizer:
  --tokenizer-type {HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer}
                        What type of tokenizer to use.
  --vocab-file VOCAB_FILE
                        Path to the vocab file
  --merge-file MERGE_FILE
                        Path to the BPE merge file (if necessary).
  --append-eod          Append an <eod> token to the end of a document.
  --ftfy                Use ftfy to clean text

output data:
  --output-prefix OUTPUT_PREFIX
                        Path to binary output file without suffix
  --dataset-impl {lazy,cached,mmap}
                        Dataset implementation to use. Default: mmap

runtime:
  --workers WORKERS     Number of worker processes to launch
  --log-interval LOG_INTERVAL
                        Interval between progress updates

The command to tokenize your data and write the output to the gpt_finetune folder, run from the root of the cloned gpt-neox repository, is below:

python tools/preprocess_data.py \
            --input /mnt/finetune-gpt-neox/data.jsonl \
            --output-prefix /mnt/finetune-gpt-neox/gpt_finetune/mydataset \
            --vocab-file /mnt/finetune-gpt-neox/20B_tokenizer.json \
            --tokenizer-type HFTokenizer

To run it on the cluster with its dependencies installed, wrap it in det cmd run:

det cmd run 'apt-get -y install libopenmpi-dev;
            pip install -r /mnt/finetune-gpt-neox/gpt-neox/requirements/requirements.txt;
            cd /mnt/finetune-gpt-neox/gpt-neox;
            python tools/preprocess_data.py \
            --input /mnt/finetune-gpt-neox/data.jsonl \
            --output-prefix /mnt/finetune-gpt-neox/gpt_finetune/mydataset \
            --vocab-file /mnt/finetune-gpt-neox/20B_tokenizer.json \
            --tokenizer-type HFTokenizer'
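If you tokenize several datasets, it may help to assemble the preprocess_data.py invocation programmatically. The sketch below only builds and prints the command and does not run it. The function build_preprocess_command is a hypothetical helper; the flags are taken from the usage text above, and the output prefix is an assumption chosen to match the data-path value used in the training configuration.

```python
import shlex


def build_preprocess_command(input_path, output_prefix, vocab_file,
                             tokenizer_type="HFTokenizer",
                             script="tools/preprocess_data.py"):
    """Return the preprocess_data.py invocation as an argument list."""
    return [
        "python", script,
        "--input", input_path,
        "--output-prefix", output_prefix,
        "--vocab-file", vocab_file,
        "--tokenizer-type", tokenizer_type,
    ]


cmd = build_preprocess_command(
    input_path="/mnt/finetune-gpt-neox/data.jsonl",
    # Assumed prefix: the gpt_finetune folder on the shared volume.
    output_prefix="/mnt/finetune-gpt-neox/gpt_finetune/mydataset",
    vocab_file="/mnt/finetune-gpt-neox/20B_tokenizer.json",
)
# shlex.join quotes each argument so the result can be pasted into a shell.
print(shlex.join(cmd))
```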

Important

Tokenized data will be saved out to two files:

<data-dir>/<dataset-name>/<dataset-name>_text_document.bin and <data-dir>/<dataset-name>/<dataset-name>_text_document.idx.

You will need to add the prefix that both these files share to your training configuration file under the data-path field.
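The shared prefix is the path of either file with its .bin or .idx extension removed. As a small sketch, the hypothetical helper below (not part of GPT-NeoX) derives that prefix from the .bin path and confirms that the matching .idx file exists alongside it.

```python
from pathlib import Path


def data_path_from_bin(bin_file):
    """Return the shared prefix of a .bin/.idx pair, checking both files exist."""
    bin_file = Path(bin_file)
    if bin_file.suffix != ".bin":
        raise ValueError(f"expected a .bin file, got {bin_file.name}")
    idx_file = bin_file.with_suffix(".idx")
    for candidate in (bin_file, idx_file):
        if not candidate.is_file():
            raise FileNotFoundError(candidate)
    # Dropping the extension leaves the prefix used for the data-path field.
    return str(bin_file.with_suffix(""))
```

For a file named mydataset_text_document.bin, the returned value ends in mydataset_text_document, which is the form the data-path field expects.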

You should see the data here similar to below:

Finetuning

Important

Run the export DET_MASTER=...ord1.ingress.coreweave.cloud:80 command, found in the post-installation notes from the DeterminedAI deployment, prior to running the next command.

Download the "Slim" weights by running the following commands in a terminal:

det cmd run 'wget --cut-dirs=5 -nH -r --no-parent --reject "index.html*" https://the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/ -P /mnt/finetune-gpt-neox/20B_checkpoints'

Important

Wait for the above command to finish before continuing. Depending on your network bandwidth, downloading the 39GB of weights can take an hour or two. You can monitor the logs of the above command using the logs command: det task logs -f <TASK_NAME_FROM_ABOVE>
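One rough way to confirm that the download finished is to total the size of the files in the checkpoint directory and compare it against the roughly 39GB mentioned above. The sketch below is only an illustration: directory_size_gb is a hypothetical helper, and it reports decimal gigabytes, which may differ slightly from the unit the figure above uses.

```python
from pathlib import Path


def directory_size_gb(root):
    """Return the total size of all regular files under `root`, in decimal GB."""
    total_bytes = sum(
        p.stat().st_size for p in Path(root).rglob("*") if p.is_file()
    )
    return total_bytes / 1e9
```

From a process that has the volume mounted, directory_size_gb("/mnt/finetune-gpt-neox/20B_checkpoints") should then return a value close to 39.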

Download the training examples

DeterminedAI provides training examples on GitHub. In a terminal, clone the DeterminedAI source code from their repository to an accessible path:

git clone https://github.com/determined-ai/determined.git

The deployment configurations for the experiments and the source code to run the finetuning job are located in the GPT-NeoX example directory under examples/deepspeed/gpt_neox.

Additional Resources

EleutherAI publishes detailed information about the datasets they provide, which may be helpful when configuring datasets and the training parameters for tensor and pipeline parallelism when fine-tuning with GPT-NeoX.

Navigate from the root of the determined.ai source code you cloned previously to the examples/deepspeed/gpt_neox directory.

Review the original determined_cluster.yml file in examples/deepspeed/gpt_neox/gpt_neox_config/, then replace its contents with the content below to configure the cluster for 96 GPUs. You may change any of the optimizer values or training configurations to suit your needs. When doing so, it is recommended to use the NeoX source code as a reference.

determined_cluster.yml
{
  # Tokenizer /  checkpoint settings - you will need to change these to the location you have them saved in
  "vocab-file": "/mnt/finetune-gpt-neox/20B_checkpoints/20B_tokenizer.json",
  
  # NOTE: You can make additional directories to load and save checkpoints
  "load": "/mnt/finetune-gpt-neox/20B_checkpoints",
  "save": "/mnt/finetune-gpt-neox/20B_checkpoints",

  # NOTE: This is the default dataset. Please change it to your dataset.
  "data-path": "/mnt/finetune-gpt-neox/gpt_finetune/mydataset_text_document",

  # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages
  # across the node boundaries )
  "pipe-parallel-size": 4,
  "model-parallel-size": 2,
  "finetune": true, 

  # model settings
  "num-layers": 44,
  "hidden-size": 6144,
  "num-attention-heads": 64,
  "seq-length": 2048,
  "max-position-embeddings": 2048,
  "norm": "layernorm",
  "pos-emb": "rotary",
  "rotary_pct": 0.25,
  "no-weight-tying": true,
  "gpt_j_residual": true,
  "output_layer_parallelism": "column",
  "scaled-upper-triang-masked-softmax-fusion": true,
  "bias-gelu-fusion": true,

  # init methods
  "init_method": "small_init",
  "output_layer_init_method": "wang_init",

  # optimizer settings
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.97e-4,
      "betas": [0.9, 0.95],
      "eps": 1.0e-8
    }
  },

  "min_lr": 0.97e-5,
  "zero_optimization": {
    "stage": 1,
    "allgather_partitions": true,
    "allgather_bucket_size": 1260000000,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 1260000000,
    "contiguous_gradients": true,
    "cpu_offload": false
  },

  # batch / data settings (assuming 96 GPUs)
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 32,
  "data-impl": "mmap",
  "split": "995,4,1",

  # activation checkpointing
  "checkpoint-activations": true,
  "checkpoint-num-layers": 1,
  "partition-activations": false,
  "synchronize-each-layer": true,

  # regularization
  "gradient_clipping": 1.0,
  "weight-decay": 0.01,
  "hidden-dropout": 0,
  "attention-dropout": 0,

  # precision settings
  "fp16": {
    "fp16": true,
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 12,
    "hysteresis": 2,
    "min_loss_scale": 1
    },

  # misc. training settings
  "train-iters": 150000,
  "lr-decay-iters": 150000,

  "distributed-backend": "nccl",
  "lr-decay-style": "cosine",
  "warmup": 0.01,
  "save-interval": 50,
  "eval-interval": 100,
  "eval-iters": 10,

  # logging
  "log-interval": 2,
  "steps_per_print": 2,
  "wall_clock_breakdown": false,

  ### NEW DATA: ####
  "tokenizer_type": "HFTokenizer",
  "tensorboard-dir": "./tensorboard",
  "log-dir": "./logs",

}
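The parallelism and batch settings in this configuration interact, so it can help to work through the arithmetic before changing the GPU count. The sketch below assumes the usual 3D-parallel convention, in which the data-parallel degree is the GPU count divided by the product of the pipeline-parallel and model-parallel sizes, and the global batch is the micro batch times the accumulation steps times the data-parallel degree. The function batch_layout is a hypothetical helper, not part of GPT-NeoX.

```python
def batch_layout(num_gpus, pipe_parallel, model_parallel,
                 micro_batch, grad_accum, seq_length):
    """Derive data-parallel degree, global batch size, and tokens per step."""
    gpus_per_replica = pipe_parallel * model_parallel
    if num_gpus % gpus_per_replica != 0:
        raise ValueError(
            f"{num_gpus} GPUs cannot be split into replicas of {gpus_per_replica}"
        )
    data_parallel = num_gpus // gpus_per_replica
    global_batch = micro_batch * grad_accum * data_parallel
    return {
        "data_parallel": data_parallel,
        "global_batch": global_batch,
        "tokens_per_step": global_batch * seq_length,
    }


# Values taken from the configuration above, with 96 GPUs.
layout = batch_layout(num_gpus=96, pipe_parallel=4, model_parallel=2,
                      micro_batch=4, grad_accum=32, seq_length=2048)
print(layout)
# {'data_parallel': 12, 'global_batch': 1536, 'tokens_per_step': 3145728}
```

Under these assumptions, a GPU count that is not a multiple of 8 (4 pipeline stages times 2 model-parallel ranks) cannot be split evenly into model replicas, which is why the helper raises an error in that case.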

Create the experiment

Note

You will need to be in the examples/deepspeed/gpt_neox directory.

Copy the configuration below into a file called finetune-gpt-neox.yml.

finetune-gpt-neox.yml
name: gpt-neox-zero1-3d-parallel
debug: false
profiling:
    enabled: true
    begin_on_batch: 1
    end_after_batch: 100
    sync_timings: true
hyperparameters:
  search_world_size: false
  conf_dir: /gpt-neox/configs
  conf_file:
      - determined_cluster.yml
  overwrite_values:
    pipe_parallel_size: 4
  wandb_group: null
  wandb_team: null
  user_script: null
  eval_tasks: null
environment:
  environment_variables:
      - NCCL_SOCKET_IFNAME=ens,eth,ib
  force_pull_image: true
  image:
    gpu: liamdetermined/gpt-neox
resources:
  slots_per_trial: 96 # Utilize 96 GPUs for the finetune
searcher:
  name: single
  metric: lm_loss
  smaller_is_better: true
  max_length:
    batches: 100
min_validation_period:
    batches: 50
max_restarts: 0
entrypoint:
  - python3
  - -m
  - determined.launch.deepspeed
  - --trial
  - gpt2_trial:GPT2Trial

Note

Many of the parameters in the above configuration can be changed, such as batches and slots_per_trial. We use default values of 100 batches to fine-tune on, with 50 batches before validation or early stopping, and 96 A40 GPUs.

Run the following command to launch the experiment:

det experiment create finetune-gpt-neox.yml .

The experiment is now launched! You can see the status of your experiment and monitor logs as well using the Web UI.

You should see an "Active" status for your experiment:

You can visualize and monitor logs:

Once training is completed, you will have access to the checkpoint in your S3 bucket for downstream tasks such as inference, transfer learning or model ensembles.

(Optional) Wandb.ai visualization of training graphs

Weights & Biases AI (Wandb.ai) can be installed and configured to visualize training graphs.

Set the wandb_group and wandb_team hyperparameters in your configuration file, replacing <WANDB_GROUP> and <WANDB_TEAM> with your own values:

name: gpt-neox-zero1-3d-parallel
debug: false
profiling:
    enabled: true
    begin_on_batch: 1
    end_after_batch: 100
    sync_timings: true
hyperparameters:
  search_world_size: false
  conf_dir: /gpt-neox/configs
  conf_file:
      - determined_cluster.yml
  wandb_group: <WANDB_GROUP>
  wandb_team: <WANDB_TEAM>
environment:
  environment_variables:
      - NCCL_DEBUG=INFO
      - NCCL_SOCKET_IFNAME=ens,eth,ib
  force_pull_image: true
  image:
    gpu: liamdetermined/gpt-neox
resources:
  slots_per_trial: 96 # Utilize 96 GPUs for the finetune
searcher:
  name: single
  metric: lm_loss
  smaller_is_better: true
  max_length:
    batches: 100
min_validation_period:
    batches: 50
max_restarts: 0
entrypoint:
  - python3
  - -m
  - determined.launch.deepspeed
  - --trial
  - gpt2_trial:GPT2Trial
