Launching GPT DeepSpeed Models using DeterminedAI

Launch a GPT DeepSpeed model using DeterminedAI on CoreWeave Cloud
DeepSpeed is an open-source deep learning optimization library for PyTorch. It is designed for low-latency, high-throughput training, and reduces the compute power and memory required to train large distributed models.
In the example below, a minimal GPT-NeoX DeepSpeed distributed training job is launched without additional DeterminedAI features such as experiment tracking, metrics, and visualization.
This guide makes several assumptions:
  • You have set up the CoreWeave Kubernetes environment.
  • You have some experience launching and using DeterminedAI on CoreWeave Cloud. (If you have not done so already, it is recommended to deploy DeterminedAI via the application Catalog to familiarize yourself with it.)
  • You have git installed on your terminal.


Clone the files

To follow along with this example, first clone the CoreWeave GPT DeepSpeed repository locally:
$ git clone --recurse-submodules

Install DeterminedAI

To install DeterminedAI, log in to your CoreWeave Cloud account and navigate to the applications Catalog. From here, search for and locate the DeterminedAI application. Click into it to configure the instance from the deployment screen, then deploy the instance into your cluster by clicking the Deploy button.
Once the instance is shown as Ready, you may proceed with the experiment.
In the configuration file gpt_neox_config/small.yml, change the vocab_path, data_path, load, and save parameters to the appropriate paths.
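As a sketch, the relevant keys in gpt_neox_config/small.yml might look like the following. The paths shown here are hypothetical; substitute the mount paths used in your cluster:

```yaml
# Hypothetical paths -- replace with the locations of your own
# vocabulary file, tokenized dataset, and checkpoint directories.
"vocab_path": "/mnt/finetune-gpt-neox/gpt2-vocab.json"
"data_path": "/mnt/finetune-gpt-neox/data/mydataset_text_document"
"load": "/mnt/finetune-gpt-neox/checkpoints"
"save": "/mnt/finetune-gpt-neox/checkpoints"
```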

The launcher configuration file

The launcher configuration file provided in this demo (launcher.yml) exposes the overall configuration parameters for the experiment.
DeterminedAI uses its own fork of DeepSpeed, so using that image is recommended.
gpu: liamdetermined/gpt-neox
In this example, we use a wrapper around DeepSpeed called determined.launch.deepspeed to allow for safe handling of node failure and shutdown:
- python3
- -m
- determined.launch.deepspeed

Mount path for host file

The default mount path for the hostfile is defined as:
shared_hostfile = "/mnt/finetune-gpt-neox/hostfile.txt"
Configure this hostfile path to match your mount path.
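DeepSpeed hostfiles list one worker per line along with the number of GPU slots available on that worker. As a sketch, a hostfile for two 8-GPU nodes (the hostnames here are hypothetical) looks like:

```
worker-0 slots=8
worker-1 slots=8
```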


The Dockerfile

The Dockerfile provided in this experiment builds the Docker image used to run the experiment in the cluster. The image may be built manually if customizations are desired.
The Dockerfile uses the following:
  • Python 3.8
  • PyTorch 1.12.1
  • CUDA 11.6
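A minimal sketch of such a Dockerfile is shown below. The base image tag and installed packages are illustrative, not the exact ones used in the repository:

```dockerfile
# Illustrative base image matching Python 3.8 / PyTorch 1.12.1 / CUDA 11.6
FROM pytorch/pytorch:1.12.1-cuda11.6-cudnn8-devel

# Install Determined; the repository's actual image installs
# DeterminedAI's fork of DeepSpeed rather than the upstream package.
RUN pip install --no-cache-dir determined

WORKDIR /workspace
COPY . /workspace
```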

Launch the experiment

To run the experiment, invoke det experiment create from the root of the cloned repository.
$ det experiment create core_api.yml .


Logs and metrics

Logs for this experiment may be tracked using the DeterminedAI Web UI, but metrics may also be visualized using Weights & Biases (WandB). To use WandB, pass your WandB API key via an environment variable called WANDB_API_KEY, or modify the function get_wandb_api_key() to return your API token.
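One way to supply the key, assuming a standard Determined experiment configuration file, is through the environment.environment_variables section (the value shown is a placeholder):

```yaml
environment:
  environment_variables:
    # Placeholder -- substitute your actual WandB API key
    - WANDB_API_KEY=<your-api-key>
```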
To configure your DeepSpeed experiment to run on multiple nodes, set the slots_per_trial option to the number of GPUs you require. The maximum number of GPUs per node on CoreWeave is 8, so the experiment becomes multi-node once it exceeds this threshold.
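For example, a sketch of requesting 16 GPUs, which spans two 8-GPU nodes, in the experiment configuration:

```yaml
resources:
  # 16 slots exceeds the 8 GPUs available per node, so this
  # trial is scheduled across two nodes
  slots_per_trial: 16
```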