Launching GPT DeepSpeed Models using DeterminedAI
Launch a GPT DeepSpeed model using DeterminedAI on CoreWeave Cloud
In the example below, a minimal GPT-NeoX DeepSpeed distributed training job is launched without additional features such as tracking, metrics, and visualization that DeterminedAI offers.

To begin, clone the demonstration repository, including its submodules:

$ git clone --recurse-submodules https://github.com/coreweave/gpt-det-deepseed.git
To install DeterminedAI, log in to your CoreWeave Cloud account and navigate to the Applications Catalog. From here, search for and locate the DeterminedAI application. Click into it to configure the instance from the deployment screen, then deploy the instance into your cluster by clicking the Deploy button.
Once the instance is shown as Ready, you may proceed with the experiment.
In the configuration file gpt_neox_config/small.yml, change the save parameters to the appropriate paths.
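As an illustration, the relevant entries might look like the following sketch; the exact keys and paths are assumptions, so consult the small.yml in your clone for the authoritative names, and substitute paths that point at your mounted shared storage:

```yaml
# Hypothetical excerpt from gpt_neox_config/small.yml; the key names and
# paths shown here are placeholders, not the repository's actual values.
"save": "/mnt/finetune-gpt-neox/checkpoints",
"load": "/mnt/finetune-gpt-neox/checkpoints",
```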
The launcher configuration file provided in this demo (launcher.yml) exposes the overall configuration parameters for the experiment.
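A Determined experiment configuration generally names the experiment, declares its resources, and defines an entrypoint. The sketch below is illustrative only, assuming typical field values rather than the repository's actual settings:

```yaml
# Hypothetical sketch of a Determined experiment configuration; all
# values are assumptions for illustration, not the repository's settings.
name: gpt-neox-deepspeed
resources:
  slots_per_trial: 8          # number of GPUs for the trial (assumption)
searcher:
  name: single
  metric: loss
  max_length:
    batches: 1000             # illustrative value
entrypoint: >-
  python3 -m determined.launch.deepspeed
  python3 train_deepspeed_launcher.py
```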
In this example, we're using a wrapper around DeepSpeed called determined.launch.deepspeed in order to allow for safe handling of node failure and shutdown.
In train_deepspeed_launcher.py, the default mount path is defined as:
shared_hostfile = "/mnt/finetune-gpt-neox/hostfile.txt"
Change this hostfile path to match your mount path.
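For reference, a DeepSpeed hostfile lists one worker node per line along with the number of GPU slots it provides. The hostnames and slot counts below are placeholders:

```
# Hypothetical hostfile contents: one line per node, with the number of
# GPUs (slots) available on that node.
node-0 slots=8
node-1 slots=8
```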
The Dockerfile provided with this experiment builds the Docker image used to run the experiment in the cluster. The image may be built manually if customizations are desired.
The Dockerfile uses the following:
- Python 3.8
- PyTorch 1.12.1
- CUDA 11.6
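As an illustration only, an image with that software stack could start from a matching PyTorch base image, as in the sketch below; the base image tag and build steps are assumptions, and the Dockerfile in the repository is authoritative:

```dockerfile
# Hypothetical sketch; consult the Dockerfile shipped in the repository
# for the authoritative build steps.
FROM pytorch/pytorch:1.12.1-cuda11.6-cudnn8-devel

# Install the Python dependencies required by the experiment.
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt

# Copy the experiment code into the image.
COPY . /workspace
WORKDIR /workspace
```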
To run the experiment, invoke det experiment create from the root of the cloned repository.
$ det experiment create core_api.yml .
Logs for this experiment may be tracked using the DeterminedAI Web UI, but metrics may also be visualized using Weights & Biases (WandB). To use WandB, pass your WandB API key in an environment variable called WANDB_API_KEY, or modify deepy.py to return your API token.
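As a minimal sketch, the environment variable can be exported in the shell before creating the experiment; the key value below is a placeholder to be replaced with your actual key from your WandB account settings:

```shell
# Set the WandB API key as an environment variable (placeholder value;
# substitute your actual key before launching the experiment).
export WANDB_API_KEY="your-wandb-api-key"

# Confirm the variable is set.
echo "WANDB_API_KEY is ${WANDB_API_KEY:+set}"
```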