Launch GPT DeepSpeed Models using Determined AI

Launch a GPT DeepSpeed model using Determined AI on CoreWeave Cloud

DeepSpeed is an open-source deep learning library for PyTorch optimized for low latency and high throughput training, designed to reduce compute power and memory required to train large distributed models.

In the example below, a minimal GPT-NeoX DeepSpeed distributed training job is launched without the additional features such as tracking, metrics, and visualization that Determined AI offers.

Tutorial source code

To follow along with this example, first clone the CoreWeave GPT DeepSpeed repository to your workstation:

$ git clone --recurse-submodules


This guide assumes that the following are completed in advance.


The launcher configuration file

The launcher.yml configuration file provided in this demo exposes the overall configuration parameters for the experiment.


Determined AI uses its own fork of DeepSpeed, so using the image below is recommended.

gpu: liamdetermined/gpt-neox

In this example, a wrapper around DeepSpeed called determined.launch.deepspeed allows for safe handling of node failure and shutdown.

- python3
- -m
- determined.launch.deepspeed

Mount path for host file

The default mount path is defined as:

shared_hostfile = "/mnt/finetune-gpt-neox/hostfile.txt"

Configure this hostfile path to your mount path.
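For context, a DeepSpeed hostfile lists one node per line in the form `hostname slots=N`, where N is the number of GPUs available on that node. The following is a minimal sketch of producing such a file at a configurable path. The helper names `format_hostfile` and `write_hostfile` and the node names are hypothetical, not part of the repository, and the launcher in this example may already generate the file for you.

```python
import tempfile
from pathlib import Path


def format_hostfile(hosts):
    """Return hostfile text with one '<hostname> slots=<gpus>' line per node."""
    return "".join(f"{name} slots={slots}\n" for name, slots in hosts.items())


def write_hostfile(path, hosts):
    """Write the hostfile to `path`, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(format_hostfile(hosts))
    return path


if __name__ == "__main__":
    # Hypothetical two-node cluster with 8 GPUs per node.
    example_hosts = {"node-0": 8, "node-1": 8}
    # A temporary directory stands in for the real mount path here.
    with tempfile.TemporaryDirectory() as tmp:
        written = write_hostfile(Path(tmp) / "hostfile.txt", example_hosts)
        print(written.read_text(), end="")
```

In a real run, the path passed to the helper would be the value assigned to shared_hostfile, pointing at storage that every node can read.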


The Dockerfile provided in this experiment is used to build the Docker image needed to run the experiment in the cluster. The image may be manually built if customizations are desired.

The Dockerfile uses the following:

  • Python 3.8
  • PyTorch 1.12.1
  • CUDA 11.6
Example Dockerfile
FROM coreweave/nccl-tests:2022-09-28_16-34-19.392_EDT

ENV DET_PYTHON_EXECUTABLE="/usr/bin/python3.8"

# Run updates and install packages for build
RUN echo 'Dpkg::Options { "--force-confdef"; "--force-confnew"; };' > /etc/apt/apt.conf.d/local
RUN apt-get -qq update && \
apt-get -qq install -y --no-install-recommends software-properties-common && \
add-apt-repository ppa:deadsnakes/ppa -y && \
add-apt-repository universe && \
apt-get -qq update && \
DEBIAN_FRONTEND=noninteractive apt-get install -y curl tzdata build-essential daemontools && \
apt-get install -y --no-install-recommends \
python3.8 \
python3.8-distutils \
python3.8-dev \
python3.8-venv \
git && \
apt-get clean

# python3.8 -m ensurepip --default-pip && \
RUN curl -o
RUN python3.8
RUN python3.8 -m pip install --no-cache-dir --upgrade pip


RUN python3.8 -m pip install --no-cache-dir torch==${PYTORCH_VERSION}+cu${TORCH_CUDA} \

RUN python3.8 -m pip install --no-cache-dir packaging

RUN mkdir -p /tmp/build && \
cd /tmp/build && \
git clone && \
cd apex && \
python3.8 -m pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ && \
cd /tmp && \
rm -r /tmp/build

#### Python packages
RUN python3.8 -m pip install --no-cache-dir determined==0.19.2
COPY requirements/requirements.txt .
RUN python3.8 -m pip install --no-cache-dir -r requirements.txt
COPY requirements/requirements-onebitadam.txt .
RUN python3.8 -m pip install --no-cache-dir -r requirements-onebitadam.txt
COPY requirements/requirements-sparseattention.txt .
RUN python3.8 -m pip install -r requirements-sparseattention.txt
RUN python3.8 -m pip install --no-cache-dir pybind11
RUN python3.8 -m pip install --no-cache-dir protobuf==3.19.4
RUN update-alternatives --install /usr/bin/python3 python /usr/bin/python3.8 2
RUN echo 2 | update-alternatives --config python

Launch the experiment

To run the experiment, invoke det experiment create from the root of the cloned repository.

$ det experiment create core_api.yml .


You can track logs for this experiment in the Determined AI web UI and visualize metrics using Weights & Biases (WandB). To use WandB, set the WANDB_API_KEY environment variable to your WandB API key, or modify the function get_wandb_api_key() to return your API token.
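The environment-variable approach can be sketched as follows. This is only an illustration of one way a function named get_wandb_api_key() might behave; the body of the repository's actual function is not shown in this guide, and the `env` parameter is an assumption added here to make the sketch easy to test.

```python
import os


def get_wandb_api_key(env=os.environ):
    """Return the WandB API key from WANDB_API_KEY, or raise if it is unset."""
    key = env.get("WANDB_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "WANDB_API_KEY is not set; export it before launching the experiment."
        )
    return key
```

Reading the key from the environment keeps the token out of source control, which is usually preferable to hardcoding it in the function.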


To configure your DeepSpeed experiment to run on multiple nodes, change the slots_per_trial option to the number of GPUs you require. The maximum number of GPUs per node on CoreWeave is 8, so the experiment becomes multi-node once the requested number of GPUs exceeds 8.
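As a rough illustration of how the requested GPU count maps to nodes, the sketch below computes the minimum number of nodes for a given slots_per_trial value, assuming 8 GPUs per node. The function name `nodes_required` is hypothetical, and the scheduler may apply placement rules beyond this simple ceiling division.

```python
import math

GPUS_PER_NODE = 8  # maximum GPUs per node on CoreWeave, as stated in this guide


def nodes_required(slots_per_trial, gpus_per_node=GPUS_PER_NODE):
    """Return the minimum number of nodes needed to provide the requested GPUs."""
    if slots_per_trial < 1:
        raise ValueError("slots_per_trial must be at least 1")
    return math.ceil(slots_per_trial / gpus_per_node)


if __name__ == "__main__":
    for slots in (1, 8, 9, 16):
        print(f"slots_per_trial={slots} needs {nodes_required(slots)} node(s)")
```

Under this assumption, any request of up to 8 GPUs fits on a single node, and a request of 16 GPUs spans two.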