ML container images - CoreWeave Docs

CoreWeave maintains a set of optimized machine learning container images, tuned for the CoreWeave platform, that you can use as a starting point for distributed training and other GPU workloads. This page describes the available images, what each one contains, and how to use them on CoreWeave Kubernetes Service (CKS) and SUNK. The images are published to ghcr.io/coreweave/ml-containers and built from the public coreweave/ml-containers repository, where you can inspect the Dockerfiles to see exactly what each image installs.

Available images

The following PyTorch images are the recommended starting points for most customers.

Image	Description	Recommended for
`torch`	A custom build of PyTorch, torchvision, and torchaudio tuned for the CoreWeave platform.	Workloads that need only the core PyTorch stack in a smaller image.
`torch-extras`	The `torch` image plus a set of common PyTorch extensions.	Distributed training and LLM training. This is the recommended default.
`nightly-torch`	An experimental, daily release channel that tracks the latest development versions of PyTorch.	Testing the latest features, not production.
`nightly-torch-extras`	The PyTorch extensions built on top of `nightly-torch`.	Testing the latest features, not production.

For most training workloads, start with torch-extras. If you want a smaller image with only the core PyTorch stack, use torch. Use the nightly images only for testing.

To browse every published image and its tags, see the packages list.

PyTorch base images (torch)

The ml-containers/torch image contains custom builds of PyTorch, torchvision, and torchaudio, each tuned for use on the CoreWeave platform. Each image is built on an Ubuntu LTS release. The image tag indicates the Ubuntu version, which in turn determines the Python version.

Image variants

CoreWeave builds two variants of the torch image. Both variants are also available for torch-extras.

base: Includes only the essentials (CUDA, torch, torchvision, and torchaudio). This variant has a small image size, which makes it fast to launch.
nccl: Includes the development libraries and build tools, such as nvcc, that are required to compile other PyTorch extensions. This variant is larger than base.

The nccl variant is built on component libraries optimized for the CoreWeave platform. For more details, see coreweave/nccl-tests.

PyTorch extras (torch-extras)

The ml-containers/torch-extras image extends the torch image with a set of common PyTorch extensions, including DeepSpeed, xformers, and NVIDIA Apex. (FlashAttention is already included in the torch image.) Each extension is compiled against the custom PyTorch builds in the torch image. For the complete, current list of included extensions, see the coreweave/ml-containers repository.

Both the base and nccl variants are available for torch-extras, matching those provided for torch. The base variant stays small because it uses a multi-stage build: the extensions are compiled with the CUDA development libraries, but those libraries are left out of the final image.

Customers running supervised fine-tuning, reinforcement learning, pretraining, or any multi-node PyTorch training should start with torch-extras.

Nightly images

The nightly-torch image is an experimental, nightly release channel of the PyTorch base images, in the style of PyTorch’s own nightly preview builds. It features the latest development versions of torch, torchvision, and torchaudio, pulled daily and compiled from source. The nightly-torch-extras image builds the PyTorch extensions on top of nightly-torch.

The nightly images are based on unstable, experimental preview builds of PyTorch and can contain bugs and other issues. For production workloads, use the torch or torch-extras images instead.

Choose an image tag

Image tags encode the component versions in each build. For example:

8a60b2d-nccl-cuda12.9.1-ubuntu22.04-nccl2.28.3-1-torch2.8.0-vision0.23.0-audio2.8.0-abi1

Key fields in a tag include:

The variant, either base or nccl.
The CUDA version, for example cuda12.9.1.
The Ubuntu version, for example ubuntu22.04.
The PyTorch, torchvision, and torchaudio versions, for example torch2.8.0, vision0.23.0, and audio2.8.0.
The NCCL version and the ABI version, for example nccl2.28.3-1 and abi1.

Because tags change as CoreWeave publishes new builds, always get the current tag from the packages list.
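As a rough illustration, the fields listed above can be pulled out of a tag with a regular expression. This is a minimal sketch inferred only from the single example tag in this section. The function name parse_image_tag is hypothetical, the leading hexadecimal field is treated as an opaque build identifier (an assumption), and tags for other images, such as torch-extras or the nightly images, may carry fields that this pattern does not cover.

```python
import re

# Pattern inferred from the one example tag shown in this section.
# Assumption: the leading hexadecimal field is an opaque build identifier.
TAG_PATTERN = re.compile(
    r"^(?P<build_id>[0-9a-f]+)"
    r"-(?P<variant>base|nccl)"
    r"-cuda(?P<cuda>\d+(?:\.\d+)*)"
    r"-ubuntu(?P<ubuntu>\d+\.\d+)"
    r"(?:-nccl(?P<nccl>[\d.]+(?:-\d+)?))?"  # treated as optional here
    r"-torch(?P<torch>[^-]+)"
    r"-vision(?P<vision>[^-]+)"
    r"-audio(?P<audio>[^-]+)"
    r"(?:-abi(?P<abi>\d+))?$"               # treated as optional here
)


def parse_image_tag(tag: str) -> dict:
    """Split a torch image tag into its named fields.

    Raises ValueError when the tag does not follow the example's layout.
    """
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognized tag layout: {tag!r}")
    return match.groupdict()


if __name__ == "__main__":
    example = (
        "8a60b2d-nccl-cuda12.9.1-ubuntu22.04-nccl2.28.3-1"
        "-torch2.8.0-vision0.23.0-audio2.8.0-abi1"
    )
    for name, value in parse_image_tag(example).items():
        print(f"{name}: {value}")
```

A tag that does not match, such as a bare alias, raises ValueError rather than returning partial fields, so a changed tag layout is noticed instead of silently misread.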

Match the CUDA version to your GPU driver

Choose an image whose CUDA version is compatible with the GPU driver on your nodes. Don’t assume the newest image is the right one. A recently published image can use a CUDA version that’s newer than your nodes’ driver supports. When this happens, workloads can fail to start with driver-compatibility errors. You can check the driver version on a node by running nvidia-smi. The header of its output also reports the highest CUDA version that the driver supports.
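The comparison can be scripted. The sketch below reads the "CUDA Version: X.Y" field from the nvidia-smi banner and compares it with the cudaX.Y field of an image tag. All function names are hypothetical. The check is deliberately conservative and is an assumption about what counts as compatible: NVIDIA's compatibility rules can permit some combinations that this simple comparison rejects, so treat a False result as a prompt to check further rather than a verdict.

```python
import re
import subprocess
from typing import Tuple


def read_nvidia_smi() -> str:
    """Run nvidia-smi on the current node and return its text output."""
    return subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=True
    ).stdout


def driver_cuda_version(nvidia_smi_output: str) -> Tuple[int, int]:
    """Extract (major, minor) from the 'CUDA Version: X.Y' banner field."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        raise ValueError("no 'CUDA Version: X.Y' field in nvidia-smi output")
    return int(match.group(1)), int(match.group(2))


def image_cuda_version(tag: str) -> Tuple[int, int]:
    """Extract (major, minor) from the 'cudaX.Y[.Z]' field of an image tag."""
    match = re.search(r"cuda(\d+)\.(\d+)", tag)
    if match is None:
        raise ValueError(f"no cuda field in tag: {tag!r}")
    return int(match.group(1)), int(match.group(2))


def image_fits_driver(tag: str, nvidia_smi_output: str) -> bool:
    """Conservative check: image CUDA must not exceed the driver's CUDA."""
    return image_cuda_version(tag) <= driver_cuda_version(nvidia_smi_output)
```

On a GPU node, image_fits_driver(tag, read_nvidia_smi()) runs the check against the live driver. The parsing functions take plain strings, so they can also be exercised with saved nvidia-smi output.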

Use an image

After you’ve chosen an image and a tag, you can use an ML container image as a base for your own custom image, or run it directly on CKS or SUNK. In the following examples, replace [TAG] with a tag from the packages list.

Build a custom image

To add your own dependencies, use an ML container image as the base image in a Dockerfile:

FROM ghcr.io/coreweave/ml-containers/torch-extras:[TAG]

# Install your additional dependencies
RUN pip install --no-cache-dir my-package

Run on CKS

Reference the image in the image field of a Pod specification:

apiVersion: v1
kind: Pod
metadata:
  name: pytorch-training
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: ghcr.io/coreweave/ml-containers/torch-extras:[TAG]
      command: ["python", "train.py"]
      resources:
        limits:
          nvidia.com/gpu: "8"
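Before launching a full training run, it can help to confirm what the container actually sees. The following is a minimal sketch of a check script, here given the hypothetical name check_env.py, that could be run in place of train.py in the Pod command above. It reports the Python version, the PyTorch version, the CUDA version PyTorch was built against, and the number of visible GPUs. It assumes nothing about the image beyond PyTorch being importable, and it falls back gracefully when PyTorch is missing.

```python
import platform


def describe_environment() -> dict:
    """Collect Python, PyTorch, and GPU details visible to this process."""
    info = {
        "python": platform.python_version(),
        "torch": None,
        "cuda_build": None,
        "cuda_available": False,
        "gpu_count": 0,
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, for example when run outside the image.
        return info

    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    info["gpu_count"] = torch.cuda.device_count()
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

With the resource limit shown in the Pod specification, the expectation would be that gpu_count matches the requested number of GPUs. A mismatch, or cuda_available being False, points to a scheduling or driver problem rather than a problem in the training code.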

Run on SUNK

SUNK uses Pyxis and enroot to run containers. Pass the image to srun with the --container-image flag. In the container URI, a # separates the registry host from the image path:

srun --container-image=ghcr.io#coreweave/ml-containers/torch-extras:[TAG] \
  --container-mounts=/mnt/home:/mnt/home \
  --pty bash -i
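If image references are stored in the usual registry/path:tag form, the conversion to the # form can be sketched as follows. The function name to_enroot_uri is hypothetical, and the rule used to recognize a registry host (the first path component contains a dot or a colon, or is localhost) is a common heuristic assumed here, not something this page specifies.

```python
def to_enroot_uri(image_ref: str) -> str:
    """Rewrite 'registry-host/image-path:tag' as 'registry-host#image-path:tag'."""
    registry, sep, path = image_ref.partition("/")
    # Heuristic (an assumption): treat the first component as a registry host
    # only if it looks like a hostname.
    looks_like_host = "." in registry or ":" in registry or registry == "localhost"
    if not sep or not path or not looks_like_host:
        raise ValueError(
            f"expected '<registry-host>/<image-path>', got {image_ref!r}"
        )
    return f"{registry}#{path}"


if __name__ == "__main__":
    ref = "ghcr.io/coreweave/ml-containers/torch-extras:[TAG]"
    print(f"--container-image={to_enroot_uri(ref)}")
```

Only the first slash is replaced, so the rest of the image path and the tag pass through unchanged.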

On CoreWeave-managed Nodes, running containers with enroot requires the CoreWeave AppArmor profile. See Enroot apparmor for setup. For a complete walkthrough of running a distributed training job on SUNK, see Submit a training job.

Additional resources

For more information, see the following resources:

coreweave/ml-containers repository: Dockerfiles and source for all images.
Packages list: every published image and its current tags.
Slurm images: the SUNK-built slurm-containers images for the Slurm control plane and nodes.
Create custom images: customize a published image for SUNK.
Introduction to third-party frameworks: frameworks supported on CKS and SUNK.
