Model Training and Fine-Tuning

Welcome to Machine Learning on CoreWeave Cloud

Training Machine Learning models, especially models of modern Deep Neural Networks, is at the center of CoreWeave Cloud's architecture. The entire CoreWeave Cloud stack is purpose-built to enable highly scalable, cost-efficient model training.

In addition to its core tech stack, CoreWeave has a history of supporting our customers with cutting-edge Machine Learning research through in-house expertise, industry partnerships, and contributions to research organizations. Our team has extensive experience in training large transformer architectures, optimizing Hugging Face code, and selecting the right hardware for any given job.

Did You Know?

CoreWeave partnered with the open source research collective EleutherAI to develop and train the world's largest open source NLP model, GPT-NeoX-20B!

There are many ways to run Machine Learning tasks on CoreWeave and, as is typical of the space, several different methods can achieve the same result.

Frameworks and tools

🧠 Determined AI

Determined AI is an experiment-oriented MLOps platform featuring hyperparameter search. Experiments run in containers on CoreWeave Kubernetes, with the DevOps details abstracted away behind a simple CLI and UI, and the platform itself can be deployed with a single click from the CoreWeave Application Catalog. Determined AI is a great way to run large distributed training jobs, with support for most popular frameworks.
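
As a rough illustration of what porting a PyTorch training loop to Determined looks like, the sketch below outlines a PyTorchTrial subclass. The build_model, train_dataset, and val_dataset helpers and the lr hyperparameter are placeholders for your own code, and the exact class layout should be checked against the Determined documentation for the version you deploy.

```python
import torch
from determined.pytorch import DataLoader, PyTorchTrial, PyTorchTrialContext

# Placeholders for your own model and dataset code (not part of Determined).
from my_project import build_model, train_dataset, val_dataset


class MyTrial(PyTorchTrial):
    def __init__(self, context: PyTorchTrialContext):
        self.context = context
        # Determined wraps the model and optimizer so it can handle
        # distributed training and checkpointing on your behalf.
        self.model = context.wrap_model(build_model())
        self.optimizer = context.wrap_optimizer(
            torch.optim.AdamW(self.model.parameters(), lr=context.get_hparam("lr"))
        )

    def build_training_data_loader(self) -> DataLoader:
        return DataLoader(train_dataset(), batch_size=self.context.get_per_slot_batch_size())

    def build_validation_data_loader(self) -> DataLoader:
        return DataLoader(val_dataset(), batch_size=self.context.get_per_slot_batch_size())

    def train_batch(self, batch, epoch_idx, batch_idx):
        data, labels = batch
        loss = torch.nn.functional.cross_entropy(self.model(data), labels)
        self.context.backward(loss)
        self.context.step_optimizer(self.optimizer)
        return {"loss": loss}

    def evaluate_batch(self, batch):
        data, labels = batch
        loss = torch.nn.functional.cross_entropy(self.model(data), labels)
        return {"validation_loss": loss}
```

An experiment built around a trial class like this is submitted with the det CLI alongside an experiment configuration file that defines the hyperparameter search space and the number of GPU slots per trial.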

🦑 Argo Workflows

Our favorite workflow runner, Argo Workflows, can easily be tasked to train or fine-tune a model and automatically deploy an Inference endpoint when finished.
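
To make the train-then-deploy pattern concrete, here is a minimal, hypothetical sketch that submits a two-step Argo Workflow through the official Kubernetes Python client. The image names, commands, and namespace are placeholders, and in practice the Workflow is often written as a YAML manifest instead.

```python
from kubernetes import client, config

# Hypothetical two-step Workflow: fine-tune a model, then deploy an inference endpoint.
# Image names, commands, and the namespace below are placeholders.
workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "finetune-and-deploy-"},
    "spec": {
        "entrypoint": "pipeline",
        "templates": [
            {
                "name": "pipeline",
                # Steps run group by group; here each group has a single step.
                "steps": [
                    [{"name": "finetune", "template": "finetune"}],
                    [{"name": "deploy", "template": "deploy"}],
                ],
            },
            {
                "name": "finetune",
                "container": {
                    "image": "registry.example.com/finetune:latest",
                    "command": ["python", "finetune.py"],
                },
            },
            {
                "name": "deploy",
                "container": {
                    "image": "registry.example.com/deploy-tools:latest",
                    "command": ["kubectl", "apply", "-f", "inference-service.yaml"],
                },
            },
        ],
    },
}

config.load_kube_config()  # Or load_incluster_config() when running inside the cluster.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="argoproj.io",
    version="v1alpha1",
    namespace="tenant-example",
    plural="workflows",
    body=workflow,
)
```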

🏃‍♀️ Kubeflow Training Operators

Kubeflow Training Operators offer a Kubernetes-native way to run distributed training jobs. They support TensorFlow, PyTorch Distributed, and any MPI-style framework such as DeepSpeed.
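
As a sketch of the worker code such a job runs, the snippet below shows a bare-bones PyTorch DistributedDataParallel loop; the PyTorchJob controller injects the MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE environment variables that torch.distributed reads. The model and dataset here are stand-ins for your own, and one GPU per worker pod is assumed.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# The operator sets MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE, which
# init_process_group() reads from the environment by default.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
device = torch.device("cuda:0")  # Assumes one GPU per worker pod.

# Stand-in dataset and model; replace with your own.
dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
sampler = DistributedSampler(dataset)  # Shards the data across workers.
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

model = DDP(torch.nn.Linear(32, 2).to(device), device_ids=[0])
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for epoch in range(3):
    sampler.set_epoch(epoch)  # Reshuffle shards each epoch.
    for features, labels in loader:
        features, labels = features.to(device), labels.to(device)
        loss = torch.nn.functional.cross_entropy(model(features), labels)
        optimizer.zero_grad()
        loss.backward()   # Gradients are all-reduced across workers here.
        optimizer.step()
    if rank == 0:
        print(f"epoch {epoch} loss {loss.item():.4f}")

dist.destroy_process_group()
```

The script itself is an ordinary torch.distributed program; the operator's job is to launch one copy per worker and wire up the rendezvous environment variables.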

SUNK: Slurm on Kubernetes

Slurm is the de facto scheduler for large HPC jobs in supercomputing centers, government laboratories, universities, and companies worldwide. It performs workload management for more than half of the top 10 systems on the TOP500 list.

SUNK (Slurm on Kubernetes) is an implementation of Slurm which is deployed on Kubernetes via a Helm chart.

note

SUNK is currently available for reserved instance customers only. Please contact support for more information.