Model Training on CoreWeave

Welcome to model training on CoreWeave Cloud

Tip

By answering the questions posed in each section of this guide, your team will be better prepared for conversations with a CoreWeave Support Engineer if needed.

CoreWeave's infrastructure is optimized for machine learning use cases. Because of this, model training setups on CoreWeave often differ from those deployed on other Cloud platforms.

The following serves as both an onboarding checklist and a best practices guide, including overviews of CoreWeave solutions for model training and fine-tuning.

This guide also covers CoreWeave-specific considerations and recommended methods for model training, which are useful both for creating new training workflows and for migrating existing ones from other platforms onto CoreWeave.

Solutions overview

The following overviews offer brief descriptions of each of CoreWeave's solutions for model training and fine-tuning. To learn more about any solution, click the Learn more about... card provided in each section.

Storage

  • What kind of storage solution are you using for training code, datasets, and training checkpoints?

Prior to being loaded, datasets, training code, and training checkpoints must be stored somewhere, whether remotely or locally to the hardware.

CoreWeave solutions

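Product | Best for...

CoreWeave Object Storage | ...storing training code, training checkpoints

CoreWeave Accelerated Object Storage | ...storing training code, training checkpoints, model weights

All-NVMe, network-attached storage | ...storing training code, datasets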

CoreWeave Object Storage

Best for: Training code, training checkpoints

CoreWeave Object Storage is an S3-compatible solution that allows data to be stored and retrieved in a flexible and efficient way, featuring multi-region support, easy start-up, and simple SDK integrations.
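As a minimal sketch of how training checkpoints might be organized in an S3-compatible store, the helpers below build sortable object keys and pick the most recent checkpoint for a run. The key layout, the function names, and the idea of passing a custom endpoint are assumptions made for illustration; they are not CoreWeave conventions. The client helper relies on the third-party boto3 package and on an endpoint URL and credentials that you would supply yourself.

```python
import re
from typing import Optional

# Hypothetical key layout: <run_name>/checkpoints/step-<8-digit step>.pt
_KEY_PATTERN = re.compile(r"^(?P<run>[^/]+)/checkpoints/step-(?P<step>\d{8})\.pt$")


def checkpoint_key(run_name: str, step: int) -> str:
    """Build an object key; zero-padding keeps keys sorted by training step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    return f"{run_name}/checkpoints/step-{step:08d}.pt"


def latest_checkpoint(keys: list[str], run_name: str) -> Optional[str]:
    """Return the key with the highest step for run_name, or None if none match."""
    best, best_step = None, -1
    for key in keys:
        match = _KEY_PATTERN.match(key)
        if match and match["run"] == run_name and int(match["step"]) > best_step:
            best, best_step = key, int(match["step"])
    return best


def make_s3_client(endpoint_url: str, access_key: str, secret_key: str):
    """Create a client for an S3-compatible endpoint (requires boto3)."""
    import boto3  # imported lazily so the helpers above work without it

    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,  # assumption: the endpoint for your storage region
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
```

When resuming a run, the keys returned by a bucket listing can be passed to `latest_checkpoint` to decide which object to download.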

CoreWeave Accelerated Object Storage

Best for: Training code, training checkpoints, model weights

CoreWeave's Accelerated Object Storage is a series of Anycasted NVMe-backed storage caches that provide blazing fast download speeds. It is ideal for storing training code, training checkpoints, and model weights. Training code in particular may best be served using Accelerated Object Storage.

All-NVMe, network-attached storage

Best for: Training code, datasets

Block storage volumes served from the high-performance all-NVMe storage tier are an ideal solution for dataset or training code storage. These virtual disks readily outperform local workstation SSDs, and are scalable up to the petabyte range.

Because they are presented to the operating system as generic block devices, these volumes are treated like traditional, physically connected storage devices.

Compute

  • What types and sizes of GPUs run your training experiments?

High-end compute is imperative for model training and fine-tuning. CoreWeave specializes in providing several types of high-end GPUs and CPUs for model training and fine-tuning.

CoreWeave solutions

See our GPU selection guide and benchmark comparisons for additional details on compute offerings.

GPU and CPU nodes

CoreWeave's entire infrastructure stack is designed with model training and inference in mind. Node hardware is served from high-end data center regions across North America, and is purpose-built for HPC workloads. An extensive selection of high-performance GPUs and CPUs is available for model training, including NVIDIA HGX H100s.

Note

Node type availability is contingent upon contract type.

NVIDIA HGX H100s

Important

Due to high demand, A100 NVLINK (HGX) and H100 NVLINK (HGX) nodes are currently fully committed on client contracts, and are therefore not currently available for on-demand use.

We recommend a conversation with the CoreWeave team to build a strategic plan catered to your needs to make use of available infrastructure and to plan for your future capacity requirements. Contact CoreWeave Sales to get started.

The NVIDIA HGX H100 enables up to seven times more efficient high-performance computing (HPC) applications, up to nine times faster AI training on large models, and up to thirty times faster AI inference than the NVIDIA HGX A100.

This speed, combined with the lowest NVIDIA GPUDirect network latency in the market with the NVIDIA Quantum-2 InfiniBand platform, reduces the training time of AI models to "days or hours, instead of months." With AI permeating nearly every industry today, this speed and efficiency have never been more vital for HPC applications.

Workload managers

  • What kind of workload management solution are you currently using, or which would you like to use, for managing training tasks?

  • Are you using any MLOps tools in your workflow?

Workload managers provide high-level interfaces for launching and managing training workflow tasks, allowing training tasks to be conducted at scale.

CoreWeave solutions

Solution | Best for when...

CoreWeave Kubernetes | ...your team uses Kubernetes for workload management.

SUNK: Slurm on Kubernetes | ...your team uses Slurm to manage workloads, and would like to continue to do so, while also leveraging Kubernetes.

CoreWeave Virtual Servers | ...your team currently uses virtual machines to manage workloads, and wants to continue to do so.

CoreWeave Kubernetes

CoreWeave Kubernetes is the most popular workload management solution on CoreWeave, and the base layer for all compute usage on CoreWeave.

If you are already using Kubernetes to manage your workflows, CoreWeave Kubernetes is an easy transition to make. Unlike managed Kubernetes products offered by other Cloud providers, CoreWeave Kubernetes is optimized for machine learning applications, reducing the need for manual configurations while still maintaining configuration flexibility. With CoreWeave Kubernetes, all the benefits of container orchestration are maintained, without losing the fidelity of bare metal performance on high-end GPUs.

On CoreWeave Kubernetes, managing nodes isn't a requirement: workloads can be run without worrying about the underlying infrastructure. Like any other node type, GPU nodes can experience failures and other issues. CoreWeave automatically detects these issues and cordons off problematic nodes, so your workloads are not scheduled to them until the problem is resolved.
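The effect of cordoning can be illustrated with a small simulation. In Kubernetes, cordoning a node marks it as unschedulable, and the scheduler then skips it for new Pods. The sketch below models only that filtering step; the `Node` class, the node names, and the GPU counts are hypothetical, and this is not how CoreWeave's health checking is implemented.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    gpus_free: int
    unschedulable: bool = False  # mirrors the flag Kubernetes sets on a cordoned node


def cordon(node: Node) -> None:
    """Mark a node so that no new workloads are placed on it."""
    node.unschedulable = True


def schedulable_nodes(nodes: list[Node], gpus_needed: int) -> list[str]:
    """Names of nodes that are not cordoned and have enough free GPUs."""
    return [
        n.name for n in nodes
        if not n.unschedulable and n.gpus_free >= gpus_needed
    ]


# Hypothetical three-node pool in which one node develops a fault.
nodes = [Node("gpu-node-1", 8), Node("gpu-node-2", 8), Node("gpu-node-3", 4)]
cordon(nodes[1])
print(schedulable_nodes(nodes, gpus_needed=8))  # prints ['gpu-node-1']
```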

SUNK: Slurm on Kubernetes

If you currently manage workloads with Slurm, but would also like to use CoreWeave Kubernetes, CoreWeave's open source project SUNK (Slurm on Kubernetes) offers a solution.

With SUNK, each "node" is actually a Kubernetes Pod running a slurmd container. SUNK node definitions provide an easy, fast, and declarative way to define and scale your Slurm cluster. Node definitions expose Kubernetes resource requests, limits, node affinities, and replica counts, allowing you to dynamically deploy exactly what is needed, then tear it down afterwards.

Each node provides a mechanism for specifying the desired container image. This allows you to build lightweight, purpose-built containers for consistency between development environments and rapid scaling.

CoreWeave Virtual Servers

If you are currently using virtual machines to manage workloads, CoreWeave Virtual Servers are likely the closest match for a direct one-to-one transition, although other solutions may be more appropriate for your use case.

Virtual Servers are the most "vanilla" method for managing workloads, and are a reasonable choice when Kubernetes or another workload management system is not desirable.

Note

Administrative and performance overheads may make Virtual Servers a less desirable solution for distributed task management than other solutions.

Serialization

In machine learning, serialization is the process of converting a data object (e.g. a model) into a format that can be stored or transmitted easily. Model serialization can significantly improve network load times as well as load times from local disk volumes, especially when using very large models.
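To make the idea concrete, here is a minimal sketch of a serializer that packs named weight vectors into a single byte string and reads them back. The format (a length-prefixed JSON header followed by raw little-endian float32 data) is invented for this example and is not the format used by Tensorizer or any other library.

```python
import json
import struct


def serialize(weights: dict[str, list[float]]) -> bytes:
    """Pack weights as: 4-byte header length, JSON header, raw float32 payload."""
    header = {}
    payload = bytearray()
    for name, values in weights.items():
        # Record where each tensor starts so it can be located without scanning.
        header[name] = {"offset": len(payload), "count": len(values)}
        payload += struct.pack(f"<{len(values)}f", *values)
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<I", len(header_bytes)) + header_bytes + bytes(payload)


def deserialize(blob: bytes) -> dict[str, list[float]]:
    """Rebuild the name -> values mapping from bytes produced by serialize()."""
    (header_len,) = struct.unpack_from("<I", blob, 0)
    header = json.loads(blob[4:4 + header_len].decode("utf-8"))
    data_start = 4 + header_len
    return {
        name: list(
            struct.unpack_from(f"<{meta['count']}f", blob, data_start + meta["offset"])
        )
        for name, meta in header.items()
    }
```

Storing offsets in the header means a reader can seek directly to the tensor it needs, which is one reason purpose-built formats tend to load faster than general-purpose object serialization.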

  • Are your models serialized, or should they be?

CoreWeave solutions

CoreWeave Tensorizer: CoreWeave's PyTorch module, model, and tensor serializer and deserializer.

CoreWeave Tensorizer

Best for: Large models

CoreWeave's Tensorizer is a serializer and deserializer for modules, models, and tensors, which makes it possible to load even very large models in less than five seconds, for easier, more flexible, and more cost-efficient methods of serving models at scale.

MLOps

  • Are you currently using, or will you need to use, any MLOps tooling to run training jobs?

MLOps encompasses platforms and tools that use automation to enable and manage complex, multi-step training workflows. These tools abstract away some of the low-level logic, and provide a graphical interface that makes training more intuitive and allows workloads to be visualized and monitored.

CoreWeave supports several popular MLOps tools and platforms, though not every one available. The MLOps tools and platforms currently supported on CoreWeave Cloud are listed below.

CoreWeave solutions

Solution | Use case

For creating ML workflows on Kubernetes that are simple, portable, and scalable.

For defining, executing, and managing complex, multi-step workflows as code.

For assisting with the deployment of model training tools and frameworks.

For general help with model training, including some additional specialized distributed training strategies.

Weights & Biases

Weights & Biases (WandB) is a machine learning platform that enables developers to build models faster. The WandB platform provides a best-in-class toolkit to track experiments, iterate on datasets, evaluate model performance, reproduce models, and manage machine learning workflows end-to-end.

CoreWeave partners with Weights & Biases to bring insight into your jobs and experiments running on CoreWeave Cloud infrastructure, enabling a deeper understanding of how your models perform. Developers can deploy Weights & Biases tools, such as the Launch agent, directly onto CoreWeave infrastructure. The Launch agent submits jobs and monitors runs on Weights & Biases, and the platform also visualizes experiment progress and monitors metrics.

Kubeflow training operators

Important

CoreWeave does not currently support all of Kubeflow - only its training operators.

The Kubeflow project is dedicated to making deployments of Machine Learning (ML) workflows on Kubernetes simple, portable, and scalable. The goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures.

CoreWeave Cloud supports running Kubeflow Training Operators to easily train your Machine Learning models across a variety of frameworks and backends.
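As an illustration of what a training operator consumes, the sketch below builds a PyTorchJob manifest as a Python dictionary. The field names follow the upstream Kubeflow Training Operator `kubeflow.org/v1` API as commonly documented, so they should be checked against the operator version deployed in your cluster. The job name, image, command, and GPU counts are placeholders, not CoreWeave recommendations.

```python
import json


def pytorch_job(name: str, image: str, command: list[str],
                workers: int, gpus_per_replica: int) -> dict:
    """Build a PyTorchJob manifest with one master and `workers` worker replicas."""

    def replica(count: int) -> dict:
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [{
                        "name": "pytorch",  # container name the operator looks for
                        "image": image,
                        "command": list(command),
                        "resources": {"limits": {"nvidia.com/gpu": gpus_per_replica}},
                    }]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            }
        },
    }


# Placeholder values for illustration only.
manifest = pytorch_job(
    name="example-finetune",
    image="registry.example.com/train:latest",
    command=["python", "train.py"],
    workers=3,
    gpus_per_replica=8,
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with standard Kubernetes tooling, since Kubernetes accepts manifests in JSON as well as YAML.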

Argo Workflows

Argo Workflows is a powerful, open-source workflow management system available in the CoreWeave Applications Catalog.

It's used to define, execute, and manage complex, multi-step workflows in a code-based manner. It's developed and maintained as a Cloud Native Computing Foundation (CNCF) Graduated project, and uses the principles of Cloud-native computing to ensure scalability, resiliency, and flexibility.
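The idea of a multi-step workflow defined as code can be sketched with a dependency graph. The example below uses Python's standard `graphlib` module rather than Argo itself, and the step names are hypothetical. It groups steps into batches in which every step's dependencies are already complete, which is the ordering guarantee a workflow engine provides.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it can start.
workflow = {
    "download-dataset": set(),
    "preprocess": {"download-dataset"},
    "train": {"preprocess"},
    "evaluate": {"train"},
    "export-weights": {"train"},
}


def parallel_batches(steps: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into batches whose members can run at the same time."""
    sorter = TopologicalSorter(steps)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock dependent steps
    return batches


for batch in parallel_batches(workflow):
    print(batch)
# prints:
# ['download-dataset']
# ['preprocess']
# ['train']
# ['evaluate', 'export-weights']
```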

Determined AI

Determined AI is a deep learning training platform that makes building models fast and easy. With high-performance distributed training, state-of-the-art hyperparameter tuning, GPU scheduling, and model management in a single integrated environment, Determined lets researchers and scientists focus on building models instead of managing infrastructure.

MosaicML

The MosaicML platform provides organizations with a high-performance model training interface and inference service, leveraging CoreWeave Cloud while abstracting away the complexity of generative AI model development and deployment.

Distributed training

Distributed training workflows are workflows that use more than a single GPU.

  • Are you currently using, or will you need to set up, distributed training?

CoreWeave solutions

Solution | Description

CoreWeave Object Storage | Best for storing training code, training checkpoints, and model weights.

CoreWeave Accelerated Object Storage | Anycasted NVMe-backed Object Storage, best for storing training code, training checkpoints, and model weights.

All-NVMe, network-attached storage | Best for storing datasets and training code.

As distributed training performance relies heavily on interconnect solutions, CoreWeave also provides several solutions for HPC interconnect, including NVLink, InfiniBand with GPUDirect, and SHARP.

Distributed training is when multiple nodes, and therefore multiple GPUs, are used to train a single model. This method is almost always necessary for training very large models.
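One common form of distributed training is data parallelism: each worker computes gradients on its own shard of the data, the gradients are averaged across workers, and every worker applies the same update. The sketch below simulates this in plain Python for a one-parameter model. The function names and the toy dataset are assumptions for illustration; real frameworks perform the averaging step with a collective operation over the interconnect.

```python
def shard(dataset: list, num_workers: int) -> list[list]:
    """Split the dataset round-robin so each worker gets its own slice."""
    return [dataset[i::num_workers] for i in range(num_workers)]


def local_gradient(w: float, samples: list[tuple[float, float]]) -> float:
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)


def all_reduce_mean(values: list[float]) -> float:
    """Stand-in for the collective that averages gradients across workers."""
    return sum(values) / len(values)


def train_step(w: float, shards: list[list], lr: float) -> float:
    """One synchronous data-parallel step: local gradients, average, update."""
    grads = [local_gradient(w, s) for s in shards]
    return w - lr * all_reduce_mean(grads)


# Toy dataset following y = 3x, split across four simulated workers.
dataset = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = shard(dataset, num_workers=4)
w = 0.0
for _ in range(50):
    w = train_step(w, shards, lr=0.01)
print(round(w, 6))  # prints 3.0
```

With equally sized shards, the averaged gradient equals the gradient computed on the full dataset, so the simulated workers follow the same trajectory as single-device training.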

CoreWeave offers several solutions for multi-node, multi-GPU training.

Tip

Distributed training may also be accomplished through the use of supported MLOps tools. Explore our partners for even more distributed training solutions.

NVLink

Many of CoreWeave's GPUs are enabled with NVIDIA NVLink GPU interconnect. With a special wiring array and software component, NVLink enables high-speed hardware connectivity by leveraging shared pools of memory, allowing GPUs to send and receive data extremely quickly. NVLink provides a significantly faster alternative for connecting multi-GPU systems compared to traditional PCIe-based solutions.

To select a GPU with NVLink capability, look for the node types with NVLink in their titles and labels.

InfiniBand

CoreWeave has partnered with NVIDIA in its design of interconnect for A100 HGX training clusters. All CoreWeave A100 NVLINK GPUs offer GPUDirect RDMA over InfiniBand, in addition to standard IP/Ethernet networking.

GPUDirect allows GPUs to communicate directly with other GPUs across an InfiniBand fabric, without passing through the host system CPU and operating system kernel, significantly lowering synchronization latency.

SHARP

Traditionally, communication requirements scale proportionally with the number of nodes in an HPC cluster. NVIDIA® Mellanox® Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) moves collective operations from individual nodes into the network. This allows for a flat scaling curve and significantly improved effective interconnect bandwidth.
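The intuition behind in-network aggregation can be shown with a simplified model in which switches sum the values arriving from the level below and forward a single partial result upward. This is only a conceptual sketch: the `radix` parameter and the tree shape are assumptions, and the code does not reproduce the SHARP protocol itself. In this model each node sends its data into the network once, and only the number of switch levels grows as the cluster gets larger.

```python
def in_network_sum(values: list[float], radix: int) -> tuple[float, int]:
    """Sum per-node values by aggregating at switches that each combine
    up to `radix` inputs. Returns (total, number_of_switch_levels)."""
    if not values:
        raise ValueError("at least one node value is required")
    if radix < 2:
        raise ValueError("radix must be at least 2")
    level = list(values)
    levels = 0
    while len(level) > 1:
        # Each switch reduces a group of inputs to one partial sum.
        level = [sum(level[i:i + radix]) for i in range(0, len(level), radix)]
        levels += 1
    return level[0], levels


print(in_network_sum(list(range(16)), radix=4))  # prints (120, 2)
print(in_network_sum(list(range(64)), radix=4))  # prints (2016, 3)
```

Quadrupling the node count from 16 to 64 adds only one aggregation level in this model, which is the kind of flat scaling behavior the paragraph above describes.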

Need help?

If you've determined the needs for your use case, but still need additional assistance, or aid in building custom configurations for your workflow, please reach out to CoreWeave support.
