Model Training on CoreWeave
Welcome to model training on CoreWeave Cloud
By answering the questions posed in each section of this guide, your team will be better prepared for conversations with a CoreWeave Support Engineer if needed.
CoreWeave's infrastructure is optimized for machine learning use cases. Because of this, model training setups on CoreWeave are often different from those deployed on other Cloud platforms.
The following serves as both an onboarding checklist and a best practices guide, including overviews of CoreWeave solutions for model training and fine-tuning.
This guide also covers CoreWeave-specific considerations and recommended methods for model training, which are useful both for creating new training workflows and for migrating existing ones from other platforms onto CoreWeave.
Solutions overview
The following overviews offer brief descriptions of each of CoreWeave's solutions for model training and fine-tuning.
Storage
- What kind of storage solution are you using for training code, datasets, and training checkpoints?
Before they can be loaded, datasets, training code, and training checkpoints must be stored somewhere, whether remotely or on storage local to the training hardware.
CoreWeave solutions
| Product | Best for... |
|---|---|
| CoreWeave Object Storage | ...storing training code, training checkpoints |
| CoreWeave Accelerated Object Storage | ...storing training code, training checkpoints, model weights |
| All-NVMe network-attached storage | ...storing training code, datasets |
CoreWeave Object Storage
Best for: Training code, training checkpoints
CoreWeave Object Storage is an S3-compatible solution that allows data to be stored and retrieved in a flexible and efficient way, featuring multi-region support, easy start-up, and simple SDK integrations.
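Because the service is S3-compatible, any S3 SDK can talk to it. As a minimal sketch using boto3, with a placeholder endpoint URL, bucket, and key names rather than real CoreWeave values:

```python
import boto3

# Sketch of using CoreWeave Object Storage through an S3-compatible SDK.
# The endpoint URL and bucket name are placeholders; use the values from
# your CoreWeave Object Storage credentials.
s3 = boto3.client(
    "s3",
    endpoint_url="https://object.example-region.coreweave.com",  # placeholder
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# Upload a training checkpoint, then fetch it back.
s3.upload_file("checkpoint-0001.pt", "training-artifacts", "checkpoints/checkpoint-0001.pt")
s3.download_file("training-artifacts", "checkpoints/checkpoint-0001.pt", "checkpoint-0001.pt")
```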
CoreWeave Accelerated Object Storage
Best for: Training code, training checkpoints, model weights
CoreWeave's Accelerated Object Storage is a series of Anycasted, NVMe-backed storage caches that provide blazing-fast download speeds. Training code in particular may best be served from Accelerated Object Storage.
All-NVMe, network-attached storage
Best for: Training code, datasets
Block storage volumes served from the high-performance, all-NVMe storage tier are an ideal solution for dataset and training code storage. These virtual disks readily outperform local workstation SSDs and scale to the petabyte range.
Presented to the operating system as generic block devices, they can be treated as traditional, physically connected storage devices.
Compute
- What types and sizes of GPUs run your training experiments?
High-end compute is imperative for model training and fine-tuning, and CoreWeave specializes in providing several types of high-end GPUs and CPUs for these workloads.
CoreWeave solutions
| Solution | See also |
|---|---|
| CoreWeave GPUs and CPUs | See our GPU selection guide and benchmark comparisons for additional details on compute offerings. |
GPU and CPU nodes
CoreWeave's entire infrastructure stack is designed with model training and inference in mind. Node hardware is served from high-end data centers across North America and is purpose-built for HPC workloads. An extensive selection of high-performance GPUs and CPUs is available for model training, including NVIDIA HGX H100s.
Node type availability is contingent upon contract type.
NVIDIA HGX H100s
Due to high demand, A100 NVLINK (HGX) and H100 NVLINK (HGX) nodes are currently fully committed on client contracts, and are therefore not currently available for on-demand use.
We recommend a conversation with the CoreWeave team to build a strategic plan catered to your needs to make use of available infrastructure and to plan for your future capacity requirements. Contact CoreWeave Sales to get started.
The NVIDIA HGX H100 enables up to seven times more efficient high-performance computing (HPC) applications, up to nine times faster AI training on large models, and up to thirty times faster AI inference than the NVIDIA HGX A100.
This speed, combined with the lowest NVIDIA GPUDirect network latency in the market with the NVIDIA Quantum-2 InfiniBand platform, reduces the training time of AI models to "days or hours, instead of months." With AI permeating nearly every industry today, this speed and efficiency have never been more vital for HPC applications.
Workload managers
- What workload management solutions are you currently using, or which would you like to use, for managing training tasks?
- Are you using any MLOps tools in your workflow?
Workload managers provide high-level interfaces for launching and managing training workflow tasks, allowing training tasks to be conducted at scale.
CoreWeave solutions
| Solution | Best for when... |
|---|---|
| CoreWeave Kubernetes | ...your team uses Kubernetes for workload management. |
| SUNK | ...your team uses Slurm to manage workloads, and would like to continue to do so, while also leveraging Kubernetes. |
| Virtual Servers | ...your team currently uses virtual machines to manage workloads, and wants to continue to do so. |
CoreWeave Kubernetes
CoreWeave Kubernetes is the most popular workload management solution on CoreWeave, and the base layer for all compute usage on CoreWeave.
If you are already using Kubernetes to manage your workflows, CoreWeave Kubernetes is an easy transition to make. Unlike managed Kubernetes products offered by other Cloud providers, CoreWeave Kubernetes is optimized for machine learning applications, reducing the need for manual configurations while still maintaining configuration flexibility. With CoreWeave Kubernetes, all the benefits of container orchestration are maintained, without losing the fidelity of bare metal performance on high-end GPUs.
On CoreWeave Kubernetes, managing nodes isn't a requirement: workloads can be run without worrying about the underlying infrastructure. Like any other node type, GPU nodes can experience failures and other issues. CoreWeave automatically detects these issues and cordons off problematic nodes, so your workloads are not scheduled to them until the problem is resolved.
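As a sketch of what workload submission can look like, the following uses the official Kubernetes Python client to create a single-GPU training Job. The image, Job name, and namespace are illustrative, not CoreWeave-specific values:

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig; inside a Pod, use
# config.load_incluster_config() instead.
config.load_kube_config()

container = client.V1Container(
    name="trainer",
    image="ghcr.io/example/trainer:latest",  # hypothetical training image
    command=["python", "train.py"],
    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
)
job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="example-training-job"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        )
    ),
)
client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```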
SUNK: Slurm on Kubernetes
If you are currently managing workloads utilizing Slurm, but would also like to use CoreWeave Kubernetes, CoreWeave's SUNK (Slurm on Kubernetes) offers a solution.
With SUNK, each "node" is actually a Kubernetes Pod running a slurmd container. SUNK node definitions provide an easy, fast, and declarative way to define and scale your Slurm cluster. Node definitions expose Kubernetes resource requests, limits, node affinities, and replica counts, allowing you to dynamically deploy exactly what is needed, then tear it down afterward.
Each node provides a mechanism for specifying the desired container image. This allows you to build lightweight, purpose-built containers for consistency between development environments and rapid scaling.
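Day-to-day interaction is unchanged from a standard Slurm cluster: jobs are still submitted with the familiar Slurm CLI. As a minimal sketch, with an illustrative script name and resource flags:

```python
import subprocess

# Submit a two-node, 16-GPU batch job with sbatch; train.sbatch is a
# hypothetical Slurm batch script.
result = subprocess.run(
    ["sbatch", "--nodes=2", "--gpus-per-node=8", "train.sbatch"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())  # e.g. "Submitted batch job 12345"
```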
CoreWeave Virtual Servers
If you are currently using virtual machines to manage workloads, CoreWeave Virtual Servers offer the most direct one-to-one transition, although other solutions may be more appropriate for your use case.
Virtual Servers are the most "vanilla" method for managing workloads, best suited to cases where Kubernetes or another workload management system is not desirable for some reason.
Administrative and performance overheads may make Virtual Servers a less desirable solution for distributed task management than other solutions.
Serialization
- Are your models serialized, or should they be?

In machine learning, serialization is the process of converting a data object (e.g., a model) into a format that can be stored or transmitted easily. Model serialization can significantly improve network load times, as well as load times from local disk volumes, especially when using very large models.
CoreWeave solutions
| Solution | Description |
|---|---|
| CoreWeave Tensorizer | CoreWeave's PyTorch module, model, and tensor serializer and deserializer. |
CoreWeave Tensorizer
Best for: Large models
CoreWeave's Tensorizer is a serializer and deserializer for modules, models, and tensors, which makes it possible to load even very large models in less than five seconds, for easier, more flexible, and more cost-efficient methods of serving models at scale.
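The following is a minimal sketch of the Tensorizer API as described in its documentation; the file name and stand-in model are illustrative, and a real workload would point at a much larger model and often an S3 URI:

```python
import torch
from tensorizer import TensorSerializer, TensorDeserializer

model = torch.nn.Linear(4096, 4096)  # stand-in for a large model

# Serialize the model's tensors to a file (local path or S3 URI).
serializer = TensorSerializer("model.tensors")
serializer.write_module(model)
serializer.close()

# Stream the tensors back into a freshly constructed module skeleton.
fresh_model = torch.nn.Linear(4096, 4096)
deserializer = TensorDeserializer("model.tensors")
deserializer.load_into_module(fresh_model)
deserializer.close()
```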
MLOps
- Are you currently using, or will you need to use, any MLOps tooling to run training jobs?
MLOps encompasses the platforms and tools that use automation to enable and manage complex, multi-step training workflows. These tools abstract away some of the low-level logic, make training more intuitive, and provide graphical interfaces for workload visualization and monitoring.
CoreWeave supports several popular MLOps tools and platforms, though not yet all of them. Below are the MLOps tools and platforms currently supported on CoreWeave Cloud.
CoreWeave solutions
| Solution | Use case |
|---|---|
| Weights & Biases | For tracking experiments, evaluating model performance, and managing ML workflows end-to-end. |
| Kubeflow Training Operators | For creating ML workflows on Kubernetes that are simple, portable, and scalable. |
| Argo Workflows | For defining, executing, and managing complex, multi-step workflows as code. |
| Determined AI | For assisting with the deployment of model training tools and frameworks. |
| MosaicML | For general help with model training, including some additional specialized distributed training strategies. |
Weights & Biases
Weights & Biases (WandB) is a machine learning platform that enables developers to build models faster. The WandB platform provides a best-in-class toolkit to track experiments, iterate on datasets, evaluate model performance, reproduce models, and manage machine learning workflows end-to-end.
CoreWeave partners with Weights & Biases to bring insight into your jobs and experiments running on CoreWeave Cloud infrastructure, enabling a deeper understanding of how your models perform. Developers can deploy Weights & Biases tools, such as the Launch agent, directly onto CoreWeave infrastructure; the agent submits jobs and monitors runs on Weights & Biases, in addition to visualizing experiment progress and metrics.
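A minimal experiment-tracking sketch with the wandb Python SDK; the project name and logged metrics below are illustrative:

```python
import wandb

# Hypothetical project name and hyperparameters, for illustration only.
run = wandb.init(project="coreweave-training", config={"lr": 3e-4, "epochs": 10})

for epoch in range(run.config.epochs):
    train_loss = 1.0 / (epoch + 1)  # placeholder for a real training loop
    wandb.log({"epoch": epoch, "train_loss": train_loss})

run.finish()
```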
Kubeflow training operators
CoreWeave does not currently support all of Kubeflow - only its training operators.
The Kubeflow project is dedicated to making deployments of Machine Learning (ML) workflows on Kubernetes simple, portable, and scalable. The goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures.
CoreWeave Cloud supports running Kubeflow Training Operators to easily train your Machine Learning models across a variety of frameworks and backends.
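As a sketch, a PyTorchJob custom resource can be created with the Kubernetes Python client. This assumes the training operator is installed in the cluster; the image and replica counts are illustrative:

```python
from kubernetes import client, config

config.load_kube_config()

# One master plus three workers, each requesting a single GPU.
replica = {
    "template": {"spec": {
        "restartPolicy": "OnFailure",
        "containers": [{
            "name": "pytorch",  # the operator requires this container name
            "image": "ghcr.io/example/trainer:latest",  # hypothetical image
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }],
    }},
}
pytorch_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "example-pytorchjob"},
    "spec": {"pytorchReplicaSpecs": {
        "Master": {"replicas": 1, **replica},
        "Worker": {"replicas": 3, **replica},
    }},
}
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="default",
    plural="pytorchjobs", body=pytorch_job,
)
```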
Argo Workflows
Argo Workflows is a powerful, open-source workflow management system available in the CoreWeave Applications Catalog.
It's used to define, execute, and manage complex, multi-step workflows in a code-based manner. It's developed and maintained as a Cloud Native Computing Foundation (CNCF) Graduated project, and uses the principles of Cloud-native computing to ensure scalability, resiliency, and flexibility.
Determined AI
Determined AI is a deep learning training platform that makes building models fast and easy. With high-performance distributed training, state-of-the-art hyperparameter tuning, GPU scheduling, and model management in a single integrated environment, Determined lets researchers and scientists focus on building models instead of managing infrastructure.
MosaicML
The MosaicML platform provides organizations with a high-performance model training interface and inference service, leveraging CoreWeave Cloud while abstracting away the complexity of generative AI model development and deployment.
Distributed training
- Are you currently using, or will you need to set up, distributed training?

Distributed training uses multiple nodes, and therefore multiple GPUs, to train a single model. This method is almost always necessary for training very large models.

CoreWeave solutions

Because distributed training performance relies heavily on the interconnect, CoreWeave provides several solutions for HPC interconnect, described below.

| Solution | Description |
|---|---|
| NVLink | High-speed GPU interconnect linking the GPUs within a node through shared pools of memory. |
| InfiniBand with GPUDirect RDMA | Allows GPUs to communicate directly across an InfiniBand fabric, bypassing the host CPU and kernel. |
| SHARP | Moves collective operations from individual nodes into the network for improved effective bandwidth. |

Distributed training may also be accomplished through the use of supported MLOps tools. Explore our integration partners for even more distributed training solutions. A minimal multi-GPU sketch follows.
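To make the moving parts concrete, here is a minimal PyTorch DistributedDataParallel sketch. It assumes launch via torchrun (which sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables), and the model and training loop are stand-ins:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # NCCL picks the fastest available transport: NVLink within a node,
    # InfiniBand (with GPUDirect RDMA where supported) across nodes.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in model; a real workload loads its architecture and data here.
    model = torch.nn.Linear(4096, 4096).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):  # placeholder training loop
        x = torch.randn(8, 4096, device=f"cuda:{local_rank}")
        loss = model(x).square().mean()
        optimizer.zero_grad()
        loss.backward()  # gradients are all-reduced across all GPUs here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, for example, torchrun --nproc_per_node=8 train.py on each node, the script scales from one GPU to many nodes without code changes.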
NVLink
Many of CoreWeave's GPUs are enabled with NVIDIA NVLink GPU interconnect. With a special wiring array and software component, NVLink enables high-speed hardware connectivity by leveraging shared pools of memory, allowing GPUs to send and receive data extremely quickly. NVLink provides a significantly faster alternative for connecting multi-GPU systems compared to traditional PCIe-based solutions.
To select a GPU with NVLink capability, look for node types with NVLink in their titles and labels.
InfiniBand
CoreWeave has partnered with NVIDIA in its design of interconnect for A100 HGX training clusters. All CoreWeave A100 NVLINK GPUs offer GPUDirect RDMA over InfiniBand, in addition to standard IP/Ethernet networking.
GPUDirect allows GPUs to communicate directly with other GPUs across an InfiniBand fabric, without passing through the host system CPU and operating system kernel, significantly lowering synchronization latency.
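In practice, GPUDirect over InfiniBand is exercised through NCCL. The environment variables below are commonly used to steer and inspect NCCL's transport selection; the right values are cluster-specific, so treat these as assumptions to verify for your environment:

```python
import os

# Illustrative NCCL settings for an InfiniBand fabric; these are
# cluster-specific assumptions, not CoreWeave defaults.
os.environ.setdefault("NCCL_DEBUG", "INFO")          # log transport selection
os.environ.setdefault("NCCL_IB_DISABLE", "0")        # permit the InfiniBand transport
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # interface for bootstrap traffic
```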
SHARP
Traditionally, communication requirements scale proportionally with the number of nodes in an HPC cluster. NVIDIA® Mellanox® Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) moves collective operations from individual nodes into the network. This allows for a flat scaling curve and significantly improved effective interconnect bandwidth.
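As a hedged sketch, NCCL can offload collectives to SHARP-capable fabrics through its CollNet plugin; whether this toggle applies depends on the cluster's switch hardware and NCCL build:

```python
import os

# Assumption: on fabrics with SHARP-capable switches, NCCL's CollNet plugin
# can offload allreduce into the network. Verify support for your environment.
os.environ["NCCL_COLLNET_ENABLE"] = "1"
```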
Need help?
If you've determined the needs of your use case but still need additional assistance, or help building custom configurations for your workflow, please reach out to CoreWeave support.