Training
Welcome to Machine Learning on CoreWeave Cloud
Training Machine Learning models, especially modern Deep Neural Networks, is at the center of CoreWeave Cloud's architecture. The entire CoreWeave Cloud stack is purpose-built to enable highly scalable, cost-efficient model training.
Our bare-metal nodes offer a wide range of NVIDIA GPUs, delivering top-of-the-line compute power for intensive workloads.
The CoreWeave network stack features InfiniBand Interconnect, allowing for extremely fast, low-latency network connections.
High-performance, network-attached storage loads and writes checkpoints at terabit speeds, and together with our software control plane it enables large distributed training jobs to be scaled up in seconds.
In addition to our core tech stack, CoreWeave has a history of supporting our customers with cutting-edge Machine Learning research through in-house expertise, industry partnerships, and contributions to research organizations. Our team has extensive experience in training large transformer architectures, optimizing Hugging Face code, and selecting the right hardware for any given job.
Did You Know?
CoreWeave partnered with the open source research collective EleutherAI to develop and train the world's largest open source NLP model, GPT-NeoX-20B!
There are many ways to run Machine Learning tasks on CoreWeave and, as is typical of the space, multiple methods can achieve the same result. These are the tools CoreWeave offers for Machine Learning tasks:
Virtual Machines are the most vanilla of our compute offerings: great for few-GPU experimentation in a familiar environment, but administrative and performance overheads make them less desirable for distributed tasks.
Our Kubernetes offering differs from that of most other leading cloud providers: we provide a fully managed cluster pre-populated with thousands of GPUs. There is no need to worry about cluster scaling or idle Virtual Machines incurring cost; you pay only for exactly what you use. Kubernetes access gives experienced MLOps teams the power to deploy their own stacks in a bare-metal container environment. The Kubernetes environment fully supports massive distributed training on our NVIDIA A100 HGX clusters with GPUDirect RDMA over InfiniBand.
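To give a sense of what distributed training looks like from the application side on such a cluster, here is a minimal PyTorch sketch that initializes the NCCL backend; NCCL transparently uses GPUDirect RDMA over InfiniBand when the fabric is available. The rendezvous environment variables (MASTER_ADDR, RANK, and so on) are assumed to be injected by the launcher, and the model and data are placeholders.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # A launcher such as torchrun (or a Kubernetes training operator)
    # injects RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT.
    dist.init_process_group(backend="nccl", init_method="env://")

    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Placeholder model; substitute your own architecture.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(100):
        x = torch.randn(32, 1024, device=local_rank)  # dummy batch
        loss = model(x).square().mean()               # dummy loss
        optimizer.zero_grad()
        # Gradients are all-reduced by NCCL, which rides the
        # InfiniBand fabric with no application changes required.
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched across nodes with, for example, `torchrun --nnodes=4 --nproc_per_node=8 train.py`, NCCL selects the InfiniBand transport automatically.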
An experiment-oriented MLOps platform featuring hyperparameter search. Determined AI runs experiments in containers on CoreWeave Kubernetes, abstracting the DevOps portions away behind a simple CLI and UI. Determined AI can be deployed with a single click from the CoreWeave application Catalog, and is a great way to run large distributed training jobs with support for most popular frameworks.
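For a flavor of how an experiment plugs into Determined's hyperparameter search, the sketch below uses Determined's Core API; the hyperparameter name `lr`, the loss computation, and the step counts are illustrative placeholders, and the searcher itself is configured in the experiment's separate YAML file.

```python
import determined as det


def train(core_context, hparams):
    lr = hparams["lr"]  # value chosen per-trial by the searcher
    for step in range(1, 101):
        # ... run one real training step with learning rate `lr` ...
        fake_loss = 1.0 / (step * lr)  # placeholder metric
        if step % 10 == 0:
            core_context.train.report_validation_metrics(
                steps_completed=step, metrics={"loss": fake_loss}
            )


def main():
    info = det.get_cluster_info()  # trial metadata from the master
    hparams = info.trial.hparams   # hyperparameters for this trial
    with det.core.init() as core_context:
        train(core_context, hparams)


if __name__ == "__main__":
    main()
```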
Argo Workflows, our favorite workflow runner, can easily be tasked to train or fine-tune a model and automatically deploy an Inference endpoint when finished.
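Argo Workflows are normally written as YAML manifests; purely as an illustrative sketch, the dictionary below describes a hypothetical two-step workflow (train, then deploy) and submits it with the standard Kubernetes Python client. The image names and commands are placeholders.

```python
from kubernetes import client, config

# A hypothetical two-step Argo Workflow: train, then deploy an endpoint.
workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "train-then-deploy-"},
    "spec": {
        "entrypoint": "pipeline",
        "templates": [
            {
                "name": "pipeline",
                "steps": [
                    [{"name": "train", "template": "train"}],
                    [{"name": "deploy", "template": "deploy"}],
                ],
            },
            {
                "name": "train",
                "container": {
                    "image": "example.org/trainer:latest",  # placeholder
                    "command": ["python", "train.py"],
                    "resources": {"limits": {"nvidia.com/gpu": "1"}},
                },
            },
            {
                "name": "deploy",
                "container": {
                    "image": "example.org/deployer:latest",  # placeholder
                    "command": ["python", "deploy_endpoint.py"],
                },
            },
        ],
    },
}

config.load_kube_config()  # uses your CoreWeave kubeconfig
client.CustomObjectsApi().create_namespaced_custom_object(
    group="argoproj.io",
    version="v1alpha1",
    namespace="default",
    plural="workflows",
    body=workflow,
)
```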
Kubeflow Training Operators offer a Kubernetes-native way to run distributed training jobs, supporting TensorFlow, PyTorch Distributed, and any MPI-style framework such as DeepSpeed.
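As a sketch of the Kubernetes-native approach, the Python dictionary below describes a hypothetical Kubeflow PyTorchJob with one master and three workers, each requesting 8 GPUs; the image and command are placeholders. It can be submitted with the same CustomObjectsApi call shown above, using `group="kubeflow.org"` and `plural="pytorchjobs"`.

```python
# A hypothetical PyTorchJob manifest expressed as a Python dictionary.
def replica_spec(replicas):
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {
            "spec": {
                "containers": [{
                    "name": "pytorch",  # name required by the operator
                    "image": "example.org/trainer:latest",  # placeholder
                    "command": ["python", "train.py"],
                    "resources": {"limits": {"nvidia.com/gpu": "8"}},
                }]
            }
        },
    }


pytorch_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "distributed-train"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": replica_spec(1),
            "Worker": replica_spec(3),
        }
    },
}
```

The operator creates the pods, wires up MASTER_ADDR, RANK, and WORLD_SIZE for each replica, and restarts failed workers, so the training code itself can stay identical to the DDP sketch above.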
SLURM is the incumbent job scheduler of the HPC world. Sunk, CoreWeave's implementation of SLURM on Kubernetes, lets you leverage SLURM's job scheduling and resource management on top of CoreWeave Kubernetes.
Note
SLURM support is currently available for reserved instance customers only. Please contact support for more information.
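For reference, SLURM jobs are typically described by an sbatch script. The sketch below embeds a hypothetical two-node, 16-GPU script and submits it from Python; the resource directives depend on your cluster's configuration.

```python
import subprocess

# A hypothetical batch script: 2 nodes with 8 GPUs each. Directive
# values depend on how your Sunk/SLURM cluster is configured.
script = """#!/bin/bash
#SBATCH --job-name=distributed-train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8

srun python train.py
"""

# sbatch accepts the job script on stdin when no file is given.
result = subprocess.run(
    ["sbatch"], input=script, text=True, capture_output=True, check=True
)
print(result.stdout)  # e.g. "Submitted batch job 12345"
```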