Get Started with ML and AI

Learn what makes CoreWeave special for machine learning and AI applications
Machine learning is one of the most popular applications of CoreWeave Cloud's state-of-the-art, purpose-built infrastructure. Models are easily hosted on CoreWeave and can be sourced from a range of storage backends, including S3-compatible object storage, HTTP, or persistent Storage Volumes.
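For example, pulling model weights from S3-compatible object storage is an ordinary S3 operation. The sketch below uses boto3; the endpoint, bucket, keys, and credentials are illustrative placeholders, not real CoreWeave values.

```python
# A minimal sketch: download model weights from S3-compatible object
# storage onto local disk. Endpoint, bucket, key, and credentials are
# illustrative placeholders for your tenant's actual values.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://object-storage.example.com",  # hypothetical endpoint
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# Fetch a single weights file from a hypothetical "models" bucket.
s3.download_file("models", "gpt-j-6b/pytorch_model.bin", "/models/pytorch_model.bin")
```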

CoreWeave infrastructure for ML and AI

🖥
Virtual Servers

Virtual Servers are the most "vanilla" of CoreWeave's compute offerings. They are great for experimentation with a few GPUs from a familiar environment; however, administrative and performance overheads make them less desirable for distributed tasks.
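As a quick illustration of that familiar environment, a GPU sanity check from inside a Virtual Server might look like the following, assuming PyTorch is installed in the image:

```python
# Verify that the Virtual Server's GPUs are visible to PyTorch.
import torch

print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
```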

Kubernetes

Our Kubernetes offering differs from that of most other leading cloud providers: it is a fully managed cluster, pre-populated with thousands of GPUs. Kubernetes access gives experienced MLOps teams the power to deploy their own stacks in a bare-metal container environment. The environment fully supports massive distributed training on our NVIDIA A100 HGX clusters with GPUDirect RDMA over InfiniBand. And there is no need to worry about cluster scaling or idle virtual machines incurring costs: charges accrue only for what is actually used.
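As a rough sketch of what deploying onto the managed cluster looks like, the following uses the official Kubernetes Python client to request GPUs for a training Pod. The image, namespace, and resource figures are illustrative assumptions, not CoreWeave defaults.

```python
# A minimal sketch, assuming kubeconfig access to a CoreWeave namespace;
# the image, namespace, and resource requests are illustrative.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="a100-training-job"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="ghcr.io/example/trainer:latest",  # hypothetical image
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(
                    # GPUs are requested like any other resource; billing
                    # follows actual usage, so there are no idle VMs to manage.
                    limits={"nvidia.com/gpu": "8", "cpu": "32", "memory": "256Gi"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="tenant-example", body=pod)
```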

💪
NVIDIA HGX H100

The NVIDIA HGX H100 delivers up to seven times higher performance for high-performance computing (HPC) applications, up to nine times faster AI training on large models, and up to thirty times faster AI inference than the NVIDIA HGX A100.
This speed, combined with the lowest NVIDIA GPUDirect network latency on the market via the NVIDIA Quantum-2 InfiniBand platform, reduces the training time of AI models to "days or hours, instead of months." With AI permeating nearly every industry today, this speed and efficiency have never been more vital for HPC applications.

Sunk: SLURM on Kubernetes

SLURM is the incumbent job scheduler of the HPC world. Sunk, CoreWeave's implementation of SLURM on Kubernetes, brings SLURM's scheduling and resource management to the Kubernetes environment.
Note
SLURM support is currently available for reserved instance customers only. Please contact support for more information.
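For customers with Sunk enabled, job submission looks like standard SLURM. The sketch below writes a batch script and submits it with sbatch; the job name, resource requests, and script contents are illustrative assumptions.

```python
# A minimal sketch of submitting a job on a Sunk (SLURM on Kubernetes)
# cluster; resource requests and the training script are illustrative.
import subprocess
import tempfile

script = """#!/bin/bash
#SBATCH --job-name=train-example
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --time=04:00:00
srun python train.py
"""

with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as f:
    f.write(script)
    path = f.name

# sbatch prints the assigned job ID on success.
print(subprocess.run(["sbatch", path], capture_output=True, text=True).stdout)
```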

Model training on CoreWeave

Training Machine Learning models, especially modern Deep Neural Networks, is at the center of CoreWeave Cloud's architecture. The entire CoreWeave Cloud stack is purpose-built to enable highly scalable, cost-efficient model training.
In addition to its core tech stack, CoreWeave has a history of supporting our customers with cutting-edge Machine Learning research through in-house expertise, industry partnerships, and contributions to research organizations. Our team has extensive experience in training large transformer architectures, optimizing Hugging Face code, and selecting the right hardware for any given job.
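To give a flavor of the kind of Hugging Face workload this supports, the sketch below fine-tunes a small stand-in causal language model. The model and dataset choices are illustrative, not a CoreWeave-specific recipe.

```python
# A minimal sketch of a small causal-LM fine-tune with the Hugging Face
# Trainer; "gpt2" and wikitext are stand-ins, not a CoreWeave recipe.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="checkpoints",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    bf16=True,  # mixed precision on A100/H100-class GPUs
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    # The causal-LM collator copies input_ids to labels (mlm=False).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```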

Inference on CoreWeave

CoreWeave Cloud's inference engine autoscales containers based on demand in order to swiftly fulfill user requests, then scales down according to load to conserve GPU resources. Allocating new resources and scaling up a container can be as fast as fifteen seconds for the 6B GPT-J model. This quick autoscaling allows for a significantly more responsive service than that of other cloud providers.
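From the client's side, calling such an autoscaled endpoint is an ordinary HTTP request. The sketch below assumes a hypothetical completion endpoint URL and payload shape, with a generous timeout to absorb a cold-start scale-up.

```python
# A minimal sketch of calling an autoscaled inference endpoint; the URL
# and payload shape are illustrative, not CoreWeave-specific.
import requests

resp = requests.post(
    "https://gpt-j-6b.tenant-example.inference.example.com/v1/completions",
    json={"prompt": "CoreWeave is", "max_tokens": 32},
    timeout=60,  # allow for a cold start while a new container scales up
)
print(resp.json())
```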