Inference on CoreWeave

Welcome to inference on CoreWeave Cloud

While there are many ways to accomplish inference at scale on CoreWeave, all of CoreWeave's inference solutions prioritize two key aspects: cost efficiency and ease of scalability.

This guide explores our solutions to help you determine what works best for your use case.

Common inference questions

To start, here are some common questions and answers regarding inference on CoreWeave Cloud.

Should I use a request queue to hold requests until compute becomes available?

Answer: Probably not.

A request queue is typically employed to ensure each inference request is submitted and then run as GPU resources become available. This method is common on other Cloud platforms; at CoreWeave, however, request queues are typically neither required nor recommended.

Instead, rely on autoscaling to automatically scale the amount of compute needed for the inference task - a simpler, smoother way to handle requests.

If there is a specific reason why a request queue is preferred or required for your workflow, please contact CoreWeave support to discuss your requirements.

Should I use, or continue to use, GPU "bin packing" on CoreWeave?

Answer: Never.

GPU "bin packing" describes a method in which multiple models are loaded into one very large GPU. Sometimes, it also involves swapping different models in and out of a single GPU as needed.

This method is strongly discouraged at CoreWeave. It is not necessary - there are plenty of GPU types and sizes to choose from.

Instead, select the right GPU appropriate for your workload size, then leverage autoscaling to automatically scale resources up and down, including to zero (using Scale to Zero), and so forth. In addition to being a cleaner methodology, this approach also saves money on resource consumption.

👉 Learn more with the GPU Selection Guide

Should I purchase compute in the data center located as close as possible to my end users?

Answer: Not necessarily.

It is very common for clients to aim for compute hosted in data centers located as close as possible to their end users to minimize network latency.

On CoreWeave, given the size of the models run on our GPU types, the time the model takes to generate responses greatly overshadows the added network latency of using a data center located farther from end users. It is therefore generally recommended to prioritize selecting a data center region that houses the compute nodes best suited to your workload size, rather than prioritizing its geographic location.

Visit the CloudPing service to compare ping times between data center locations. Click the "HTTP Ping" button at the top of the page to initiate a ping request to all listed data center locations to compare them directly.

Does CoreWeave cache Docker images?

Answer: Yes.

Docker images are automatically cached on CoreWeave, which avoids pulling large images from external registries each time.

Solutions overview

The following overviews offer brief descriptions of each of CoreWeave's solutions for inference. To learn more about any solution, click the Learn more about... card provided in each section.

Storage

Prior to being loaded, model checkpoints and input data must be stored somewhere - either in a remote storage location, or in a drive local to the compute infrastructure.

  • Where are the models, model checkpoints, and input data being stored?

  • How will models and data be loaded into the inference service?

Product - Best for...

  • CoreWeave Object Storage - storing smaller models, or models serialized with Tensorizer

  • CoreWeave Accelerated Object Storage - storing larger models, especially models serialized with Tensorizer

  • All-NVMe network-attached storage - storing large models, including models serialized with Tensorizer

CoreWeave Object Storage

Best for: Training code, training checkpoints

CoreWeave Object Storage is an S3-compatible solution that allows data to be stored and retrieved in a flexible and efficient way, featuring multi-region support, easy start-up, and simple SDK integrations.

CoreWeave Accelerated Object Storage

Best for: Training code, training checkpoints, model weights

CoreWeave's Accelerated Object Storage is a series of Anycasted, NVMe-backed storage with rack-local proxy caches that deliver blazing-fast download speeds. It is ideal for storing training code, training checkpoints, and model weights.

All-NVMe network-attached storage

Best for: Training code, datasets

The high-performance block storage volumes served from the all-NVMe storage tier are an ideal solution for dataset or training code storage. These virtual disks readily outperform local workstation SSDs, and are scalable up to the petabyte scale.

Presented as generic block devices, they are treated by the operating system as traditional, physically connected storage devices.
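As a sketch of how such a volume is requested, the following is a hypothetical PersistentVolumeClaim against the all-NVMe tier. The storage class name is an assumption - CoreWeave storage classes are region-specific (for example, a `block-nvme-<region>` pattern), so check your cluster for the exact names available to you.

```yaml
# Hypothetical example - verify the storage class name for your region.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: block-nvme-ord1   # assumed region suffix
  resources:
    requests:
      storage: 500Gi                  # block volumes can scale far beyond this
```

Once bound, the claim can be mounted into a Pod like any other Kubernetes volume.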

Serialization

CoreWeave's Tensorizer

CoreWeave's Tensorizer tool is a serializer and deserializer for modules, models, and tensors. It makes it possible to load even large models in less than five seconds, enabling easier, more flexible, and more cost-efficient methods of serving models at scale.

Compute

High-end compute is imperative for serving models, just as it is for training and fine-tuning. CoreWeave specializes in providing several types of high-end GPUs and CPUs for inference.

  • What types and sizes of GPUs run your inference?

Our GPU selection guide and benchmark comparison assist in selecting the best compute for your use case.

GPU and CPU nodes

CoreWeave's entire infrastructure stack is designed with model training and inference in mind. Node hardware is served from high-end data center regions across North America, and is purpose-built for HPC workloads. An extensive selection of high performance GPUs and CPUs are available for model training uses, including NVIDIA HGX H100s.

Note

Node type availability is contingent upon contract type.

HPC Interconnect

Many of CoreWeave's GPUs are enabled with NVIDIA NVLink GPU interconnect. With a special wiring array and software component, NVLink enables high-speed hardware connectivity by leveraging shared pools of memory, allowing GPUs to send and receive data extremely quickly. NVLink provides a significantly faster alternative for connecting multi-GPU systems compared to traditional PCIe-based solutions.

To select a GPU with NVLink capability, look for the node types with NVLink in their titles and labels.

InfiniBand with GPUDirect RDMA

CoreWeave has partnered with NVIDIA in its design of interconnect for A100 HGX training clusters. All CoreWeave A100 NVLink GPUs offer GPUDirect RDMA over InfiniBand, in addition to standard IP/Ethernet networking.

GPUDirect allows GPUs to communicate directly with other GPUs across an InfiniBand fabric, without passing through the host system CPU and operating system kernel, significantly lowering synchronization latency.

SHARP

Traditionally, communication requirements scale proportionally with the number of nodes in an HPC cluster. NVIDIA® Mellanox® Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) moves collective operations from individual nodes into the network. This allows for a flat scaling curve and significantly improved effective interconnect bandwidth.

Autoscaling

Autoscaling is imperative for high availability, ensuring inference services are running when needed, and scaled down when not in use.

  • Are requests queued or batched?

  • How frequently are requests sent to services?

  • How do you handle changes in the demand for the inference service?

  • How important is high-end latency for your product? (For example, if 95% of requests are fast, but 5% are slower, is that acceptable, or not?)

Solution - Description

  • Knative - Handles autoscaling, load balancing, and scaling to zero for your Inference Service.

  • Argo Workflows - Optimal for batch inference, or inference that takes longer than one minute.

Knative for inference

Knative handles autoscaling on CoreWeave to ensure that the right amount of resources is used for a given workload. By using only what is required for the inference service, usage costs are kept low, and manual resource requests are avoided.

Tip

Autoscaling is enabled by default for all inference services.

Note

If inference takes longer than one minute, or you are performing batch inference tasks, it is recommended to use Argo Workflows for inference, rather than Knative.

Scale-to-Zero

Scale-to-Zero is an autoscaling feature that allows the number of GPUs to be scaled to zero, while preserving networking, based on request frequency. When configured, Inference Services with long periods of idle time will automatically be scaled to zero so as not to consume any resources, and thus incur no billing. Configured networking infrastructure remains active, so that as soon as a new request arrives, a new Pod will be instantiated in order to serve the request.
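As a minimal sketch, Scale-to-Zero can be expressed with Knative's standard autoscaling annotations on a Service. The service name and container image below are hypothetical placeholders; the `autoscaling.knative.dev/min-scale` and `max-scale` annotations are standard Knative, though your deployment's exact spec may differ.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-inference-service            # hypothetical name
spec:
  template:
    metadata:
      annotations:
        # Allow the service to scale all the way down when idle
        autoscaling.knative.dev/min-scale: "0"
        autoscaling.knative.dev/max-scale: "10"
    spec:
      containers:
        - image: registry.example.com/my-model:latest   # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: "1"
```

With `min-scale: "0"`, idle periods incur no GPU billing, while the preserved networking routes the first new request to a freshly instantiated Pod.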

Note

Scale-to-Zero is best used in instances where some Services are only occasionally activated.

Knative load balancing

Knative also automatically handles load balancing for Services, removing the need to manually configure and manage load balancers. As requests and usage grow, Knative scales up to meet the increased demand, based on configurable targets (such as percentage of resources used).

The more tolerant you are of high-end latency, the more aggressively you can raise the target utilization percentage, which increases cost effectiveness.
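The utilization target can be tuned through Knative's standard autoscaling annotations. The fragment below is a hypothetical excerpt from a Service template; the annotation names are standard Knative, but the values shown are illustrative only.

```yaml
# Hypothetical fragment of a Knative Service revision template.
# A higher target-utilization-percentage packs more concurrent
# requests onto each replica, trading tail latency for cost.
metadata:
  annotations:
    autoscaling.knative.dev/metric: "concurrency"
    autoscaling.knative.dev/target: "10"
    autoscaling.knative.dev/target-utilization-percentage: "80"
```

Lowering the target or the utilization percentage makes scaling more aggressive, reducing high-end latency at the cost of running more replicas.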

Additional Resources

Learn more about Knative usage for inference in our Inference Best Practices for Knative guide. For a list of Knative parameters on CoreWeave, see Inference: Knative default parameters.

Argo Workflows

Best for: Batch inference, inference that takes over 1 minute

Argo Workflows is a powerful, open-source workflow management system available in the CoreWeave Applications Catalog.

It's used to define, execute, and manage complex, multi-step workflows in a code-based manner. It's developed and maintained as a Cloud Native Computing Foundation (CNCF) Graduated project, and uses the principles of Cloud-native computing to ensure scalability, resiliency, and flexibility.
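As an illustration of the code-based approach, here is a minimal hypothetical Argo Workflow for a batch inference job. The container image and entrypoint script are placeholders; only the `argoproj.io/v1alpha1` Workflow structure is standard Argo.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: batch-inference-        # Argo appends a unique suffix
spec:
  entrypoint: run-inference
  templates:
    - name: run-inference
      container:
        image: registry.example.com/my-model:latest   # hypothetical image
        command: ["python", "run_inference.py"]       # hypothetical entrypoint
        resources:
          limits:
            nvidia.com/gpu: "1"
```

Multi-step pipelines (for example, fetch data, run inference, upload results) are expressed by adding templates and chaining them with Argo's `steps` or `dag` constructs.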

Usage

There are some additional considerations regarding how your Inference Service is, or will be, used. To determine the optimal solutions for your use case, consider:

Security and authentication requirements

  • How do requests reach the inference service?

    • By HTTP/RPC calls?

    • Do requests go across Cloud platforms?

Requests and batching requirements

Additional features

Highly available GPUs

Scaling to high numbers of GPUs is no issue on CoreWeave Cloud - our standard, high-performance GPUs are widely available, all managed by CoreWeave engineers.

Load balancing

Node autoscalers are not required - CoreWeave manages all nodes, 24/7. The only scaling you need to configure for your use case is Pod scaling, which is handled by Kubernetes.

Nydus

Nydus is an external plugin for containerd leveraging the Nydus container image service, which implements a content-addressable filesystem on top of a RAFS format for container images. This formatting allows for major improvements to the current OCI image specification in terms of container launching speed, image space, network bandwidth efficiency, and data integrity. Nydus on CoreWeave Cloud employs a "lazy loading" feature, making Docker image load times even faster.

CoreWeave ML containers

CoreWeave provides optimized container images for machine learning applications, tuned specifically to get the highest performance on the CoreWeave platform.

NCCL tests

CoreWeave supports the NVIDIA Collective Communication Library (NCCL) for powering multi-GPU and multi-node neural network training. NCCL underpins the vast majority of distributed training frameworks, such as DeepSpeed, PyTorch Distributed, and Horovod.

The NCCL tests repository is an open source repository containing NCCL tests, which clients may use to measure their own inference performance.
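One way to run such a measurement on the cluster is as a Kubernetes Job. The manifest below is a hypothetical sketch: the container image (assumed to be built from the nccl-tests repository) is a placeholder, while the `all_reduce_perf` flags shown are the ones documented in that repository.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nccl-allreduce-test
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: nccl-test
          image: registry.example.com/nccl-tests:latest   # hypothetical image
          # -b/-e: start/end message sizes, -f: size growth factor,
          # -g: number of GPUs used on this node
          command: ["./build/all_reduce_perf", "-b", "8", "-e", "4G", "-f", "2", "-g", "8"]
          resources:
            limits:
              nvidia.com/gpu: 8
```

The Job's logs report bus bandwidth per message size, which can be compared across node types and interconnects.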

Need help?

If you've determined the needs for your use case, but still need additional assistance, or aid in building custom configurations for your workflow, please reach out to CoreWeave support.
