Inference on CoreWeave
Welcome to inference on CoreWeave Cloud.
While there are many ways to run inference at scale on CoreWeave, all of CoreWeave's inference solutions prioritize two key aspects: cost efficiency and ease of scalability.
This guide explores our solutions to help you determine what works best for your use case.
Common inference questions
To start, here are some common questions and answers regarding inference on CoreWeave Cloud.
Should I use a request queue to hold requests until compute becomes available?
Answer: ⚠️ Probably not.
A request queue is typically employed to ensure each inference request is submitted and then run as GPU resources become available. This approach is common on other Cloud platforms; at CoreWeave, however, request queues are typically neither required nor recommended.
Instead, rely on autoscaling to automatically provision the compute needed to run each inference task, for a simpler, smoother way of handling incoming requests.
If there is a specific reason as to why a request queue is preferred or required for your workflow, please contact CoreWeave support to discuss your requirements.
Should I use, or continue to use, GPU "bin packing" on CoreWeave?
Answer: 🚫 Never.
GPU "bin packing" describes a method in which multiple models are loaded into one very large GPU. Sometimes, it also involves swapping different models in and out of a single GPU as needed.
This method is strongly discouraged at CoreWeave. It is not necessary - there are plenty of GPU types and sizes to choose from.
Instead, select the GPU appropriate for your workload size, then leverage autoscaling to automatically scale resources up and down, including down to zero (using Scale to Zero). In addition to being a cleaner methodology, this approach also saves money on resource consumption.
Should I purchase compute in the data center located as close as possible to my end users?
Answer: ⚠️ Not necessarily.
It is very common for clients to aim for compute hosted in data centers located as close as possible to their end users to minimize network latency.
On CoreWeave, given the size of the models running on the GPU types we offer, the time the model takes to generate a response greatly overshadows the added network latency of a data center located farther from end users. It is therefore generally recommended to prioritize the data center region that houses the compute nodes best suited to your workload size, rather than its geographic location.
Visit the CloudPing service to compare ping times between data center locations. Click the "HTTP Ping" button at the top of the page to initiate a ping request to all listed data center locations and compare them directly.
Does CoreWeave cache Docker images?
Answer: ✅ Yes.
Docker images are automatically cached on CoreWeave, so large images do not need to be pulled from external registries on every launch.
Solutions overview
The following overviews offer brief descriptions of each of CoreWeave's solutions for inference.
Storage
Prior to being loaded, model checkpoints and input data must be stored somewhere - either in a remote storage location, or on a drive local to the compute infrastructure. When choosing a storage solution, consider:
- Where are the models, model checkpoints, and input data being stored?
- How will models and data be loaded into the inference service?
| Product | Best for... |
| --- | --- |
| CoreWeave Object Storage | ...storing smaller models, or models serialized with Tensorizer |
| CoreWeave Accelerated Object Storage | ...storing larger models, especially models serialized with Tensorizer |
| All-NVMe network-attached storage | ...storing large models, including models serialized with Tensorizer |
CoreWeave Object Storage
Best for: Training code, training checkpoints
CoreWeave Object Storage is an S3-compatible solution that allows data to be stored and retrieved in a flexible and efficient way, featuring multi-region support, easy start-up, and simple SDK integrations.
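Because Object Storage is S3-compatible, any standard S3 SDK or tool can be pointed at it. As a minimal sketch using Python and boto3 (the endpoint URL, credentials, bucket name, and object keys below are placeholders, not actual CoreWeave values):

```python
import boto3

# Placeholder endpoint and credentials; substitute your own CoreWeave
# Object Storage endpoint, access key, and secret key.
s3 = boto3.client(
    "s3",
    endpoint_url="https://object-storage.example.coreweave.com",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# Upload a serialized model once, then pull it down on the serving side.
s3.upload_file("model.tensors", "my-models", "llm/model.tensors")
s3.download_file("my-models", "llm/model.tensors", "/mnt/models/model.tensors")
```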
CoreWeave Accelerated Object Storage
Best for: Training code, training checkpoints, model weights
CoreWeave Accelerated Object Storage is a series of Anycasted, NVMe-backed storage endpoints that provide rack-local proxy caches for blazing-fast download speeds. It is ideal for storing training code, training checkpoints, and model weights.
All-NVMe network-attached storage
Best for: Training code, datasets
The high-performance block storage volumes served from the all-NVMe storage tier are an ideal solution for dataset or training code storage. These virtual disks readily outperform local workstation SSDs, and scale up to the petabyte range.
Presented as generic block devices, they are treated by the operating system as traditional, physically connected storage devices.
Serialization
CoreWeave's Tensorizer
CoreWeave's Tensorizer tool is a serializer and deserializer for modules, models, and tensors, which makes it possible to load even large models in less than five seconds for easier, more flexible, and more cost-efficient methods of serving models at scale.
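As a minimal sketch of the round trip, based on the open-source `tensorizer` Python package (the model ID and output path below are illustrative; the deserializer also accepts remote URIs such as object storage paths):

```python
from tensorizer import TensorSerializer, TensorDeserializer
from tensorizer.utils import no_init_or_tensor
from transformers import AutoConfig, AutoModelForCausalLM

MODEL_ID = "gpt2"           # illustrative model
OUT_PATH = "model.tensors"  # illustrative local path

# --- Serialize once, after training or conversion ---
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
serializer = TensorSerializer(OUT_PATH)
serializer.write_module(model)
serializer.close()

# --- Deserialize at serving time ---
# Build the module skeleton without materializing weights, then stream the
# serialized tensors straight into it.
config = AutoConfig.from_pretrained(MODEL_ID)
with no_init_or_tensor():
    empty_model = AutoModelForCausalLM.from_config(config)

deserializer = TensorDeserializer(OUT_PATH)
deserializer.load_into_module(empty_model)
deserializer.close()
```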
Compute
High-end compute is imperative for inference, just as it is for model training and fine-tuning. CoreWeave specializes in providing several types of high-end GPUs and CPUs for inference.
- What types and sizes of GPUs run your inference?
| Solution | See also |
| --- | --- |
| CoreWeave GPUs and CPUs | Our GPU selection guide and benchmark comparison assist in selecting the best compute for your use case. |
GPU and CPU nodes
CoreWeave's entire infrastructure stack is designed with model training and inference in mind. Node hardware is served from high-end data center regions across North America and is purpose-built for HPC workloads. An extensive selection of high-performance GPUs and CPUs, including NVIDIA HGX H100s, is available for training and inference.
Node type availability is contingent upon contract type.
HPC Interconnect
| Solution | Description |
| --- | --- |
| NVLink | NVLink enables high-speed hardware connectivity by leveraging shared pools of memory, allowing GPUs to send and receive data extremely quickly. |
| InfiniBand with GPUDirect RDMA | GPUDirect allows GPUs to communicate directly with other GPUs across an InfiniBand fabric, without passing through the host system CPU and operating system kernel, significantly lowering synchronization latency. |
| SHARP | SHARP allows for a flat scaling curve and significantly improved effective interconnect bandwidth. |
NVLink
Many of CoreWeave's GPUs are enabled with NVIDIA NVLink GPU interconnect. With a special wiring array and software component, NVLink enables high-speed hardware connectivity by leveraging shared pools of memory, allowing GPUs to send and receive data extremely quickly. NVLink provides a significantly faster alternative for connecting multi-GPU systems compared to traditional PCIe-based solutions.
To select a GPU with NVLink capability, look for node types with NVLink in their titles and labels.
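As a quick sanity check from inside a Pod, a short PyTorch snippet (assuming a CUDA-enabled PyTorch install) can confirm that peer-to-peer access is available between the GPUs you have been allocated; for full link-level detail, `nvidia-smi topo -m` on the node distinguishes NVLink from PCIe paths:

```python
import torch

# Report whether each pair of visible GPUs can access each other's memory
# directly (peer-to-peer), which NVLink-connected GPUs support.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'enabled' if ok else 'unavailable'}")
```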
InfiniBand with GPUDirect RDMA
CoreWeave has partnered with NVIDIA in its design of interconnect for A100 HGX training clusters. All CoreWeave A100 NVLink GPUs offer GPUDirect RDMA over InfiniBand, in addition to standard IP/Ethernet networking.
GPUDirect allows GPUs to communicate directly with other GPUs across an InfiniBand fabric, without passing through the host system CPU and operating system kernel, significantly lowering synchronization latency.
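As a hedged sketch of how this is typically surfaced in practice: NCCL (used by most multi-GPU inference and training frameworks) selects the GPUDirect RDMA path automatically when it is available, and its startup logs can confirm it. The environment variables below are standard NCCL settings, but exact values depend on your cluster image and fabric, and CoreWeave's ML container images generally ship with sensible defaults already in place.

```python
import os

# Must be set before the first NCCL collective runs (for example, before
# torch.distributed initializes the "nccl" backend).
os.environ.setdefault("NCCL_DEBUG", "INFO")    # log transport selection at startup
os.environ.setdefault("NCCL_IB_DISABLE", "0")  # keep the InfiniBand transport enabled

# With NCCL_DEBUG=INFO, the startup logs emitted by the first collective
# (see the all-reduce sketch in the NCCL tests section below) report whether
# GPUDirect RDMA is used for inter-node traffic.
```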
SHARP
Traditionally, communication requirements scale proportionally with the number of nodes in an HPC cluster. NVIDIA® Mellanox® Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) moves collective operations from individual nodes into the network. This allows for a flat scaling curve and significantly improved effective interconnect bandwidth.
Autoscaling
Autoscaling is imperative for high availability, ensuring inference services are running when needed, and scaled down when not in use.
- Are requests queued or batched?
- How frequently are requests sent to services?
- How do you handle changes in the demand for the inference service?
- How important is high-end latency for your product? (For example, if 95% of requests are fast, but 5% are slower, is that acceptable, or not?)
| Solution | Description |
| --- | --- |
| Knative for inference | Knative handles autoscaling, load balancing, and scaling to zero for your Inference Service. |
| Argo Workflows | Optimal for batch inference, or for inference that takes longer than one minute. |
Knative for inference
Knative handles autoscaling on CoreWeave to ensure that the right amount of resources is used for a given workload. By using only what the inference service requires, usage costs are kept low and manual resource requests are largely unnecessary.
Autoscaling is enabled by default for all inference services.
If inference takes longer than one minute, or you are performing batch inference tasks, it is recommended to use Argo Workflows for inference, rather than Knative.
Scale-to-Zero
Scale-to-Zero is an autoscaling feature that allows the number of GPUs to be scaled to zero, while preserving networking, based on request frequency. When configured, Inference Services with long periods of idle time will automatically be scaled to zero so as not to consume any resources, and thus incur no billing. Configured networking infrastructure remains active, so that as soon as a new request arrives, a new Pod will be instantiated in order to serve the request.
Scale-to-Zero is best used in instances where some Services are only occasionally activated.
Knative load balancing
Knative also automatically handles load balancing for Services, removing the need to manually configure and manage load balancers. As requests and usage grow, Knative scales up to meet the increased demand, based on configurable targets (such as percentage of resources used).
The more tolerant you are of high-end latency, the more aggressively you can raise the target utilization percentage, which increases cost effectiveness.
Learn more about Knative usage for inference in our Inference Best Practices for Knative guide. For a list of Knative parameters on CoreWeave, see Inference: Knative default parameters.
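As a hedged sketch of what tuning these parameters can look like (the Service name, namespace, and values below are illustrative, and the exact annotation names and defaults that apply on CoreWeave are listed on the Knative default parameters page), the Knative Service behind an inference deployment can be patched with autoscaling annotations using the Kubernetes Python client:

```python
from kubernetes import client, config

config.load_kube_config()

# Revision-template annotations control Knative's autoscaler: minScale "0"
# permits Scale-to-Zero, maxScale caps replicas, and target sets the desired
# concurrent requests per replica.
patch = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "autoscaling.knative.dev/minScale": "0",
                    "autoscaling.knative.dev/maxScale": "10",
                    "autoscaling.knative.dev/target": "4",
                }
            }
        }
    }
}

client.CustomObjectsApi().patch_namespaced_custom_object(
    group="serving.knative.dev",
    version="v1",
    namespace="tenant-example",   # illustrative namespace
    plural="services",
    name="my-inference-service",  # illustrative Service name
    body=patch,
)
```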
Argo Workflows
Best for: Batch inference, inference that takes over 1 minute
Argo Workflows is a powerful, open-source workflow management system available in the CoreWeave Applications Catalog.
It's used to define, execute, and manage complex, multi-step workflows in a code-based manner. It's developed and maintained as a Cloud Native Computing Foundation (CNCF) Graduated project, and uses the principles of Cloud-native computing to ensure scalability, resiliency, and flexibility.
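As a minimal, hedged sketch of what a single-step batch-inference Workflow might look like when submitted with the Kubernetes Python client (the image, command, namespace, and GPU resource request below are placeholders to adapt to your own job):

```python
from kubernetes import client, config

config.load_kube_config()

# A one-step Workflow that runs a containerized batch-inference script on a GPU.
workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "batch-inference-"},
    "spec": {
        "entrypoint": "run-inference",
        "templates": [
            {
                "name": "run-inference",
                "container": {
                    "image": "ghcr.io/example/my-inference:latest",  # placeholder image
                    "command": ["python", "batch_infer.py"],         # placeholder command
                    "resources": {"limits": {"nvidia.com/gpu": "1"}},
                },
            }
        ],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="argoproj.io",
    version="v1alpha1",
    namespace="tenant-example",  # illustrative namespace
    plural="workflows",
    body=workflow,
)
```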
Usage
There are some additional considerations regarding how your Inference Service is, or will be, used. To determine the optimal solutions for your use case, consider:
Security and authentication requirements
- How do requests reach the inference service?
- By HTTP/RPC calls?
- Do requests go across Cloud platforms?
Requests and batching requirements
- Do requests go into a queue first?
- Is batching used?
- If batching is employed, please refer to the Knative for Inference Best Practices guide section on concurrency parameters and batching strategies to ensure high performance.
- What is the average response time of the request?
- If inference takes longer than one minute, or you are performing batch inference tasks, it is strongly recommended to use Argo Workflows for inference instead of Knative.
Additional features
Highly available GPUs
Scaling to high numbers of GPUs is no issue on CoreWeave Cloud - our standard, high-performance GPUs are widely available, and all nodes are managed by CoreWeave engineers.
Load balancing
Node autoscalers are not required - CoreWeave manages all nodes, 24/7. The only scaling you need to configure for your use case is Pod scaling, which is also configured and handled by Kubernetes.
Nydus
Nydus is an external plugin for containerd that leverages the Nydus container image service, which implements a content-addressable filesystem on top of the RAFS format for container images. This format brings major improvements over the current OCI image specification in terms of container launch speed, image space, network bandwidth efficiency, and data integrity. Nydus on CoreWeave Cloud employs a "lazy loading" feature, making Docker image load times even faster.
CoreWeave ML containers
CoreWeave provides optimized container images for machine learning applications, tuned specifically to get the highest performance on the CoreWeave platform.
NCCL tests
CoreWeave supports the NVIDIA Collective Communication Library (NCCL) for powering multi-GPU and multi-node neural network training. NCCL underpins the vast majority of distributed training frameworks, such as DeepSpeed, PyTorch Distributed, and Horovod.
The NCCL tests repository is an open-source repository containing NCCL tests, which clients may use to measure collective communication performance in their own environments.
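The nccl-tests binaries themselves are C/CUDA programs, but a rough Python analog of the `all_reduce_perf` test - timing a large all-reduce over the NCCL backend - can serve as a quick sanity check. This sketch assumes a CUDA-enabled PyTorch install and is meant to be launched with `torchrun --nproc_per_node=<GPUs per node>`:

```python
import time
import torch
import torch.distributed as dist

# torchrun supplies RANK, WORLD_SIZE, and MASTER_ADDR/PORT via the environment.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# 1 GiB of float32 data per rank.
tensor = torch.ones(256 * 1024 * 1024, device="cuda")

# Warm up, then time repeated all-reduce operations.
for _ in range(5):
    dist.all_reduce(tensor)
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(tensor)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / iters

if rank == 0:
    gib = tensor.numel() * tensor.element_size() / 2**30
    print(f"all_reduce of {gib:.1f} GiB: {elapsed * 1e3:.1f} ms per iteration")

dist.destroy_process_group()
```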
Need help?
If you've determined the needs for your use case, but still need additional assistance, or aid in building custom configurations for your workflow, please reach out to CoreWeave support.