Inference
Welcome to Inference on CoreWeave
Machine learning is one of the most popular applications of CoreWeave Cloud's state-of-the-art infrastructure. Models hosted on CoreWeave can be sourced from a range of storage backends including S3-compatible object storage, HTTP, or persistent Storage Volumes.
CoreWeave Cloud's Inference engine autoscales containers based on demand to fulfill user requests, then scales down according to load to preserve GPU resources. Allocating new resources and scaling up a container can be as fast as fifteen seconds for the 6B GPT-J model.
The CoreWeave inference stack
CoreWeave Cloud's inference stack is backed by well-supported Open Source tools:
- Knative Serving (`ksvc`) acts as the serverless runtime, which manages autoscaling, revision control, and canary deployments.
- KServe (`InferenceService`) provides an easy-to-use interface with Kubernetes resource definitions for deploying models without the fuss of configuring the underlying framework, such as TensorFlow.
Knative default parameters
The table below lists the global Knative defaults that have been adjusted by CoreWeave. Other default Knative settings have not been changed. See the Knative documentation for more information.
| Knative Parameter | Value | Description |
| --- | --- | --- |
| `stable-window` | `180s` | The time period over which average concurrency is measured in stable mode |
| `panic-window-percentage` | `13.0` | How the window over which historical data is evaluated shrinks when entering panic mode. For example, a value of `13.0` means that in panic mode the window is 13% of the stable window size |
| `container-concurrency-target-percentage` | `85%` | Scale to keep an average headroom of 15% of available concurrent requests to accommodate bursts |
| `max-scale-up-rate` | `20` | Scale up at a maximum of 20x of current capacity, or 1 container (whichever is larger), per 15 seconds |
| `max-scale-down-rate` | `1.1` | Scale down at a maximum of 10% of current capacity, or 1 container (whichever is larger), per 15 seconds |
| `scale-to-zero-pod-retention-period` | `30m` | If no requests have been received for 30 minutes, the service is scaled to zero and releases all its resources. This behavior can be disabled by setting `minReplicas` to `1` in the service spec. |
| `scale-to-zero-grace-period` | `60s` | The upper bound on how long the system waits for scale-from-zero machinery to be in place before the last replica is removed |
| `scale-down-delay` | `60s` | Containers are only scaled down if a scale-down has been requested over a 60s period. This avoids thrashing. |
If concurrent requests exceed the scaled-for request volume by 200% during a period of 24 seconds, the autoscaler enters "panic mode" and starts scaling containers faster than during the normal 180-second stable window. Some of these settings, such as the stable window, can be modified using annotations on the `InferenceService`.
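For reference, the stable window can be adjusted with the standard Knative autoscaling annotation `autoscaling.knative.dev/window`. The sketch below is illustrative only: the service name is hypothetical, and it assumes annotations set on the `InferenceService` metadata are propagated to the underlying Knative revision.

```yaml
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: example-model                      # hypothetical service name
  annotations:
    # Shorten the stable window from the 180s default to 120s (assumes this
    # annotation is propagated to the underlying Knative revision)
    autoscaling.knative.dev/window: "120s"
spec:
  predictor:
    minReplicas: 0
    maxReplicas: 10
```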
Autoscaling
Autoscaling is enabled by default for all Inference Services.
Autoscaling parameters have been pre-configured for GPU-based workloads, where a large dataset usually needs to be loaded into GPU VRAM before serving can begin. Autoscaling is enabled any time the value of `minReplicas` differs from the value of `maxReplicas` in the `InferenceService` spec. For example:

```yaml
spec:
  predictor:
    minReplicas: 0
    maxReplicas: 10
```
Scale-to-Zero
Inference Services with long periods of idle time can automatically be scaled to zero. When scaled down, the Inference Service consumes no resources and incurs no billing. As soon as a new request comes in, a Pod is instantiated and the request is served. For small models, this can be as quick as five seconds. For larger models, loading times can be 30 to 60 seconds. Model loading times depend on the code responsible for loading the model into the GPU. Many popular PyTorch frameworks do not optimize their loading times.
To enable Scale-to-Zero, set `minReplicas` to `0`. By default, scale-down happens 0 to 30 minutes after the last request was served.

To lower the maximum scale-down delay, add the following annotation to the `InferenceService` block:

```yaml
autoscaling.knative.dev/scale-to-zero-pod-retention-period: "10m"
```

By default, this period is set to 30 minutes.
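For context, a minimal sketch of where this annotation could be placed is shown below. The service name is hypothetical, and the sketch assumes annotations on the `InferenceService` metadata reach the underlying Knative revision.

```yaml
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: example-model                      # hypothetical service name
  annotations:
    # Scale to zero after 10 minutes of inactivity instead of the 30-minute default
    autoscaling.knative.dev/scale-to-zero-pod-retention-period: "10m"
spec:
  predictor:
    minReplicas: 0
    maxReplicas: 10
```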
Storage options
Models deployed with either `InferenceService` or `ksvc` can be served directly from CoreWeave's S3-compatible Object Storage.
Consider using CoreWeave's Accelerated Object Storage for frequently accessed data that doesn't change often, such as model weights and training data. This also provides a fallback option if one region experiences downtime, because data can be pulled across regions and cached where the workloads are located.
Tensorizer can also dramatically reduce the time required to load PyTorch models. Tensorizer pre-processes the models and serves them from Accelerated Object Storage so Knative services start quickly.
To learn more about optimizing model loading times with Object Storage, see Inference Best Practices for Knative Serving.
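As a hedged sketch, a custom predictor container could point a `STORAGE_URI` environment variable at an Object Storage bucket, mirroring the PVC example later on this page; the service name, bucket, and prefix below are placeholders.

```yaml
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: example-model                      # hypothetical service name
spec:
  predictor:
    containers:
      - env:
          # Placeholder bucket and prefix; substitute your Object Storage location
          - name: STORAGE_URI
            value: s3://example-bucket/models/example-model/
```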
Using PVCs
If loading from Object Storage isn't feasible for your workload, `InferenceService` deployments can also serve models from Persistent Volume Claims. This is less efficient than Object Storage, but it's a viable option for certain use cases.

PVCs are supported by `InferenceService`, but not by `ksvc`.
When using a Persistent Volume Claim (PVC), make sure to mount the PVC in the same region as the Pods to reduce network latency.
The model can be written to a PVC from any container deployed with CoreWeave Apps, such as SSH, Jupyter, or FileBrowser. Determined AI MLOps, also deployed from CoreWeave Apps, can write models directly to a persistent volume for use by `InferenceService`.
The example below demonstrates mounting a persistent volume in an `InferenceService`:

```yaml
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata: ...
spec:
  predictor:
    containers:
      - env:
          - name: STORAGE_URI
            value: pvc:///pvc-name/
```
Compute selectors and affinities
Values for all available node types may be found on the Node Types page.
GPU and CPU types are specified using Kubernetes affinities. For example:
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: gpu.nvidia.com/class
              operator: In
              values:
                - Quadro_RTX_5000
```
Billing
For on-demand customers, billing is done on a per-minute basis when containers are running. Scale-to-Zero allows rarely used models to incur no costs, while still making the models available to receive requests.
When an `InferenceService` is scaled to zero due to being idle, it can take between 15 and 60 seconds, depending on the model size, until the first API request is answered. If this is unacceptable, it is recommended to set `minReplicas` to `1`.
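For example, a predictor spec along these lines keeps one replica warm at all times (a sketch; the rest of the `InferenceService` is assumed unchanged):

```yaml
spec:
  predictor:
    minReplicas: 1    # keep at least one replica running to avoid scale-from-zero latency
    maxReplicas: 10
```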
How-to guides and tutorials
The provided inference how-to guides and tutorials are tailored specifically for common use cases with popular models, such as GPT-J and Stable Diffusion. More examples are available in the KServe repository.