Skip to main content


Welcome to Inference on CoreWeave

Machine learning is one of the most popular applications of CoreWeave Cloud's state-of-the-art infrastructure. Models are easily hosted on CoreWeave, and can be sourced from a range of storage backends including S3-compatible object storage, HTTP, or persistent Storage Volumes.

CoreWeave Cloud's Inference engine autoscales containers based on demand in order to swiftly fulfill user requests, then scales down according to load so as to preserve GPU resources.

Allocating new resources and scaling up a container can be as fast as fifteen seconds for the 6B GPT-J model. This quick autoscale allows for a significantly more responsive service than that of other Cloud providers.

The CoreWeave inference stack

The flow of a request in the Inference engine

CoreWeave Cloud's inference stack is backed by well-supported Open Source tools.

  • Knative Serving acts as the serverless runtime, which manages autoscaling, revision control, and canary deployments.
  • KServe provides an easy to use interface via Kubernetes resource definitions for deploying models without the fuss of correctly configuring the underlying framework (i.e., Tensorflow).

Knative default parameters

The table below lists the global Knative defaults that have been adjusted by CoreWeave, though there are additional Knative settings that have not been altered.

Refer to the Knative documentation for more information.

KNative ParameterValueDescription
stable-window180sThe time period average concurrency is measured over in stable mode
panic-window-percentage13.0Indicates how the window over which historical data is evaluated will shrink upon entering panic mode - for example, a value of 13.0 means that in panic mode, the window will be 13% of the stable window size
container-concurrency-target-percentage85%Scale to keep an average headroom of 15% of available concurrent request to accommodate for bursts
max-scale-up-rate20Scale up at a maximum of 20x of current capacity or 1 container (whichever is larger) per 15 seconds
max-scale-down-rate1.1Scale down at a maximum of 10% of current capacity or 1 container (whichever is larger) per 15 seconds
scale-to-zero-pod-retention-period30mIf no requests have been received for 30 minutes, a service will be scaled to zero and not use any resources. This behavior can be disabled by setting minReplicas to 1 in the service spec
scale-to-zero-grace-period60sThe upper bound time limit that the system will internally wait for scale-from-zero machinery to be in place, before the last replica is removed
scale-down-delay60sContainers are only scaled down if scaled-down has been requested over a 60s period. This is to avoid thrashing.

If concurrent request exceeds the current scaled-for request volume by 200% during a period of 24 seconds, the autoscaler enters "panic mode," and starts scaling containers faster than the normal 180-second stable window. Some of these settings, such as stable window, can be modified using annotations on the InferenceService.


Autoscaling is enabled by default for all Inference Services.

Autoscaling parameters have been pre-configured for GPU-based workloads, where a large dataset usually needs to be loaded into GPU VRAM before serving can begin. Autoscaling is enabled any time the value of minReplicas differs from the value of maxReplicas in the InferenceService spec. For example:

minReplicas: 0
maxReplicas: 10


Inference Services with long periods of idle time can automatically be scaled to zero. When scaled down, the Inference Service will consume no resources and incur no billing. As soon as a new request comes in, a Pod will be instantiated and the request will be served. For small models, this can be as quick as five seconds. For larger models, spin up times can be 30 to 60 seconds. Model loading times are highly dependent on the code responsible of loading the model into the GPU. Many popular PyTorch frameworks do not optimize for optimal loading times.

To enable Scale To Zero, simply set minReplicas to 0. By default, scale-down will happen in 0 to 30 minutes after the last request was served.

To set the max scale down delay lower, add the annotation: "10m"

to the InferenceService block. By default, this is set to 30 minutes.

Model storage

Models can be served directly from CoreWeave's S3-compatible Object Storage. For faster container launch times in a production environment, it is recommended to cache the model in a ReadWriteMany persistent volume on CoreWeave storage.

The model can be written to a PVC from any container deployed via CoreWeave Apps such as SSH, Jupyter, or FileBrowser. The Determined AI MLOps platform can also write models directly to a PVC for usage by an InferenceService. Determined AI can also be deployed via the applications Catalog.

For best performance results, avoid including models over 1GB in size in a Docker container. For larger models, loading from a storage PVC is strongly recommended. For optimal loading speeds, NVMe-backed storage volumes are recommended.

Compute selectors and affinities

Additional Resources

All values for all available node types may be found on the Node Types page.

GPU and CPU types are specified using Kubernetes affinities. For example:

- matchExpressions:
- key:
operator: In
- Quadro_RTX_5000


For on-demand customers, billing is done on a per-minute basis when containers are running. Scale-to-Zero allows rarely used models to incur no costs, while still making the models available to receive requests.


When an InferenceService is scaled to zero due to being idle, it usually takes 15-60 seconds depending on the model size until the first API request is answered. If this is unacceptable, it is recommended to set minReplicas to 1.

How-to guides and tutorials

The provided inference how-to guides and tutorials are tailored specifically for common use cases with popular models, such as GPT-J and Stable Diffusion. Additionally, there are many more examples in the KServe repository, which can be used directly in your namespace.