Get Started with Inference
Welcome to Inference on CoreWeave Cloud
CoreWeave Cloud's Inference stack is backed by well-supported open source tools:
- Knative Serving acts as the serverless runtime, which manages autoscaling, revision control, and canary deployments.
- KServe provides an easy-to-use interface, via Kubernetes resource definitions, for deploying models without the fuss of correctly configuring the underlying framework (e.g., TensorFlow).
The flow of a request in the Inference engine
The global Knative defaults that CoreWeave has adjusted are listed below. Other Knative settings have not been altered.
- The time period over which average concurrency is measured in stable mode.
- How much the window over which historical data is evaluated shrinks upon entering panic mode. For example, a value of 13.0 means that in panic mode the window is 13% of the stable window size.
- Scale to keep an average headroom of 15% of available concurrent requests to accommodate bursts.
- Scale up at a maximum of 20x current capacity or 1 container (whichever is larger) per 15 seconds.
- Scale down at a maximum of 10% of current capacity or 1 container (whichever is larger) per 15 seconds.
- If no requests have been received for 30 minutes, a service is scaled to zero and uses no resources. This behavior can be disabled by setting minReplicas to 1 in the service spec.
- The upper bound on the time the system waits internally for the scale-from-zero machinery to be in place before the last replica is removed.
- Containers are only scaled down if scale-down has been requested over a 60-second period. This avoids thrashing.
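The arithmetic behind these settings can be sketched in a few lines of Python. This is only an illustration of the descriptions above, not Knative's actual autoscaler code: the function names are hypothetical, the 180-second stable window is the value cited in this section, and rounding the 10% scale-down step up to a whole container is an assumption.

```python
import math

STABLE_WINDOW_S = 180.0         # stable window cited in this section
PANIC_WINDOW_PERCENTAGE = 13.0  # panic window as a percentage of the stable window
MAX_SCALE_UP_FACTOR = 20        # "20x current capacity" per 15-second step
MAX_SCALE_DOWN_PERCENT = 10     # "10% of current capacity" per 15-second step


def panic_window_seconds(stable_window_s: float = STABLE_WINDOW_S,
                         percentage: float = PANIC_WINDOW_PERCENTAGE) -> float:
    """Length of the panic window derived from the stable window."""
    return stable_window_s * percentage / 100.0


def max_replicas_after_step(current: int) -> int:
    """Upper bound on replicas after one scale-up step (20x or 1 container)."""
    return max(current * MAX_SCALE_UP_FACTOR, 1)


def min_replicas_after_step(current: int) -> int:
    """Lower bound on replicas after one scale-down step.

    Assumption: the 10% step is rounded up to a whole container.
    """
    removable = max(math.ceil(current * MAX_SCALE_DOWN_PERCENT / 100), 1)
    return max(current - removable, 0)


if __name__ == "__main__":
    print(panic_window_seconds())        # panic window in seconds
    print(max_replicas_after_step(2))    # ceiling after one scale-up step
    print(min_replicas_after_step(25))   # floor after one scale-down step
```

With a 180-second stable window, 13% works out to roughly the 24-second panic period described in this section.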
If the number of concurrent requests exceeds the current scaled-for request volume by 200% during a period of 24 seconds, the autoscaler enters "panic mode" and starts scaling containers faster than the normal 180-second stable window. Some of these settings, such as the stable window, can be modified using annotations on the
Autoscaling is enabled by default for all Inference Services.
Autoscaling parameters have been pre-configured for GPU-based workloads, where a large dataset usually needs to be loaded into GPU VRAM before serving can begin. Autoscaling is enabled any time the value of
minReplicas differs from the value of
InferenceService spec. For example:
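A minimal sketch of such a spec, written as a Python dict that can be serialized to JSON for `kubectl apply -f`. It assumes KServe's v1beta1 `InferenceService` schema, where `minReplicas` sits on the predictor; pairing it with `maxReplicas` is an assumption taken from that schema rather than from this page. The service name and helper functions are hypothetical, and the model or container settings a real predictor needs are left out.

```python
import json


def inference_service(name: str, min_replicas: int, max_replicas: int) -> dict:
    """Build a minimal InferenceService manifest as a plain dict."""
    if min_replicas < 0 or max_replicas < min_replicas:
        raise ValueError("expected 0 <= min_replicas <= max_replicas")
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                "minReplicas": min_replicas,
                "maxReplicas": max_replicas,  # assumed companion field
                # Model or container settings are omitted from this sketch.
            }
        },
    }


def autoscaling_enabled(manifest: dict) -> bool:
    """Autoscaling applies whenever the two replica bounds differ."""
    predictor = manifest["spec"]["predictor"]
    return predictor["minReplicas"] != predictor["maxReplicas"]


# "example-model" is a hypothetical service name.
manifest = inference_service("example-model", min_replicas=1, max_replicas=5)

if __name__ == "__main__":
    print(json.dumps(manifest, indent=2))
    print(autoscaling_enabled(manifest))
```

Setting both bounds to the same value pins the replica count, so `autoscaling_enabled` returns `False` for that manifest.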
Inference Services with long periods of idle time can automatically be scaled to zero. When scaled down, the Inference Service consumes no resources and incurs no billing. As soon as a new request comes in, a Pod is instantiated and the request is served. For small models, this can be as quick as five seconds. For larger models, spin-up times can be 30 to 60 seconds. Model loading times depend heavily on the code responsible for loading the model into the GPU. Many popular PyTorch frameworks are not optimized for fast loading.
To enable Scale To Zero, simply set
0. By default, scale-down will happen in 0 to 30 minutes after the last request was served.
To set the max scale down delay lower, add the annotation:
InferenceService block. By default, this is set to 30 minutes.
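As a sketch of how these two changes might be applied to a manifest held as a Python dict: the helper below sets `minReplicas` to 0 and adds a retention annotation. The annotation key is Knative's `autoscaling.knative.dev/scale-to-zero-pod-retention-period`, which is assumed here to be the one meant and assumed to be passed through from the `InferenceService` metadata. The helper name, service name, and the `10m` value are hypothetical.

```python
import copy

# Assumed annotation key: Knative's setting for how long the last pod is
# retained after traffic stops.
RETENTION_ANNOTATION = "autoscaling.knative.dev/scale-to-zero-pod-retention-period"


def enable_scale_to_zero(manifest: dict, retention: str = "30m") -> dict:
    """Return a copy of the manifest with scale-to-zero and a retention period."""
    updated = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    predictor = updated.setdefault("spec", {}).setdefault("predictor", {})
    predictor["minReplicas"] = 0
    annotations = updated.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[RETENTION_ANNOTATION] = retention
    return updated


base = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "example-model"},  # hypothetical name
    "spec": {"predictor": {"minReplicas": 1, "maxReplicas": 3}},
}

# Hypothetical choice: retain the last pod for 10 minutes instead of 30.
scaled = enable_scale_to_zero(base, retention="10m")
```

Returning a copy keeps the original manifest available for comparison or rollback.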
For on-demand customers, billing is done on a per-minute basis when containers are running. Scale-to-Zero allows rarely used models to incur no costs, while still making the models available to receive requests.
When an InferenceService is scaled to zero because it has been idle, it usually takes 15 to 60 seconds, depending on the model size, until the first API request is answered. If this is unacceptable, it is recommended to set
Models can be served directly from CoreWeave's S3-compatible Object Storage. For faster container launch times in a production environment, it is recommended to cache the model in a ReadWriteMany persistent volume on CoreWeave storage.
The model can be written to a PVC from any container deployed via CoreWeave Apps, such as SSH, Jupyter, or FileBrowser. The Determined AI MLOps platform can also write models directly to a PVC for use by an InferenceService. Determined AI can also be deployed via the applications Catalog.
For best performance results, avoid including models over 1 GB in size in a Docker container. For larger models, loading from a storage PVC is strongly recommended. For optimal loading speeds, NVMe-backed storage volumes are recommended.
- key: gpu.nvidia.com/class
Our Machine Learning and AI examples and tutorials are tailored specifically for common use cases with popular models, such as GPT-J and Stable Diffusion. Additionally, there are many more examples in the KServe repository that can be used directly in your namespace.