
Inference Best Practices for Knative Serving

Knative Serving, the serverless runtime for CoreWeave Cloud's inference stack, manages autoscaling, revision control, and canary deployments. Follow these best practices when deploying machine learning models on CoreWeave Cloud for the best performance and resource utilization.

Don't package models in containers

Most importantly — do not package machine learning models in the container images.

Knative Serving is designed for rapid scaling, including scaling to zero instances. This works best with small containers that spin up and down quickly. Large models significantly increase the size of containers, making deployments slower. Also, if the model is packed in the container, updating the model requires building a new container image, which makes versioning more complicated.

Instead, use these strategies:

  • Use CoreWeave’s Accelerated Object Storage for frequently accessed data that doesn't change often, such as model weights and training data. This also provides a fallback option if one region experiences downtime, because data can be pulled across regions and cached where the workloads are located.
  • Use Tensorizer to dramatically reduce the time to load PyTorch models. Tensorizer serializes models ahead of time so they can be streamed from Accelerated Object Storage, allowing Knative services to start quickly.

If these methods aren't feasible, consider loading models from a Persistent Volume Claim (PVC), in the same region as the Pods to reduce network latency. This is less efficient than loading from Object Storage with Tensorizer, but it's a viable option for certain use-cases.
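
For reference, here is a minimal sketch of loading a pre-serialized model with Tensorizer, assuming the tensorizer and torch packages are installed; the bucket URI and the TinyModel class are placeholders for a real serialized model and its matching architecture.

import torch
from torch import nn
from tensorizer import TensorDeserializer, stream_io
from tensorizer.utils import no_init_or_tensor

MODEL_URI = "s3://example-bucket/example-model.tensors"  # hypothetical object path

class TinyModel(nn.Module):
    # Stand-in for the real model architecture whose weights were serialized.
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1024, 1024)

    def forward(self, x):
        return self.linear(x)

# Build the model skeleton without allocating or initializing weights,
# then stream the serialized tensors into it directly from object storage.
model = no_init_or_tensor(lambda: TinyModel())
deserializer = TensorDeserializer(stream_io.open_stream(MODEL_URI, "rb"), device="cuda")
deserializer.load_into_module(model)
deserializer.close()
model.eval()

The key point is that the container image ships only code; the weights arrive from Accelerated Object Storage at startup.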

Optimize the container

note

Whenever possible, it is strongly recommended to use CoreWeave’s optimized container images as a starting point for inference containers. These are optimized for performance on CoreWeave infrastructure, and are preloaded with CUDA, PyTorch, and other critical libraries and apps.

Here are some basic tips for optimizing container performance.

  • Eliminate any non-essential files to reduce the container image size and improve startup times. Unless one of CoreWeave's optimized container images is used as the base, use Docker BuildKit with multi-stage builds so that final images contain only essential libraries and, where possible, precompiled wheels.
  • Small containers start up faster, which minimizes over-provisioning of resources and lowers the cost of delivering the service. Fast-starting containers also reduce the number of manual autoscaling adjustments required.
  • Always keep external data close to where it's being processed. Use Accelerated Object Storage and Tensorizer to load large resources. If PVCs or container registries are required, deploy them in the same data center region as the Pods.
  • Include as many GPU-optimizing features as possible in the container's inference engine, such as in-flight batching, tensor parallelism, paged attention, and so on. This will significantly improve the per-request latency and throughput of inference Pods.

important

Avoid loading models and resources from Hugging Face or other external services. These services introduce unpredictable latency and may not have the same uptime guarantees as CoreWeave's production environment.

Note: Some Hugging Face libraries download models and data automatically. Make sure to disable that behavior for production use.

Container startup sequence

When deploying applications on Knative Serving, follow this container startup sequence for optimal performance and reliability. A minimal code sketch follows the list.

  1. Perform pre-initialization tasks such as setting environment variables, reading configuration files, and establishing database connections.
  2. Load machine learning models and other large data sets to ensure they are available when the application starts serving requests.
  3. Start internal services and run self-tests to validate that all components are functioning as expected. Handle any startup errors gracefully, and log them for debugging purposes.
  4. Start the application server and verify it's ready to handle incoming requests.
  5. After everything is ready, signal that the container is ready to the readiness probe. For TCP probes, now is the time to open the port. For HTTP probes, this is the stage to mark the endpoint as ready. See the Probe Configuration section below for more information.
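
The following is a minimal, runnable sketch of this startup order using only the Python standard library; the model load and self-test are stand-ins for real initialization work, and the port and endpoint path are illustrative.

import json
import os
import sys
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

def load_model():
    time.sleep(2)  # stand-in for loading real weights (for example, via Tensorizer)
    return {"name": "demo-model"}

class Handler(BaseHTTPRequestHandler):
    model = None

    def do_GET(self):
        if self.path == "/healthz":  # readiness endpoint for an HTTP probe
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(json.dumps({"model": self.model["name"]}).encode())

def main():
    # 1. Pre-initialization: read configuration from the environment.
    port = int(os.environ.get("PORT", "8080"))

    # 2. Load the model and any large data sets before serving traffic.
    Handler.model = load_model()

    # 3. Self-test; exit non-zero on failure so the container restarts cleanly.
    if not Handler.model.get("name"):
        sys.exit("startup self-test failed: model did not load")

    # 4-5. Only now start the server and bind the port, which is the point at
    # which a TCP readiness probe (or the /healthz endpoint) reports ready.
    ThreadingHTTPServer(("0.0.0.0", port), Handler).serve_forever()

if __name__ == "__main__":
    main()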

Shutdown containers gracefully

To ensure that in-flight requests are not abruptly terminated, a container should initiate a graceful shutdown when it receives a SIGTERM signal.

After receiving the SIGTERM signal, if the application uses an HTTP probe, set the readiness endpoint to a non-ready status. This prevents Kubernetes and Knative from routing new requests to the container while it shuts down.

Continue to accept incoming requests. Knative's Queue-proxy might still be routing requests to the container during the shutdown process. Allow the application to complete any in-flight requests before exiting. There's no need to set a time limit for this draining process.
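
As an illustration, here is a minimal sketch of this shutdown behavior using only the Python standard library, assuming an HTTP readiness probe at /healthz; the port and paths are illustrative.

import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

draining = threading.Event()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Fail the readiness probe as soon as shutdown begins so no new
            # requests are routed here, while in-flight requests keep running.
            self.send_response(503 if draining.is_set() else 200)
            self.end_headers()
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"inference result\n")

server = ThreadingHTTPServer(("0.0.0.0", 8080), Handler)
server.daemon_threads = False  # wait for in-flight request threads on close

def handle_sigterm(signum, frame):
    draining.set()  # readiness probe now reports not-ready
    # shutdown() must run in another thread, because it blocks until
    # serve_forever() returns.
    threading.Thread(target=server.shutdown).start()

signal.signal(signal.SIGTERM, handle_sigterm)
server.serve_forever()   # returns once shutdown() completes
server.server_close()    # joins remaining request threads before exiting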

To ensure all requests have time to complete before the container is terminated, the timeoutSeconds parameter should be 1.2 times the duration of the longest expected request.

For example, if the longest request would take 10 seconds, set the timeoutSeconds parameter to 12 seconds:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld
spec:
  template:
    spec:
      timeoutSeconds: 12

Kubernetes will forcefully terminate non-responsive containers, eliminating the risk of lingering containers.

Handle client cancellation requests promptly

Managing cancellation requests promptly is crucial for optimum Knative Serving performance.

When a client cancels a request — perhaps due to a long wait time or other issues — Knative Serving immediately removes the request from the tracking queue, then reroutes new incoming requests to the same container that was processing the canceled request.

This approach aims to maximize resource utilization and maintain rapid response times. To accommodate this approach, containers must halt the ongoing task immediately and handle new incoming requests without delay. Failure to promptly halt the canceled request will increase latency and timeouts for new incoming requests.

  • Client-side timeouts should be at least three times the expected inference time. The same applies to anything that sits between the client and the service, such as Cloudflare.
  • Leverage server-side frameworks in the application code to manage client-side cancellations effectively. Many server-side frameworks offer built-in mechanisms to handle request cancellations, as in the sketch below.
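
For example, the sketch below checks for client disconnection between generation steps. It assumes a FastAPI/Starlette application; MAX_STEPS and generate_step are illustrative stand-ins for a real decoding loop.

import asyncio
from fastapi import FastAPI, Request

app = FastAPI()
MAX_STEPS = 64

async def generate_step(tokens):
    await asyncio.sleep(0.05)  # stand-in for one model decoding step
    return len(tokens)

@app.post("/generate")
async def generate(request: Request):
    tokens = []
    for _ in range(MAX_STEPS):
        if await request.is_disconnected():
            # The client cancelled: stop the expensive work immediately so this
            # container is free for the next request Knative routes to it.
            return {"cancelled": True, "tokens_generated": len(tokens)}
        tokens.append(await generate_step(tokens))
    return {"tokens": tokens}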

Probe configuration

Liveness and readiness probes play a crucial role in maintaining the health and scalability of Knative services.

Liveness probe

The liveness probe monitors the health of a container. If the container is in a failed or stuck state, a failing liveness probe causes the container to be restarted. Use the liveness probe as a safety net rather than a primary mechanism for error recovery. For non-recoverable errors, it's better to exit the application, which triggers an automatic restart of the container.

Readiness probe

The readiness probe is particularly important for scaling and Knative functionality. Being marked as unready is preferred to failing the liveness probe.

  • TCP probes should not open the port until all components are loaded and the container is fully ready.
  • HTTP probes should not mark the endpoint ready until it's able to serve requests.
  • When adjusting the probe interval, periodSeconds, keep the intervals short. Ideally this is less than the default of 10 seconds.
  • Minimize the total interval during which a non-ready pod could be considered ready. This is important because the pod could receive requests during this period.

Allowing a few failures for the readiness probe is acceptable. However, be aware that this will extend the time a malfunctioning pod could be considered ready.
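
As an illustration, the sketch below separates liveness from readiness in application code. It assumes a FastAPI application; the endpoint paths and the startup delay are placeholders.

import asyncio
from fastapi import FastAPI, Response

app = FastAPI()
state = {"model_loaded": False, "draining": False}

@app.on_event("startup")
async def warm_up():
    await asyncio.sleep(2)  # stand-in for loading model weights
    state["model_loaded"] = True

@app.get("/livez")
def liveness():
    # Liveness is a safety net: healthy as long as the process can respond.
    return Response(status_code=200)

@app.get("/readyz")
def readiness():
    # Readiness reflects whether this Pod should receive traffic right now.
    ready = state["model_loaded"] and not state["draining"]
    return Response(status_code=200 if ready else 503)

The /readyz endpoint is the one to reference from the readiness probe, and it is also the endpoint to flip to not-ready during graceful shutdown.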

Concurrency parameters and batching strategies

Proper configuration of concurrency is critical to optimize Knative Serving. These parameters make significant differences to the application performance.

Hard concurrency

containerConcurrency is the hard concurrency limit — the maximum number of concurrent requests a single container instance can handle.

Perform an accurate assessment of the maximum number of requests the container can handle simultaneously without performance degradation. For GPU-intensive tasks, set containerConcurrency to 1 if there is limited information about its concurrency capabilities.

Do not use this parameter to create a request queue. To handle request bursts, consider implementing batching.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld
spec:
  template:
    spec:
      containerConcurrency: 1

Soft concurrency

autoscaling.knative.dev/target sets the soft concurrency target — the desired number of concurrent requests for each container.

In most cases, the target value should match the containerConcurrency value. This makes the configuration easier to understand and manage. GPU-intensive tasks that set containerConcurrency to 1 should also set target to 1.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target: "1"

Target utilization

As a best practice, adjust target utilization rather than modifying target. This provides a more intuitive way to manage concurrency levels.

Adjust autoscaling.knative.dev/target-utilization-percentage as close to 100% as possible to achieve the most efficient full batches.

  • A good starting point is between 90-95%. Experimentation and adjustment may be needed to optimize this value.
  • Lower percentages (< 100%) increase the likelihood that pods will have to wait longer to fill their batch sizes. However, this provides extra capacity to handle traffic spikes.
  • If pod startup time is fast, higher percentages can be used. Otherwise, adjust the target to find a balance between steady-state latency and spike latency.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target-utilization-percentage: "90"

Batching strategies

Batching is often essential to optimize GPU workloads.

Knative Serving does not inherently support batching. Instead, requests are load-balanced evenly across a Service's Pods. While it's possible to improve batching performance in a Knative service, there are some trade-offs to consider.

Hard and soft concurrency

When implementing batching, soft concurrency should generally be set equal to the batch size, with some exceptions.

Hard concurrency should be at least equal to the batch size to prevent queuing. Keeping it exactly equal to the batch size may increase the time needed to accumulate the next batch, especially in low-traffic or over-scaled scenarios, while setting it higher than the batch size can cause requests to queue in Pods, which increases latency during high-traffic periods. Ideally, hard concurrency should not exceed twice the batch size, and should generally remain within 1.5 times the batch size. For example, with a batch size of 8, a containerConcurrency between 8 and 12 is a reasonable range.

Concurrency considerations with in-flight batching

If your inference engine supports in-flight batching and paged attention, you may find that the KV cache can support a larger batch size and a higher hard concurrency. Particularly in the case of in-flight batching, a higher hard concurrency no longer suffers from waiting for batch accumulation, as requests can come and go as they are processed. In this case, a higher hard concurrency is generally better, but there will still be a tradeoff between concurrency and per-request latency.

If soft and hard concurrency are set too low, resources will be underutilized. If they are set too high, autoscaling may happen too late, when the active Pods are already overloaded. This increases per-request latency, because requests get stuck in batches on overloaded Pods rather than being handled by newly scaled Pods.

It is generally recommended to set the soft concurrency high relative to the hard concurrency, but not so close to it that there is no headroom. This maximizes throughput while keeping a buffer for bursts in traffic: with earlier scaling, fresh Pods are ready to take requests before older Pods near their hard concurrency limit, which prevents existing Pods from being overloaded and requests from being buffered.

Container-level batching

To improve batching at the container level, configure the maximum time the container will wait to accumulate the minimum number of requests needed for a batch. Keep this wait short enough that the sum of the request processing time and the wait does not trigger the service's timeoutSeconds or any other timeouts.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld
spec:
  template:
    spec:
      timeoutSeconds: 300

Knative Serving can achieve better batching performance by carefully adjusting these parameters for the specific workload.
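
One way to implement the batch wait described above is a small in-process micro-batcher. The sketch below uses asyncio; BATCH_SIZE, BATCH_TIMEOUT, and run_model are illustrative stand-ins, and the wait must stay well below the configured timeoutSeconds.

import asyncio

BATCH_SIZE = 8        # maximum requests per batch (illustrative)
BATCH_TIMEOUT = 0.2   # maximum seconds to wait for the batch to fill

async def run_model(inputs):
    # Stand-in for a real batched inference call.
    await asyncio.sleep(0.05)
    return [f"result for {item}" for item in inputs]

class MicroBatcher:
    def __init__(self):
        self.queue: asyncio.Queue = asyncio.Queue()

    async def infer(self, payload):
        # Called once per incoming request; resolves when its batch completes.
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((payload, future))
        return await future

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until the first request arrives, then wait up to
            # BATCH_TIMEOUT for more requests, capped at BATCH_SIZE.
            batch = [await self.queue.get()]
            deadline = loop.time() + BATCH_TIMEOUT
            while len(batch) < BATCH_SIZE:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            results = await run_model([payload for payload, _ in batch])
            for (_, future), result in zip(batch, results):
                future.set_result(result)

async def main():
    batcher = MicroBatcher()
    batch_task = asyncio.create_task(batcher.run())
    print(await asyncio.gather(*(batcher.infer(i) for i in range(10))))

if __name__ == "__main__":
    asyncio.run(main())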

Autoscaling

Autoscaling is a crucial component for managing resources and performance in Knative Serving. The default autoscaling type is concurrency, which is recommended for GPU-based services.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"

Before making any adjustments to the autoscaler, first define the optimization goal, such as reducing latency, minimizing costs, or accommodating spiky traffic patterns. Then, measure the time it takes for Pods to start in the expected scenarios.

For example, deploying a single Pod may have different variance and mean startup times than deploying multiple Pods simultaneously. These metrics provide the baseline for autoscaling configuration.

Modes

The autoscaler has two modes, stable and panic, with separate windows for each mode. Stable mode is used for general operation, while panic mode by default has a much shorter window, used to quickly scale a revision up if a burst of traffic arrives.

Stable mode window

In general, to maintain steady scaling behavior, the stable window should be longer than the average Pod startup time; twice the average startup time is a good starting point. If there is considerable variance in startup times, factor that into the calculation. For example, if Pods take about 20 seconds to start, a 40-second stable window is a reasonable default. For very smooth traffic, increasing the window further can reduce the small scaling events that sometimes introduce latency spikes, especially when running near a 100% target.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/window: "40s"

Panic mode window

The panic window is a percentage of the stable window. Its primary function is to manage sudden, unexpected spikes in traffic. However, setting this parameter requires caution, as it can easily lead to over-scaling, particularly when pod startup times are long.

  • For example, if the stable window is set to 30 seconds and the panic window is configured at 10%, the system will use 3 seconds of data to determine whether to enter panic scaling mode. If Pods typically take 30 seconds to start, the system could continue to scale up while the new pods are still coming online, potentially leading to over-scaling.
  • Don't fine-tune the panic window until after adjusting other parameters and the scaling behavior is stable. This is particularly important when considering a panic window value shorter than, or equal to, the pod startup time. If the stable window is twice the average Pod startup time, start with a panic window value of 50% to balance both regular and spiky traffic patterns.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/panic-window-percentage: "20.0"

Panic mode threshold is the ratio of incoming traffic to the service's capacity. Adjust this based on the spikiness of the traffic and acceptable latency levels. A good initial value is 200% or higher until the scaling behavior is stable. Consider adjusting this further after gathering deployment data.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/panic-threshold-percentage: "150.0"

Scale rates

Scale up and scale down rates control how quickly a service scales in response to traffic patterns. While these rates are generally well-configured by default, there are specific scenarios where adjustments may be beneficial. Always monitor the current behavior before making adjustments to either rate. Any changes to these rates should first be tested in a staging environment to assess their impact. If experiencing issues with scaling, review other autoscaling parameters like the stable window and panic window, as they often interact with the scale up and scale down rates.

Scale up rate

The scale up rate usually doesn't require modification unless encountering specific issues. For instance, if scaling multiple pods simultaneously leads to increased startup times — perhaps due to all pods pulling from an overloaded registry or another external resource — then adjusting the scale up rate may be necessary. If over-scaling is detected, it's likely that other parameters like the stable window or panic window may also need fine-tuning.

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  max-scale-up-rate: "500.0"

Scale down rate

The scale down rate should initially be set to the mean Pod startup time or longer. The default is generally sufficient for most use-cases; however, if cost optimization is a priority and the other autoscaling parameters are well-tuned, consider increasing this rate so the service scales down more quickly after a traffic spike.

The scale down rate is a multiplier; for example, a value of 2.0 allows the system to scale down to half the current number of Pods at the end of a scaling cycle. For services with low Pod counts, a lower scale down rate can help maintain a smoother Pod count, reducing the frequency of scaling events and thereby improving system stability.

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  max-scale-down-rate: "4.0"

Scale bounds

These parameters are highly specific to the deployment's traffic patterns and available resources.

Lower bound

autoscaling.knative.dev/min-scale controls the minimum number of replicas each Revision should have. Knative attempts to never have fewer than this number of replicas at any point in time.

For services that are rarely used and should not scale to zero, a value of 1 is appropriate to guarantee the capacity is always available. Services that expect a lot of burst traffic should set this higher than 1.

To enable scale to zero, set this to 0.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"

Upper bound

autoscaling.knative.dev/max-scale should not exceed the available resources; scaling beyond them leaves Pods pending and unschedulable. Use this parameter primarily to limit maximum expenditure. Choosing a value that is too low can force you to raise it later, which rolls out a new revision and is usually undesirable.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/max-scale: "3"

Initial scale

Configure initial-scale so that a new revision can handle existing traffic adequately. A good starting point is 1.2 times the number of existing pods at the time of the rollout.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/initial-scale: "3"

Handling long-running jobs

Knative Serving is optimized for quick, stateless requests and has built-in timeouts that may terminate long-running requests, resulting in incomplete or failed inferences.

Some machine learning inference requests may exceed two minutes. Consider these options before choosing Knative Serving for long-running jobs.

  • Consider using a different serving solution specifically designed for long-running tasks.
  • Use dedicated resources specifically for long-running tasks to ensure they don't interfere with the performance of other services.
  • For tasks that are not time-sensitive, consider queuing up tasks and processing them in batches to optimize resource utilization.
  • Choose a hybrid approach, using Knative for short-lived tasks and a different solution for long-running tasks, to leverage the best features of each technology.

Regardless of the approach, it's crucial to implement monitoring and alerting to track the performance and health. Use the Grafana Inference Service Overview dashboard to monitor Pod startup delays or revision discrepancies.

Leverage scale to zero capabilities

One of the standout features of Knative is its ability to scale services down to zero when they are not in use.

This is particularly beneficial for customers who manage multiple models. Instead of implementing complex logic within a single service to switch between models dynamically, we recommend deploying each distinct model as its own Knative service.

  • With Knative's scale-to-zero feature, maintaining a model that hasn't been accessed for a while incurs no additional costs.
  • When a client accesses a model for the first time, Knative can scale from zero and be ready to serve requests within approximately 20 seconds. The time it takes to deploy a container is not billed when scaling from zero.

Avoid external queues

Knative comes with its own built-in queuing system for each service. Therefore, we advise against placing external queues in front of a Knative service; doing so can cause client requests to time out before processing begins. To avoid bottlenecks and ensure timely processing, it's more efficient to allow Knative's internal queue to handle the requests.

Use a middleware service

We typically recommend using a middleware service that routes incoming client requests to the appropriate Knative service. A basic NGINX proxy based on URL paths can redirect requests to the corresponding Knative service.

For more complex requirements, such as account tracking, billing, or other custom logic, an advanced middleware service can be developed to perform a variety of tasks while also directing requests to the correct Knative service.
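
As a sketch of this approach, the example below routes requests by model name using FastAPI and httpx; the model names and Knative service hostnames are placeholders.

import httpx
from fastapi import FastAPI, Request, Response

app = FastAPI()

# Map a model name in the URL to the corresponding Knative service (placeholders).
ROUTES = {
    "llama": "http://llama.default.svc.cluster.local",
    "whisper": "http://whisper.default.svc.cluster.local",
}

@app.post("/{model}/{path:path}")
async def route(model: str, path: str, request: Request):
    base = ROUTES.get(model)
    if base is None:
        return Response(status_code=404, content=b"unknown model")
    # Forward the request body to the backing Knative service and relay the reply.
    async with httpx.AsyncClient(timeout=None) as client:
        upstream = await client.post(f"{base}/{path}", content=await request.body())
    return Response(content=upstream.content, status_code=upstream.status_code)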

Adjust scaling for traffic patterns

Define how aggressively a service should scale up or down based on metrics like CPU utilization or request rate. The scaling behavior is controlled by setting various parameters in the Knative Serving configuration.

For services that experience rapid fluctuations in traffic, it's advisable to lower thresholds by 5% to 10%. For example, a service with a smooth scaling traffic pattern might scale up when it reaches 80% CPU utilization, while a service with rapid fluctuations might trigger scaling actions at 70% CPU utilization.

This extra margin gives Knative enough time to spin up additional Pods when the service starts to reach its capacity limits, and ensures that new Pods are ready to handle incoming requests before the existing Pods become overwhelmed.

Balance scale-up and scale-down

It's important to find a balance between how aggressively a service scales up in response to increased load and how quickly it scales down when the load decreases.

When using a very aggressive scale-up policy, use a less aggressive scale-down policy to avoid creating a situation where Pods are frequently being created and destroyed. This constant "bouncing" of Pods can lead to system instability and increased operational overhead.

Use revisions for continuous availability

Knative excels at ensuring uninterrupted service availability, particularly when rolling out new revisions. When deploying new code or making changes to a service, Knative keeps the old revision running until the new one is ready.

This readiness is determined by a configurable percentage threshold. The readiness percentage is a crucial parameter that dictates when Knative should switch traffic from the old revision to the new one. This percentage can be as low as 10% or as high as 100%, depending on the requirements. For a non-production environment that doesn't require the old service, set the readiness percentage to 100%. This will immediately shut down the old revision when the new one is ready.

Controlling how quickly an old revision should be scaled down is particularly important when operating at a large scale with limited Pod availability to avoid deadlock.

If a service runs on 500 Pods and only 300 extra Pods are available, setting an aggressive readiness percentage can lead to a deadlock. If the readiness percentage is set to 100%, the new revision can't reach readiness due to insufficient resources, and the old revision won't shut down.

To resolve the deadlock, manually delete the old Revision; its Pods will then shut down.

Use resource selectors

In some cases, the preferred GPU type may not be available in the desired region. Setting a fallback node affinity allows Knative to use an alternative GPU type, or an alternate region.

For example, if the model works with a 24GB or 16GB GPU, but an A40 is preferred for speed, it's possible to choose an A5000 as the fallback option to ensure the service is available even when the preferred GPU is unavailable.

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu.nvidia.com/class
                operator: In
                values:
                  - A40
                  - A5000

If the GPU type is more important than the region where the Pods run, set a region fallback.

This example prefers CoreWeave's LGA1 data center, but will fall back to ORD1 if no A40s are available in LGA1.

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: gpu.nvidia.com/class
                operator: In
                values:
                  - A40
        - weight: 80
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/region
                operator: In
                values:
                  - LGA1
        - weight: 60
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/region
                operator: In
                values:
                  - ORD1

Additional resources

Following these best practices will help optimize your inference services on CoreWeave Cloud for performance, scalability, and cost-efficiency. Learn more about Knative with our guides below, and if you have more questions, please contact our support team.