CoreWeave Inference supports autoscaling to match your workload demands. This page explains how to configure autoscaling and reserve GPU capacity for your inference deployments. For pricing details, see Billing.
Autoscaling
Each deployment can be configured with autoscaling parameters that control how replicas scale in response to demand. CoreWeave manages the autoscaling logic: it monitors request load and adjusts replica count within your configured bounds.
Scaling parameters
The `autoscaling` field on a deployment controls scaling behavior:
| Field | Required | Description |
|---|---|---|
| `min` | Yes | Minimum number of replicas that are always running. Must be at least 1. Scale-to-zero is not supported. |
| `max` | Yes | Maximum number of replicas. Must be greater than or equal to `min`. To disable autoscaling, set `max` equal to `min`. |
| `priority` | No | Scaling priority relative to other deployments, from 0 to 1000. Higher values receive scaling preference during resource contention. |
| `concurrency` | No | Target concurrent requests per replica. Controls the latency-throughput tradeoff: lower values reduce latency, higher values increase throughput. Must be at least 1. |
| `capacityClasses` | No | The capacity types to use for scaling. Options: `CAPACITY_CLASS_RESERVED`, `CAPACITY_CLASS_ON_DEMAND`. |
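To see how these fields fit together, here is a minimal sketch of an `autoscaling` block as a Python dict, with a helper that checks the documented constraints. The values and the `validate_autoscaling` helper are illustrative, not part of the API.

```python
# Illustrative autoscaling block, mirroring the fields in the table
# above. Values are examples only; the surrounding deployment payload
# is omitted.
autoscaling = {
    "min": 2,            # always keep two replicas warm (scale-to-zero is not supported)
    "max": 8,            # never run more than eight billed replicas
    "priority": 500,     # optional: 0-1000, higher scales first under contention
    "concurrency": 4,    # optional: target concurrent requests per replica
    "capacityClasses": ["CAPACITY_CLASS_RESERVED", "CAPACITY_CLASS_ON_DEMAND"],
}

def validate_autoscaling(cfg: dict) -> None:
    """Check the documented constraints before submitting a deployment."""
    if cfg["min"] < 1:
        raise ValueError("min must be at least 1; scale-to-zero is not supported")
    if cfg["max"] < cfg["min"]:
        raise ValueError("max must be greater than or equal to min")
    if not 0 <= cfg.get("priority", 0) <= 1000:
        raise ValueError("priority must be between 0 and 1000")
    if cfg.get("concurrency", 1) < 1:
        raise ValueError("concurrency must be at least 1")

validate_autoscaling(autoscaling)
```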
How autoscaling works
CoreWeave monitors request queue depth and active request counts across your deployment’s replicas. When demand exceeds the current capacity, the autoscaler adds replicas up to max. When demand drops, it removes replicas down to min.
Scaling priority (`priority`) determines which deployments scale first when multiple deployments compete for the same GPU resources. A deployment with `priority: 1000` scales before one with `priority: 100`.
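The ordering can be illustrated with a short sketch. This is not CoreWeave's scheduler, only the documented preference that higher `priority` values scale first; the deployment names are hypothetical.

```python
# Illustrative only: rank contending deployments the way the autoscaler
# prefers them (higher priority scales first).
deployments = [
    {"name": "batch-summarizer", "priority": 100},
    {"name": "chat-frontend", "priority": 1000},
    {"name": "eval-runner", "priority": 500},
]

for d in sorted(deployments, key=lambda d: d["priority"], reverse=True):
    print(f"{d['name']} (priority {d['priority']})")
# chat-frontend scales first, then eval-runner, then batch-summarizer
```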
Best practices
Follow these guidelines to optimize autoscaling for your workloads.
- Set `min` based on latency requirements. A higher minimum avoids cold-start delays when new requests arrive during low-traffic periods. Each replica must load model weights before it can serve requests.
- Set `max` to control cost. Each replica consumes GPU resources that are billed. Set the maximum to the highest replica count your budget allows.
- Match GPU type to model size. Choose an instance type with enough GPU memory to fit your model weights and the inference runtime’s working memory. Over-provisioning wastes resources, while under-provisioning causes out-of-memory failures.
- Use `concurrency` to tune latency. For latency-sensitive workloads, set a lower concurrency target so the autoscaler adds replicas sooner. For throughput-oriented workloads, set a higher value to maximize GPU utilization per replica. The sketch after this list contrasts the two.
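The hypothetical configurations below contrast a latency-sensitive autoscaling block with a throughput-oriented one. Field names follow the scaling-parameters table; the values are illustrative, not recommendations.

```python
# Latency-sensitive: more warm replicas, and a low concurrency target
# so the autoscaler adds replicas before requests start queueing.
latency_sensitive = {
    "min": 4,
    "max": 16,
    "concurrency": 2,
}

# Throughput-oriented: fewer warm replicas, each packed with more
# concurrent requests to maximize GPU utilization per replica.
throughput_oriented = {
    "min": 1,
    "max": 8,
    "concurrency": 16,
}
```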
Capacity claims
Autoscaling adjusts replica count within available resources, but does not guarantee that GPU capacity is available. For workloads that require guaranteed GPU capacity, create capacity claims to reserve hardware resources. Capacity claims ensure that infrastructure is available for your deployments, even during periods of high demand.
How capacity claims work
A capacity claim reserves a specified number of GPU instances in a given zone. When you create a deployment with `capacityClasses` set to `CAPACITY_CLASS_RESERVED`, the deployment’s replicas are scheduled onto your reserved capacity.
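A hedged sketch of such a deployment request follows. The base URL, endpoint path, auth header, and the payload fields outside `autoscaling` are assumptions for illustration, not the documented API surface.

```python
import requests

API = "https://api.example.coreweave.com/inference/v1"  # placeholder base URL

# Hypothetical deployment pinned to reserved capacity: with only
# CAPACITY_CLASS_RESERVED listed, replicas schedule onto capacity
# you have claimed rather than on-demand hardware.
payload = {
    "name": "llama-chat",  # hypothetical deployment name
    "autoscaling": {
        "min": 2,
        "max": 8,
        "capacityClasses": ["CAPACITY_CLASS_RESERVED"],
    },
}

resp = requests.post(f"{API}/deployments", json=payload,
                     headers={"Authorization": "Bearer <token>"})
resp.raise_for_status()
```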
Capacity claim configuration
The `resources` field on a capacity claim specifies what to reserve:
| Field | Required | Description |
|---|---|---|
| `instanceId` | Yes | The instance type to reserve (case-insensitive). Must be valid in at least one specified zone. |
| `instanceCount` | Yes | The number of instances to reserve. |
| `capacityType` | Yes | The capacity type: `CAPACITY_TYPE_SERVERLESS` or `CAPACITY_TYPE_CUSTOMER`. |
| `zones` | Yes | The availability zones for the reservation. The order implies allocation preference. |
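Putting the fields together, a capacity claim’s `resources` block might look like the sketch below. The instance type and zone names are placeholders; the parameters endpoint (see "Manage capacity claims" below) lists the values valid for your account.

```python
# Illustrative resources block for a capacity claim, mirroring the
# table above. Instance type and zone names are placeholders.
capacity_claim = {
    "resources": {
        "instanceId": "gd-8xh100ib-i128",  # placeholder instance type (case-insensitive)
        "instanceCount": 4,
        "capacityType": "CAPACITY_TYPE_SERVERLESS",
        "zones": ["US-EAST-04A", "US-EAST-04B"],  # order implies allocation preference
    }
}
```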
Capacity claim status
After creating a capacity claim, check its status to see how many instances are allocated:
| Field | Description |
|---|---|
| `allocatedInstances` | The number of instances currently allocated and available. |
| `pendingInstances` | The number of instances being provisioned (Nodes joining the cluster). |
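A simple status poll might look like the sketch below. The endpoint path, claim identifier, and auth header are assumptions for illustration; only the `allocatedInstances` and `pendingInstances` fields come from this page.

```python
import time

import requests

API = "https://api.example.coreweave.com/inference/v1"  # placeholder base URL
CLAIM_ID = "my-claim"  # placeholder claim identifier

while True:
    status = requests.get(f"{API}/capacity-claims/{CLAIM_ID}",
                          headers={"Authorization": "Bearer <token>"}).json()
    allocated = status.get("allocatedInstances", 0)
    pending = status.get("pendingInstances", 0)
    print(f"allocated={allocated}, pending={pending}")
    if pending == 0:
        break  # nothing left provisioning
    time.sleep(30)  # Nodes joining the cluster can take a while
```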
Manage capacity claims
Manage capacity claims through the CoreWeave Inference API. For per-operation request and response schemas, see the CapacityClaimService pages in the API reference. The parameters endpoint returns the available instance types per zone under `zoneInstanceTypes`.
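For example, a hedged sketch of querying that endpoint, assuming `zoneInstanceTypes` maps each zone to a list of instance type IDs (the URL path and auth header are placeholders, not the documented API):

```python
import requests

API = "https://api.example.coreweave.com/inference/v1"  # placeholder base URL

params = requests.get(f"{API}/capacity-claims/parameters",
                      headers={"Authorization": "Bearer <token>"}).json()

# Assumed shape: {"zoneInstanceTypes": {"<zone>": ["<instanceId>", ...]}}
for zone, instance_types in params["zoneInstanceTypes"].items():
    print(zone, "->", ", ".join(instance_types))
```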