Autoscaling
Autoscaling adjusts the number of replicas serving a deployment so that capacity tracks demand without manual intervention. Configure each deployment with autoscaling parameters that control how replicas scale in response to demand. CoreWeave manages the autoscaling logic, monitors request load, and adjusts replica count within your configured bounds.
Scaling parameters
Use the following parameters to set the boundaries and behavior the autoscaler applies to your deployment. The autoscaling field on a deployment controls scaling behavior:
| Field | Required | Description |
|---|---|---|
| min | Yes | Minimum number of replicas that are always running. Must be at least 1. Scale-to-zero is not supported. |
| max | Yes | Maximum number of replicas. Must be greater than or equal to min. To disable autoscaling, set max equal to min. |
| priority | No | Scaling priority relative to other deployments, from 0 to 1000. Higher values receive scaling preference during resource contention. |
| concurrency | No | Target concurrent requests per replica. Controls the latency-throughput tradeoff: lower values reduce latency, higher values increase throughput. Must be at least 1. |
| capacityClasses | No | The capacity types to use for scaling. Options: CAPACITY_CLASS_RESERVED, CAPACITY_CLASS_ON_DEMAND. |
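
The constraints in the table can be checked before a configuration is submitted. The following is a minimal sketch, assuming the autoscaling settings are held in a plain Python dictionary keyed by the field names above. The function name `validate_autoscaling` is hypothetical and not part of any CoreWeave SDK, and the 0 to 1000 priority range is assumed to be inclusive.

```python
from typing import List

# Capacity class values listed in the table above.
VALID_CAPACITY_CLASSES = {"CAPACITY_CLASS_RESERVED", "CAPACITY_CLASS_ON_DEMAND"}


def validate_autoscaling(config: dict) -> List[str]:
    """Return the problems found in an autoscaling config (empty list if none).

    Hypothetical helper: it only mirrors the rules stated in the table.
    """
    errors = []
    min_replicas = config.get("min")
    max_replicas = config.get("max")

    if min_replicas is None:
        errors.append("min is required")
    elif min_replicas < 1:
        errors.append("min must be at least 1 (scale-to-zero is not supported)")

    if max_replicas is None:
        errors.append("max is required")
    elif min_replicas is not None and max_replicas < min_replicas:
        errors.append("max must be greater than or equal to min")

    priority = config.get("priority")
    if priority is not None and not 0 <= priority <= 1000:
        errors.append("priority must be between 0 and 1000")

    concurrency = config.get("concurrency")
    if concurrency is not None and concurrency < 1:
        errors.append("concurrency must be at least 1")

    for capacity_class in config.get("capacityClasses") or []:
        if capacity_class not in VALID_CAPACITY_CLASSES:
            errors.append(f"unknown capacity class: {capacity_class}")

    return errors
```

For example, `validate_autoscaling({"min": 2, "max": 2})` returns an empty list. That configuration is valid and, because max equals min, it pins the deployment at two replicas.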
How autoscaling works
CoreWeave uses request signals to determine when to add or remove replicas, and uses priority to break ties when deployments contend for GPUs. CoreWeave monitors request queue depth and active request counts across your deployment’s replicas. When demand exceeds the current capacity, the autoscaler adds replicas up to max. When demand drops, the autoscaler removes replicas down to min.
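
As a mental model only, the relationship between demand, the concurrency target, and the min and max bounds can be sketched as follows. The exact formula CoreWeave uses is not described here, so the calculation below (outstanding requests divided by the concurrency target, rounded up, then clamped) is an assumption, and `target_replicas` is a hypothetical name.

```python
import math


def target_replicas(active: int, queued: int, concurrency: int,
                    min_replicas: int, max_replicas: int) -> int:
    """Estimate a replica count for the current load.

    Assumed model, not CoreWeave's actual algorithm: each replica is
    expected to handle `concurrency` requests at once.
    """
    outstanding = active + queued
    wanted = math.ceil(outstanding / concurrency)
    # Clamp to the configured bounds: never below min, never above max.
    return max(min_replicas, min(max_replicas, wanted))
```

Under this model, 10 active and 5 queued requests with a concurrency target of 4 call for 4 replicas, while zero traffic still keeps the deployment at min.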
Scaling priority (priority) determines which deployments scale first when multiple deployments contend for the same GPU resources. A deployment with priority: 1000 scales before one with priority: 100.
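
The effect of priority under contention can be illustrated with a small simulation. This is a sketch under stated assumptions: one GPU per replica, a greedy highest-priority-first allocation, and a hypothetical `grant_replicas` helper. The scheduler's real behavior may differ in detail.

```python
from typing import Dict, List, Tuple


def grant_replicas(requests: List[Tuple[str, int, int]], free_gpus: int) -> Dict[str, int]:
    """Distribute free GPUs to deployments, highest priority first.

    Each request is (deployment_name, priority, replicas_wanted).
    Assumes one GPU per replica, for illustration only.
    """
    grants = {}
    for name, _priority, wanted in sorted(requests, key=lambda r: r[1], reverse=True):
        granted = min(wanted, free_gpus)
        grants[name] = granted
        free_gpus -= granted
    return grants
```

With four free GPUs, a deployment at priority 1000 that wants two replicas receives both before a deployment at priority 100 that wants three, which is left with the remaining two.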
Best practices
Follow these guidelines to optimize autoscaling for your workloads:
- Set min based on latency requirements. A higher minimum avoids cold-start delays when new requests arrive during low-traffic periods. Each replica must load model weights before it can serve requests.
- Set max to control cost. Each replica consumes GPU resources that are billed. Set the maximum to the highest replica count your budget allows.
- Match GPU type to model size. Choose an instance type with enough GPU memory to fit your model weights and the inference runtime’s working memory. Over-provisioning wastes resources, and under-provisioning causes out-of-memory failures.
- Use concurrency to tune latency. For latency-sensitive workloads, set a lower concurrency target so the autoscaler adds replicas sooner. For throughput-oriented workloads, set a higher value to maximize GPU utilization per replica.
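
One way to turn the cost guideline into a number is to work backward from a budget. The sketch below assumes the worst case, in which every replica runs for the whole month, and uses a hypothetical `max_replicas_for_budget` helper. The rate and budget in the usage note are placeholders, not CoreWeave prices.

```python
import math


def max_replicas_for_budget(monthly_budget: float, gpu_hour_rate: float,
                            gpus_per_replica: int,
                            hours_per_month: float = 730.0) -> int:
    """Largest replica count that fits the budget if all replicas run all month.

    All inputs are assumptions supplied by the caller; 730 is an
    approximate number of hours in a month.
    """
    cost_per_replica = gpu_hour_rate * gpus_per_replica * hours_per_month
    return math.floor(monthly_budget / cost_per_replica)
```

With a placeholder budget of 10,000 per month and a placeholder rate of 2.0 per GPU-hour on single-GPU replicas, each replica costs 1,460 per month, so the function returns 6.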
Capacity claims
Autoscaling adjusts replica count within available resources, but doesn’t guarantee that GPU capacity is available. For workloads that require guaranteed GPU capacity, create capacity claims to reserve hardware resources. Capacity claims ensure that infrastructure is available for your deployments, even during periods of high demand.
How capacity claims work
A capacity claim reserves a specified number of GPU instances in an Availability Zone. Reserving capacity is separate from using it: the claim only sets hardware aside, while a deployment’s capacityClasses setting controls scheduling.
Capacity types
The capacityType field determines where the reserved instances come from. Regardless of type, you pay for the reserved instances for as long as the claim exists, whether or not a deployment is actively using them.
- CAPACITY_TYPE_SERVERLESS: Reserves instances from CoreWeave’s managed inference capacity pool. These instances are billed on demand at GPU-hour rates. Availability depends on the capacity CoreWeave has free in the requested zone.
- CAPACITY_TYPE_CUSTOMER: Reserves capacity against your organization’s own instance quota. CoreWeave fulfills the claim by joining Nodes to the inference cluster, which usually takes 10 to 20 minutes and is reflected in the claim’s pendingInstances count. You must have enough instance quota available in the requested zone, or the claim is rejected until quota frees up. These instances are billed like the rest of your organization’s capacity, not at an inference-specific rate: those within your reserved quantity are billed at your reserved rate, and any beyond it are billed on demand. See Billing.
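
The billing split for CAPACITY_TYPE_CUSTOMER capacity amounts to simple arithmetic. The sketch below is illustrative only: `customer_hourly_cost` is a hypothetical helper, the rates are placeholders, and actual charges are governed by your agreement and the Billing page.

```python
def customer_hourly_cost(org_instances: int, reserved_quantity: int,
                         reserved_rate: float, on_demand_rate: float) -> float:
    """Hourly cost when instances up to the reserved quantity use the reserved
    rate and any beyond it use the on-demand rate.

    Rates are per instance per hour and are placeholder inputs.
    """
    at_reserved = min(org_instances, reserved_quantity)
    at_on_demand = max(0, org_instances - reserved_quantity)
    return at_reserved * reserved_rate + at_on_demand * on_demand_rate
```

For example, 10 instances against a reserved quantity of 8, at placeholder rates of 1.5 (reserved) and 2.5 (on demand), cost 8 × 1.5 + 2 × 2.5 = 17.0 per hour.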
Capacity classes
A deployment’s capacityClasses setting controls which capacity its replicas can schedule onto. A claim sets capacity aside, but a replica runs on it only when the deployment’s capacityClasses value allows it:
| capacityClasses value | Where replicas can schedule |
|---|---|
| CAPACITY_CLASS_ON_DEMAND | Unclaimed on-demand capacity only. |
| CAPACITY_CLASS_RESERVED | Reserved capacity from CAPACITY_TYPE_SERVERLESS and CAPACITY_TYPE_CUSTOMER claims. |
| Unset | All capacity: unclaimed on-demand, plus CAPACITY_TYPE_SERVERLESS and CAPACITY_TYPE_CUSTOMER claims. |
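
The table can be read as a lookup from a capacityClasses value to the pools a replica may use. The following sketch encodes it, with two assumptions that the table does not state: an empty list is treated the same as unset, and listing both classes yields the union of their pools. The pool labels and the `schedulable_pools` name are hypothetical.

```python
from typing import Optional, Sequence, Set

# Descriptive labels for the three pools named in the table.
ON_DEMAND_POOL = "unclaimed on-demand"
SERVERLESS_CLAIMS = "CAPACITY_TYPE_SERVERLESS claims"
CUSTOMER_CLAIMS = "CAPACITY_TYPE_CUSTOMER claims"


def schedulable_pools(capacity_classes: Optional[Sequence[str]]) -> Set[str]:
    """Return the pools a deployment's replicas can schedule onto."""
    if not capacity_classes:
        # Unset (or, by assumption, empty): all capacity is eligible.
        return {ON_DEMAND_POOL, SERVERLESS_CLAIMS, CUSTOMER_CLAIMS}
    pools: Set[str] = set()
    for capacity_class in capacity_classes:
        if capacity_class == "CAPACITY_CLASS_ON_DEMAND":
            pools.add(ON_DEMAND_POOL)
        elif capacity_class == "CAPACITY_CLASS_RESERVED":
            pools.update({SERVERLESS_CLAIMS, CUSTOMER_CLAIMS})
        else:
            raise ValueError(f"unknown capacity class: {capacity_class}")
    return pools
```

Note the consequence for claims: a deployment set to CAPACITY_CLASS_ON_DEMAND never lands on claimed capacity, even if a claim exists.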
Capacity claim configuration
Use the following fields to describe the hardware you want to reserve. The resources field on a capacity claim specifies what to reserve:
| Field | Required | Description |
|---|---|---|
| instanceId | Yes | The instance type to reserve (case-insensitive). Must be valid in the specified zone. |
| instanceCount | Yes | The number of instances to reserve. |
| capacityType | Yes | The capacity type, CAPACITY_TYPE_SERVERLESS or CAPACITY_TYPE_CUSTOMER. See Capacity types. |
| zones | Yes | The Availability Zone to reserve in. A claim can currently specify only one zone. |
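
A claim's resources can be sanity-checked locally against these rules before calling the API. The sketch below makes several assumptions that are not confirmed here: zones is a list, the available instance types are supplied as a mapping from zone name to a list of instance type names, and `validate_claim_resources` is a hypothetical helper. The zone and instance names in the usage note are placeholders.

```python
from typing import Dict, List

VALID_CAPACITY_TYPES = {"CAPACITY_TYPE_SERVERLESS", "CAPACITY_TYPE_CUSTOMER"}
REQUIRED_FIELDS = ("instanceId", "instanceCount", "capacityType", "zones")


def validate_claim_resources(resources: dict,
                             zone_instance_types: Dict[str, List[str]]) -> List[str]:
    """Return the problems found in a claim's resources (empty list if none).

    `zone_instance_types` is an assumed shape: zone name -> instance types.
    """
    errors = [f"{name} is required" for name in REQUIRED_FIELDS if name not in resources]
    if errors:
        return errors

    if resources["capacityType"] not in VALID_CAPACITY_TYPES:
        errors.append(f"unknown capacity type: {resources['capacityType']}")

    zones = resources["zones"]
    if len(zones) != 1:
        errors.append("zones must contain exactly one Availability Zone")
        return errors

    # instanceId is case-insensitive, so compare lowercased names.
    allowed = {name.lower() for name in zone_instance_types.get(zones[0], [])}
    if resources["instanceId"].lower() not in allowed:
        errors.append(
            f"instance type {resources['instanceId']} is not available in {zones[0]}"
        )
    return errors
```

For example, with `{"EXAMPLE-ZONE-A": ["example-gpu-8x"]}` as the mapping, a claim for `EXAMPLE-GPU-8X` in `EXAMPLE-ZONE-A` passes, because the instance type comparison ignores case.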
instanceId must be an instance type that’s available for capacity claims in the zone you request. Query the capacity claim parameters endpoint to see the available instance types per zone, returned under zoneInstanceTypes. Replace [API-TOKEN] with your API access token:
Capacity claim status
After creating a capacity claim, check its status to see how many instances are allocated:
| Field | Description |
|---|---|
allocatedInstances | The number of instances currently allocated and available. |
pendingInstances | The number of instances being provisioned (Nodes joining the cluster). |
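
The two status fields can be combined with the requested instance count to summarize where a claim stands. This is a rough sketch: the three state labels and the `claim_progress` helper are hypothetical and are not values returned by the API.

```python
def claim_progress(requested: int, allocated: int, pending: int) -> str:
    """Summarize a claim from its instanceCount and status fields.

    The returned labels are illustrative, not API values.
    """
    if allocated >= requested and pending == 0:
        return "fulfilled"      # every requested instance is allocated
    if pending > 0:
        return "provisioning"   # Nodes are still joining the cluster
    return "unfulfilled"        # fewer instances than requested, none pending
```

For a CAPACITY_TYPE_CUSTOMER claim, a "provisioning" result is expected for the first 10 to 20 minutes while Nodes join the inference cluster.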
Update and delete behavior
You can change a capacity claim’s instance count after it’s created. For CAPACITY_TYPE_SERVERLESS claims, reducing the instance count can take effect on a delay, because CoreWeave limits how often these claims can scale down. Pending changes appear in the claim’s status.
Deleting a capacity claim evicts any deployments running on its instances and reschedules them onto other capacity if any is available. For CAPACITY_TYPE_CUSTOMER claims, the released instances return to your quota, where you can reallocate them to other workloads such as CKS clusters.
Manage capacity claims
Create, update, and delete capacity claims through the CoreWeave Inference API rather than the deployment configuration. For request and response schemas, see the CapacityClaimService pages in the API reference. The parameters endpoint returns the available instance types per zone under zoneInstanceTypes.