CoreWeave Inference supports autoscaling to match your workload demands. This page explains how to configure autoscaling and reserve GPU capacity for your inference deployments. Your models respond to changing traffic while the resources you need stay available. For pricing details, see Billing.

Autoscaling

Autoscaling adjusts the number of replicas serving a deployment so that capacity tracks demand without manual intervention. Configure each deployment with autoscaling parameters that control how replicas scale in response to demand. CoreWeave manages the autoscaling logic, monitors request load, and adjusts replica count within your configured bounds.

Scaling parameters

The autoscaling field on a deployment sets the boundaries and behavior the autoscaler applies:
| Field | Required | Description |
| --- | --- | --- |
| min | Yes | Minimum number of replicas that are always running. Must be at least 1. Scale-to-zero is not supported. |
| max | Yes | Maximum number of replicas. Must be greater than or equal to min. To disable autoscaling, set max equal to min. |
| priority | No | Scaling priority relative to other deployments, from 0 to 1000. Higher values receive scaling preference during resource contention. |
| concurrency | No | Target concurrent requests per replica. Controls the latency-throughput tradeoff: lower values reduce latency, higher values increase throughput. Must be at least 1. |
| capacityClasses | No | The capacity types to use for scaling. Options: CAPACITY_CLASS_RESERVED, CAPACITY_CLASS_ON_DEMAND. |
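The constraints in this table can be checked before you submit a deployment. The following is a minimal sketch of such a check, not CoreWeave's own validation logic. The function name `validate_autoscaling` and its argument names are hypothetical; only the rules themselves come from the table.

```python
from typing import Optional, Sequence

# Values listed for capacityClasses in the table above.
ALLOWED_CAPACITY_CLASSES = {"CAPACITY_CLASS_RESERVED", "CAPACITY_CLASS_ON_DEMAND"}


def validate_autoscaling(
    min_replicas: int,
    max_replicas: int,
    priority: Optional[int] = None,
    concurrency: Optional[int] = None,
    capacity_classes: Optional[Sequence[str]] = None,
) -> list[str]:
    """Return a list of rule violations; an empty list means the values pass."""
    errors = []
    if min_replicas < 1:
        errors.append("min must be at least 1 (scale-to-zero is not supported)")
    if max_replicas < min_replicas:
        errors.append("max must be greater than or equal to min")
    if priority is not None and not 0 <= priority <= 1000:
        errors.append("priority must be between 0 and 1000")
    if concurrency is not None and concurrency < 1:
        errors.append("concurrency must be at least 1")
    for value in capacity_classes or []:
        if value not in ALLOWED_CAPACITY_CLASSES:
            errors.append(f"unknown capacity class: {value}")
    return errors
```

For example, `validate_autoscaling(2, 1)` reports that max is below min, while `validate_autoscaling(3, 3)` passes and describes a deployment with autoscaling disabled.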

How autoscaling works

CoreWeave monitors request queue depth and active request counts across your deployment’s replicas. When demand exceeds current capacity, the autoscaler adds replicas up to max. When demand drops, it removes replicas down to min. When multiple deployments contend for the same GPU resources, priority determines which deployments scale first: a deployment with priority: 1000 scales before one with priority: 100.
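To build intuition for how these settings interact, the sketch below models the behavior in a few lines. It is an illustration under stated assumptions, not CoreWeave's autoscaling algorithm, which this page does not specify. It assumes the autoscaler aims for enough replicas to keep each one at or below its concurrency target, and it uses hypothetical names (`Deployment`, `desired_replicas`, `scale_order`).

```python
import math
from dataclasses import dataclass


@dataclass
class Deployment:
    name: str
    min_replicas: int
    max_replicas: int
    concurrency: int  # target concurrent requests per replica
    priority: int = 0


def desired_replicas(dep: Deployment, in_flight_requests: int) -> int:
    """Estimate a replica count for the given load, clamped to [min, max]."""
    # Assumed rule: one replica per `concurrency` in-flight requests, rounded up.
    needed = math.ceil(in_flight_requests / dep.concurrency)
    return max(dep.min_replicas, min(dep.max_replicas, needed))


def scale_order(deployments: list[Deployment]) -> list[Deployment]:
    """Order deployments so that higher priority values come first."""
    return sorted(deployments, key=lambda d: d.priority, reverse=True)
```

Under this model, a deployment with min 1, max 5, and a concurrency target of 4 would run 3 replicas for 10 in-flight requests, stay at 1 replica when idle, and stop at 5 replicas no matter how high the load climbs.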

Best practices

Follow these guidelines to optimize autoscaling for your workloads:
  • Set min based on latency requirements. A higher minimum avoids cold-start delays when new requests arrive during low-traffic periods. Each replica must load model weights before it can serve requests.
  • Set max to control cost. Each replica consumes GPU resources that are billed. Set the maximum to the highest replica count your budget allows.
  • Match GPU type to model size. Choose an instance type with enough GPU memory to fit your model weights and the inference runtime’s working memory. Over-provisioning wastes resources, and under-provisioning causes out-of-memory failures.
  • Use concurrency to tune latency. For latency-sensitive workloads, set a lower concurrency target so the autoscaler adds replicas sooner. For throughput-oriented workloads, set a higher value to maximize GPU utilization per replica.
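As a worked example of the cost guideline, the sketch below derives a max value from an hourly budget. The function name and the figures in the usage note are hypothetical placeholders, not CoreWeave rates. Use the rates on the Billing page for real calculations.

```python
def max_replicas_for_budget(
    hourly_budget: float,
    cost_per_replica_hour: float,
    min_replicas: int = 1,
) -> int:
    """Return the largest replica count whose hourly cost fits the budget."""
    if cost_per_replica_hour <= 0:
        raise ValueError("cost_per_replica_hour must be positive")
    # Whole replicas only: round down to stay within the budget.
    affordable = int(hourly_budget // cost_per_replica_hour)
    if affordable < min_replicas:
        # max must be >= min, so the budget has to cover the minimum.
        raise ValueError("budget does not cover the configured minimum")
    return affordable
```

With a placeholder budget of 100.0 per hour and a placeholder cost of 12.5 per replica-hour, the function returns 8, which would be the value to set for max.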

Capacity claims

Autoscaling adjusts replica count within available resources, but doesn’t guarantee that GPU capacity is available. For workloads that require guaranteed GPU capacity, create capacity claims to reserve hardware resources. Capacity claims ensure that infrastructure is available for your deployments, even during periods of high demand.

How capacity claims work

A capacity claim reserves a specified number of GPU instances in an Availability Zone. Reserving capacity is separate from using it: the claim only sets hardware aside, while a deployment’s capacityClasses setting controls scheduling.

Capacity types

The capacityType field determines where the reserved instances come from. Regardless of type, you pay for the reserved instances for as long as the claim exists, whether or not a deployment is actively using them.
  • CAPACITY_TYPE_SERVERLESS: Reserves instances from CoreWeave’s managed inference capacity pool. These instances are billed on demand at GPU-hour rates. Availability depends on the capacity CoreWeave has free in the requested zone.
  • CAPACITY_TYPE_CUSTOMER: Reserves capacity against your organization’s own instance quota. CoreWeave fulfills the claim by joining Nodes to the inference cluster, which usually takes 10 to 20 minutes and is reflected in the claim’s pendingInstances count. You must have enough instance quota available in the requested zone, or the claim is rejected until quota frees up. These instances are billed like the rest of your organization’s capacity, not at an inference-specific rate: those within your reserved quantity are billed at your reserved rate, and any beyond it are billed on demand. See Billing.

Capacity classes

A deployment’s capacityClasses setting controls which capacity its replicas can schedule onto. A claim sets capacity aside, but a replica runs on it only when the deployment’s capacityClasses value allows it:
| capacityClasses value | Where replicas can schedule |
| --- | --- |
| CAPACITY_CLASS_ON_DEMAND | Unclaimed on-demand capacity only. |
| CAPACITY_CLASS_RESERVED | Reserved capacity from CAPACITY_TYPE_SERVERLESS and CAPACITY_TYPE_CUSTOMER claims. |
| Unset | All capacity: unclaimed on-demand, plus CAPACITY_TYPE_SERVERLESS and CAPACITY_TYPE_CUSTOMER claims. |
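The table can be read as a lookup from the capacityClasses value to the pools a replica may land on. The sketch below encodes that lookup. The pool labels and the function name `eligible_capacity` are hypothetical. Two behaviors are assumptions that the table does not state: an empty list is treated the same as unset, and listing both classes yields the union of their pools.

```python
from typing import Optional, Sequence

# Hypothetical labels for the three kinds of capacity named in the table.
ON_DEMAND = "unclaimed on-demand"
SERVERLESS_CLAIM = "CAPACITY_TYPE_SERVERLESS claim"
CUSTOMER_CLAIM = "CAPACITY_TYPE_CUSTOMER claim"

ELIGIBLE = {
    "CAPACITY_CLASS_ON_DEMAND": {ON_DEMAND},
    "CAPACITY_CLASS_RESERVED": {SERVERLESS_CLAIM, CUSTOMER_CLAIM},
}


def eligible_capacity(capacity_classes: Optional[Sequence[str]]) -> set[str]:
    """Return the kinds of capacity a deployment's replicas can schedule onto."""
    if not capacity_classes:
        # Unset (or, by assumption, empty): all capacity is eligible.
        return {ON_DEMAND, SERVERLESS_CLAIM, CUSTOMER_CLAIM}
    pools: set[str] = set()
    for value in capacity_classes:
        pools |= ELIGIBLE[value]  # raises KeyError for an unknown class
    return pools
```

Note that a deployment set to CAPACITY_CLASS_ON_DEMAND never uses claimed capacity, so a claim alone does not move its replicas onto reserved hardware.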

Capacity claim configuration

The resources field on a capacity claim describes the hardware you want to reserve:
| Field | Required | Description |
| --- | --- | --- |
| instanceId | Yes | The instance type to reserve (case-insensitive). Must be valid in the specified zone. |
| instanceCount | Yes | The number of instances to reserve. |
| capacityType | Yes | The capacity type, CAPACITY_TYPE_SERVERLESS or CAPACITY_TYPE_CUSTOMER. See Capacity types. |
| zones | Yes | The Availability Zone to reserve in. A claim can currently specify only one zone. |
The instanceId must be an instance type that’s available for capacity claims in the zone you request. Query the capacity claim parameters endpoint to see the available instance types per zone, returned under zoneInstanceTypes. Replace [API-TOKEN] with your API access token:
curl "https://api.coreweave.com/v1alpha1/inference/capacityclaims/parameters" \
  -H "Authorization: Bearer [API-TOKEN]"
For the full list of GPU instance types and their specifications, see GPU instances.
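The sketch below assembles the four required fields and checks them against the rules in this section. It is illustrative only. The function name is hypothetical, the list shape of zones is inferred from the plural field name, the positive instanceCount check is an assumption, and `zone_instance_types` assumes a mapping from zone name to instance type names, which may not match the actual shape of the zoneInstanceTypes response. Check the API reference for the real schemas.

```python
from typing import Mapping, Optional, Sequence

CAPACITY_TYPES = {"CAPACITY_TYPE_SERVERLESS", "CAPACITY_TYPE_CUSTOMER"}


def build_claim_resources(
    instance_id: str,
    instance_count: int,
    capacity_type: str,
    zone: str,
    zone_instance_types: Optional[Mapping[str, Sequence[str]]] = None,
) -> dict:
    """Assemble the required fields, rejecting values the table rules out."""
    if capacity_type not in CAPACITY_TYPES:
        raise ValueError(f"unknown capacityType: {capacity_type}")
    if instance_count < 1:
        # Assumption: a claim reserves at least one instance.
        raise ValueError("instanceCount must be a positive integer")
    if zone_instance_types is not None:
        # instanceId is case-insensitive, so compare in lower case.
        available = {name.lower() for name in zone_instance_types.get(zone, [])}
        if instance_id.lower() not in available:
            raise ValueError(f"{instance_id} is not available for claims in {zone}")
    return {
        "instanceId": instance_id,
        "instanceCount": instance_count,
        "capacityType": capacity_type,
        "zones": [zone],  # assumed list shape; only one zone is currently allowed
    }
```

Passing the per-zone instance types from the parameters endpoint as `zone_instance_types` catches an unavailable instance type locally, before the claim request is sent.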

Capacity claim status

After creating a capacity claim, check its status to see how many instances are allocated:
| Field | Description |
| --- | --- |
| allocatedInstances | The number of instances currently allocated and available. |
| pendingInstances | The number of instances being provisioned (Nodes joining the cluster). |
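One way to interpret these two counts is to compare them with the instanceCount you requested. The sketch below does that with a hypothetical helper. The three labels it returns are illustrative names, not API values, and the rule that a claim is complete once allocatedInstances reaches instanceCount is an assumption.

```python
def claim_progress(
    instance_count: int,
    allocated_instances: int,
    pending_instances: int,
) -> str:
    """Summarize a claim's status counts as a single label."""
    if allocated_instances >= instance_count:
        return "fulfilled"  # everything requested is allocated and available
    if pending_instances > 0:
        return "provisioning"  # Nodes are still joining the cluster
    return "waiting"  # short of the request with nothing in progress
```

For a CAPACITY_TYPE_CUSTOMER claim, the "provisioning" state typically lasts 10 to 20 minutes while Nodes join the inference cluster.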

Update and delete behavior

You can change a capacity claim’s instance count after it’s created. For CAPACITY_TYPE_SERVERLESS claims, a reduction in instance count can take effect after a delay, because CoreWeave limits how often these claims can scale down. Pending changes appear in the claim’s status.
Deleting a capacity claim evicts any deployments running on its instances and reschedules them onto other capacity if any is available. For CAPACITY_TYPE_CUSTOMER claims, the released instances return to your quota, where you can reallocate them to other workloads such as CKS clusters.

Manage capacity claims

Create, update, and delete capacity claims through the CoreWeave Inference API, not through the deployment configuration. For request and response schemas, see the CapacityClaimService pages in the API reference. The parameters endpoint returns the available instance types per zone under zoneInstanceTypes.
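If you call the API from Python rather than curl, the parameters request can be built with the standard library. The sketch below only constructs the request. The function name is hypothetical, while the URL and Bearer header match the curl example on this page.

```python
import urllib.request

PARAMETERS_URL = (
    "https://api.coreweave.com/v1alpha1/inference/capacityclaims/parameters"
)


def build_parameters_request(api_token: str) -> urllib.request.Request:
    """Build a GET request for the capacity claim parameters endpoint."""
    return urllib.request.Request(
        PARAMETERS_URL,
        headers={"Authorization": f"Bearer {api_token}"},
        method="GET",
    )
```

To send it, pass the result to `urllib.request.urlopen` and decode the response body as JSON. This sketch assumes nothing about the response beyond the zoneInstanceTypes key mentioned above.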
Last modified on June 10, 2026