
CoreWeave Inference supports autoscaling to match your workload demands. This page explains how to configure autoscaling and reserve GPU capacity for your inference deployments. For pricing details, see Billing.

Autoscaling

Each deployment can be configured with autoscaling parameters that control how replicas scale in response to demand. CoreWeave manages the autoscaling logic: it monitors request load and adjusts replica count within your configured bounds.

Scaling parameters

The autoscaling field on a deployment controls scaling behavior:
| Field | Required | Description |
| --- | --- | --- |
| min | Yes | Minimum number of replicas that are always running. Must be at least 1. Scale-to-zero is not supported. |
| max | Yes | Maximum number of replicas. Must be greater than or equal to min. To disable autoscaling, set max equal to min. |
| priority | No | Scaling priority relative to other deployments, from 0 to 1000. Higher values receive scaling preference during resource contention. |
| concurrency | No | Target concurrent requests per replica. Controls the latency-throughput tradeoff: lower values reduce latency, higher values increase throughput. Must be at least 1. |
| capacityClasses | No | The capacity types to use for scaling. Options: CAPACITY_CLASS_RESERVED, CAPACITY_CLASS_ON_DEMAND. |
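As a sketch, an autoscaling block built from the fields above might look like the following. The surrounding deployment request envelope is not shown here; only the documented field names and enum values are used, and the specific numbers are illustrative.

```python
# Hypothetical autoscaling block for a deployment spec. Field names and
# enum values come from the table above; the values are examples only.
autoscaling = {
    "min": 2,             # always keep 2 replicas warm (must be >= 1)
    "max": 8,             # upper bound on billed replicas (must be >= min)
    "priority": 500,      # mid-range scaling preference (0-1000)
    "concurrency": 16,    # target ~16 in-flight requests per replica
    "capacityClasses": [  # both capacity classes enabled for scaling
        "CAPACITY_CLASS_RESERVED",
        "CAPACITY_CLASS_ON_DEMAND",
    ],
}
```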

How autoscaling works

CoreWeave monitors request queue depth and active request counts across your deployment’s replicas. When demand exceeds the current capacity, the autoscaler adds replicas up to max. When demand drops, it removes replicas down to min. Scaling priority (priority) determines which deployments scale first when multiple deployments compete for the same GPU resources. A deployment with priority: 1000 scales before one with priority: 100.
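The scaling behavior described above can be approximated as a clamp of demand-driven replica count to the configured bounds. This is a simplification for intuition, not CoreWeave's actual controller logic, which also accounts for queue depth and priority.

```python
import math

def desired_replicas(active_requests: int, concurrency: int,
                     min_replicas: int, max_replicas: int) -> int:
    # Replicas needed so each handles roughly `concurrency` requests,
    # clamped to the deployment's configured [min, max] bounds.
    needed = math.ceil(active_requests / concurrency)
    return max(min_replicas, min(needed, max_replicas))

# With concurrency=32, min=2, max=16: 250 active requests -> 8 replicas.
```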

Best practices

Follow these guidelines to optimize autoscaling for your workloads.
  • Set min based on latency requirements. A higher minimum avoids cold-start delays when new requests arrive during low-traffic periods. Each replica must load model weights before it can serve requests.
  • Set max to control cost. Each replica consumes GPU resources that are billed. Set the maximum to the highest replica count your budget allows.
  • Match GPU type to model size. Choose an instance type with enough GPU memory to fit your model weights and the inference runtime’s working memory. Over-provisioning wastes resources, while under-provisioning causes out-of-memory failures.
  • Use concurrency to tune latency. For latency-sensitive workloads, set a lower concurrency target so the autoscaler adds replicas sooner. For throughput-oriented workloads, set a higher value to maximize GPU utilization per replica.
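The GPU-sizing guideline above can be made concrete with a back-of-the-envelope estimate: half-precision (FP16/BF16) weights take about 2 bytes per parameter, and runtime working memory (KV cache, activations, CUDA context) comes on top of that. A rough sketch:

```python
def estimate_weight_memory_gib(params_billion: float,
                               bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone.

    FP16/BF16 weights use 2 bytes per parameter; runtime overhead
    (KV cache, activations) is extra, so pick an instance type with
    comfortable headroom above this estimate.
    """
    return params_billion * 1e9 * bytes_per_param / 2**30

# A 7B-parameter model in FP16 needs roughly 13 GiB for weights alone,
# so a 16 GiB GPU leaves little headroom for the KV cache.
```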

Capacity claims

Autoscaling adjusts replica count within available resources, but does not guarantee that GPU capacity is available. For workloads that require guaranteed GPU capacity, create capacity claims to reserve hardware resources. Capacity claims ensure that infrastructure is available for your deployments, even during periods of high demand.

How capacity claims work

A capacity claim reserves a specified number of GPU instances in a given zone. When you create a deployment with capacityClasses set to CAPACITY_CLASS_RESERVED, the deployment’s replicas are scheduled onto your reserved capacity.

Capacity claim configuration

The resources field on a capacity claim specifies what to reserve:
| Field | Required | Description |
| --- | --- | --- |
| instanceId | Yes | The instance type to reserve (case-insensitive). Must be valid in at least one specified zone. |
| instanceCount | Yes | The number of instances to reserve. |
| capacityType | Yes | The capacity type: CAPACITY_TYPE_SERVERLESS or CAPACITY_TYPE_CUSTOMER. |
| zones | Yes | The availability zones for the reservation. The order implies allocation preference. |
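A resources block using these fields might look like the following sketch. The instanceId and zone names here are placeholders, not verified CoreWeave identifiers; substitute values returned by the parameters endpoint.

```python
# Hypothetical capacity claim resources block; instanceId and zone names
# are illustrative placeholders, not real CoreWeave identifiers.
resources = {
    "instanceId": "example-8xgpu",      # placeholder instance type (case-insensitive)
    "instanceCount": 4,                 # reserve four instances
    "capacityType": "CAPACITY_TYPE_SERVERLESS",
    "zones": ["ZONE-A", "ZONE-B"],      # order implies allocation preference
}
```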

Capacity claim status

After creating a capacity claim, check its status to see how many instances are allocated:
| Field | Description |
| --- | --- |
| allocatedInstances | The number of instances currently allocated and available. |
| pendingInstances | The number of instances being provisioned (Nodes joining the cluster). |
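A common pattern is to poll the claim's status until enough instances are allocated. The sketch below assumes only the two status fields documented above; get_status stands in for whatever API call fetches the claim's status.

```python
import time

def wait_for_allocation(get_status, want: int,
                        timeout_s: float = 600, poll_s: float = 5):
    """Poll a capacity claim until `want` instances are allocated.

    get_status is a placeholder for the API call that returns the claim's
    status dict: {"allocatedInstances": ..., "pendingInstances": ...}.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status["allocatedInstances"] >= want:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"only {status['allocatedInstances']}/{want} instances allocated"
            )
        time.sleep(poll_s)
```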

Manage capacity claims

Manage capacity claims through the CoreWeave Inference API. For per-operation request and response schemas, see the CapacityClaimService pages in the API reference. The parameters endpoint returns the available instance types per zone under zoneInstanceTypes.
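The exact shape of the parameters response is not shown on this page; assuming zoneInstanceTypes maps zone names to lists of instance types (an assumption based on the field name alone), a small lookup helper might look like:

```python
def instance_types_for_zone(parameters: dict, zone: str) -> list:
    # `parameters` is the parsed parameters-endpoint response. The nesting
    # of zoneInstanceTypes assumed here (zone -> list of instance types)
    # is a guess from the field name; check the API reference for the
    # authoritative schema.
    return parameters.get("zoneInstanceTypes", {}).get(zone, [])

# Example with a made-up response body and placeholder names:
params = {"zoneInstanceTypes": {"ZONE-A": ["type-a", "type-b"]}}
```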
Last modified on May 6, 2026