CoreWeave Inference supports autoscaling to match your workload demands. This page explains how to configure autoscaling and reserve GPU capacity for your inference deployments. For pricing details, see Billing.
Autoscaling
Each deployment can be configured with autoscaling parameters that control how replicas scale in response to demand. CoreWeave manages the autoscaling logic: it monitors request load and adjusts replica count within your configured bounds.
Scaling parameters
The `autoscaling` field on a deployment controls scaling behavior:
| Field | Required | Description |
|---|---|---|
| `min` | Yes | Minimum number of replicas that are always running. Must be at least 1. Scale-to-zero is not supported. |
| `max` | Yes | Maximum number of replicas. Must be greater than or equal to `min`. To disable autoscaling, set `max` equal to `min`. |
| `priority` | No | Scaling priority relative to other deployments, from 0 to 1000. Higher values receive scaling preference during resource contention. |
| `concurrency` | No | Target concurrent requests per replica. Controls the latency-throughput tradeoff: lower values reduce latency, higher values increase throughput. Must be at least 1. |
| `capacityClasses` | No | The capacity types to use for scaling. Options: `CAPACITY_CLASS_RESERVED`, `CAPACITY_CLASS_ON_DEMAND`. |
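To see how these fields fit together, here is a minimal sketch of an `autoscaling` block as a Python dict, with a helper that checks the documented constraints. The values and the `validate_autoscaling` helper are illustrative, not part of the API.

```python
# Illustrative autoscaling block, mirroring the fields in the table
# above. Values are examples only; the surrounding deployment payload
# is omitted.
autoscaling = {
    "min": 2,            # always keep two replicas warm (scale-to-zero is not supported)
    "max": 8,            # never run more than eight billed replicas
    "priority": 500,     # optional: 0-1000, higher scales first under contention
    "concurrency": 4,    # optional: target concurrent requests per replica
    "capacityClasses": ["CAPACITY_CLASS_RESERVED", "CAPACITY_CLASS_ON_DEMAND"],
}

def validate_autoscaling(cfg: dict) -> None:
    """Check the documented constraints before submitting a deployment."""
    if cfg["min"] < 1:
        raise ValueError("min must be at least 1; scale-to-zero is not supported")
    if cfg["max"] < cfg["min"]:
        raise ValueError("max must be greater than or equal to min")
    if not 0 <= cfg.get("priority", 0) <= 1000:
        raise ValueError("priority must be between 0 and 1000")
    if cfg.get("concurrency", 1) < 1:
        raise ValueError("concurrency must be at least 1")

validate_autoscaling(autoscaling)
```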
How autoscaling works
CoreWeave monitors request queue depth and active request counts across your deployment’s replicas. When demand exceeds the current capacity, the autoscaler adds replicas up to max. When demand drops, it removes replicas down to min.
Scaling priority (`priority`) determines which deployments scale first when multiple deployments compete for the same GPU resources. A deployment with `priority: 1000` scales before one with `priority: 100`.
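The ordering can be illustrated with a short sketch. This is not CoreWeave's scheduler, only the documented preference that higher `priority` values scale first; the deployment names are hypothetical.

```python
# Illustrative only: rank contending deployments the way the autoscaler
# prefers them (higher priority scales first).
deployments = [
    {"name": "batch-summarizer", "priority": 100},
    {"name": "chat-frontend", "priority": 1000},
    {"name": "eval-runner", "priority": 500},
]

for d in sorted(deployments, key=lambda d: d["priority"], reverse=True):
    print(f"{d['name']} (priority {d['priority']})")
# chat-frontend scales first, then eval-runner, then batch-summarizer
```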
Best practices
Follow these guidelines to optimize autoscaling for your workloads.
- Set `min` based on latency requirements. A higher minimum avoids cold-start delays when new requests arrive during low-traffic periods. Each replica must load model weights before it can serve requests.
- Set `max` to control cost. Each replica consumes GPU resources that are billed. Set the maximum to the highest replica count your budget allows.
- Match GPU type to model size. Choose an instance type with enough GPU memory to fit your model weights and the inference runtime’s working memory. Over-provisioning wastes resources, while under-provisioning causes out-of-memory failures.
- Use `concurrency` to tune latency. For latency-sensitive workloads, set a lower concurrency target so the autoscaler adds replicas sooner. For throughput-oriented workloads, set a higher value to maximize GPU utilization per replica. The sketch after this list contrasts the two.
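The hypothetical configurations below contrast a latency-sensitive autoscaling block with a throughput-oriented one. Field names follow the scaling-parameters table; the values are illustrative, not recommendations.

```python
# Latency-sensitive: more warm replicas, and a low concurrency target
# so the autoscaler adds replicas before requests start queueing.
latency_sensitive = {
    "min": 4,
    "max": 16,
    "concurrency": 2,
}

# Throughput-oriented: fewer warm replicas, each packed with more
# concurrent requests to maximize GPU utilization per replica.
throughput_oriented = {
    "min": 1,
    "max": 8,
    "concurrency": 16,
}
```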
Capacity claims
Autoscaling adjusts replica count within available resources, but does not guarantee that GPU capacity is available. For workloads that require guaranteed GPU capacity, create capacity claims to reserve hardware resources. Capacity claims ensure that infrastructure is available for your deployments, even during periods of high demand.
How capacity claims work
A capacity claim reserves a specified number of GPU instances in a given zone. When you create a deployment with `capacityClasses` set to `CAPACITY_CLASS_RESERVED`, the deployment’s replicas are scheduled onto your reserved capacity.
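A hedged sketch of such a deployment request follows. The base URL, endpoint path, auth header, and the payload fields outside `autoscaling` are assumptions for illustration, not the documented API surface.

```python
import requests

API = "https://api.example.coreweave.com/inference/v1"  # placeholder base URL

# Hypothetical deployment pinned to reserved capacity: with only
# CAPACITY_CLASS_RESERVED listed, replicas schedule onto capacity
# you have claimed rather than on-demand hardware.
payload = {
    "name": "llama-chat",  # hypothetical deployment name
    "autoscaling": {
        "min": 2,
        "max": 8,
        "capacityClasses": ["CAPACITY_CLASS_RESERVED"],
    },
}

resp = requests.post(f"{API}/deployments", json=payload,
                     headers={"Authorization": "Bearer <token>"})
resp.raise_for_status()
```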
Capacity claim configuration
The `resources` field on a capacity claim specifies what to reserve:
| Field | Required | Description |
|---|---|---|
| `instanceId` | Yes | The instance type to reserve (case-insensitive). Must be valid in at least one specified zone. |
| `instanceCount` | Yes | The number of instances to reserve. |
| `capacityType` | Yes | The capacity type: `CAPACITY_TYPE_SERVERLESS` or `CAPACITY_TYPE_CUSTOMER`. |
| `zones` | Yes | The availability zones for the reservation. The order implies allocation preference. |
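Putting the fields together, a capacity claim’s `resources` block might look like the sketch below. The instance type and zone names are placeholders; the parameters endpoint (see "Manage capacity claims" below) lists the values valid for your account.

```python
# Illustrative resources block for a capacity claim, mirroring the
# table above. Instance type and zone names are placeholders.
capacity_claim = {
    "resources": {
        "instanceId": "gd-8xh100ib-i128",  # placeholder instance type (case-insensitive)
        "instanceCount": 4,
        "capacityType": "CAPACITY_TYPE_SERVERLESS",
        "zones": ["US-EAST-04A", "US-EAST-04B"],  # order implies allocation preference
    }
}
```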
Capacity claim status
After creating a capacity claim, check its status to see how many instances are allocated:
| Field | Description |
|---|---|
| `allocatedInstances` | The number of instances currently allocated and available. |
| `pendingInstances` | The number of instances being provisioned (Nodes joining the cluster). |
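A simple status poll might look like the sketch below. The endpoint path, claim identifier, and auth header are assumptions for illustration; only the `allocatedInstances` and `pendingInstances` fields come from this page.

```python
import time

import requests

API = "https://api.example.coreweave.com/inference/v1"  # placeholder base URL
CLAIM_ID = "my-claim"  # placeholder claim identifier

while True:
    status = requests.get(f"{API}/capacity-claims/{CLAIM_ID}",
                          headers={"Authorization": "Bearer <token>"}).json()
    allocated = status.get("allocatedInstances", 0)
    pending = status.get("pendingInstances", 0)
    print(f"allocated={allocated}, pending={pending}")
    if pending == 0:
        break  # nothing left provisioning
    time.sleep(30)  # Nodes joining the cluster can take a while
```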
Manage capacity claims
Manage capacity claims through the CoreWeave Inference API. For per-operation request and response schemas, see the CapacityClaimService pages in the API reference. The parameters endpoint returns the available instance types per zone under `zoneInstanceTypes`.
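For example, a hedged sketch of querying that endpoint, assuming `zoneInstanceTypes` maps each zone to a list of instance type IDs (the URL path and auth header are placeholders, not the documented API):

```python
import requests

API = "https://api.example.coreweave.com/inference/v1"  # placeholder base URL

params = requests.get(f"{API}/capacity-claims/parameters",
                      headers={"Authorization": "Bearer <token>"}).json()

# Assumed shape: {"zoneInstanceTypes": {"<zone>": ["<instanceId>", ...]}}
for zone, instance_types in params["zoneInstanceTypes"].items():
    print(zone, "->", ", ".join(instance_types))
```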