# Scaling

> Configure autoscaling and reserve capacity for inference deployments

CoreWeave Inference supports autoscaling to match your workload demands. This page explains how to configure autoscaling and reserve GPU capacity for your inference deployments, so that your models keep up with changing traffic and the resources they need stay available. For pricing details, see [Billing](/products/inference/billing).

## Autoscaling

Autoscaling adjusts the number of replicas serving a deployment so that capacity tracks demand without manual intervention. You configure each [deployment](/products/inference/models) with autoscaling parameters that set the bounds. CoreWeave manages the autoscaling logic: it monitors request load and adjusts the replica count within those bounds.

### Scaling parameters

The `autoscaling` field on a deployment sets the boundaries and behavior that the autoscaler applies to that deployment:

| Field             | Required | Description                                                                                                                                                             |
| ----------------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `min`             | Yes      | Minimum number of replicas that are always running. Must be at least `1`. Scale-to-zero is not supported.                                                               |
| `max`             | Yes      | Maximum number of replicas. Must be greater than or equal to `min`. To disable autoscaling, set `max` equal to `min`.                                                   |
| `priority`        | No       | Scaling priority relative to other deployments, from 0 to 1000. Higher values receive scaling preference during resource contention.                                    |
| `concurrency`     | No       | Target concurrent requests per replica. Controls the latency-throughput tradeoff: lower values reduce latency, higher values increase throughput. Must be at least `1`. |
| `capacityClasses` | No       | The capacity types to use for scaling. Options: `CAPACITY_CLASS_RESERVED`, `CAPACITY_CLASS_ON_DEMAND`.                                                                  |
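
As an illustration of the constraints in the table, the following Python sketch checks an `autoscaling` configuration before you submit it. The field names and limits come from the table above. The dictionary shape and the `validate_autoscaling` helper are assumptions made for this example and are not part of the CoreWeave API.

```python
# Hypothetical client-side check; the API performs its own validation.
VALID_CAPACITY_CLASSES = {"CAPACITY_CLASS_RESERVED", "CAPACITY_CLASS_ON_DEMAND"}


def validate_autoscaling(cfg: dict) -> list[str]:
    """Return a list of problems; an empty list means the documented limits are met."""
    errors = [f"'{key}' is required" for key in ("min", "max") if key not in cfg]
    if errors:
        return errors

    if cfg["min"] < 1:
        errors.append("'min' must be at least 1 (scale-to-zero is not supported)")
    if cfg["max"] < cfg["min"]:
        errors.append("'max' must be greater than or equal to 'min'")

    priority = cfg.get("priority")
    if priority is not None and not 0 <= priority <= 1000:
        errors.append("'priority' must be between 0 and 1000")

    concurrency = cfg.get("concurrency")
    if concurrency is not None and concurrency < 1:
        errors.append("'concurrency' must be at least 1")

    for capacity_class in cfg.get("capacityClasses", []):
        if capacity_class not in VALID_CAPACITY_CLASSES:
            errors.append(f"unknown capacity class: {capacity_class}")

    return errors


# Example values only; choose bounds that fit your workload.
autoscaling = {
    "min": 1,
    "max": 4,
    "priority": 500,
    "concurrency": 8,
    "capacityClasses": ["CAPACITY_CLASS_RESERVED"],
}
print(validate_autoscaling(autoscaling))  # []
```

Setting `"max"` equal to `"min"` in this dictionary would pass the check and, per the table, disables autoscaling.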

### How autoscaling works

CoreWeave monitors request queue depth and active request counts across your deployment's replicas to decide when to add or remove replicas. When demand exceeds the current capacity, the autoscaler adds replicas up to `max`. When demand drops, the autoscaler removes replicas down to `min`.
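
To build intuition for how `concurrency`, `min`, and `max` interact, the sketch below models a concurrency-based autoscaler in Python. This is a simplified mental model, not CoreWeave's actual algorithm, which this page does not specify. The `desired_replicas` function and the `in_flight` parameter (queued plus active requests) are assumptions made for illustration.

```python
import math


def desired_replicas(in_flight: int, concurrency: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Estimate a replica count for the current load, clamped to [min, max].

    Simplified model: each replica is assumed to handle `concurrency`
    requests at once, so the load needs ceil(in_flight / concurrency) replicas.
    """
    needed = math.ceil(in_flight / concurrency)
    return max(min_replicas, min(max_replicas, needed))


# With min=1, max=4, and a concurrency target of 8:
print(desired_replicas(0, 8, 1, 4))    # 1: never drops below min
print(desired_replicas(20, 8, 1, 4))   # 3: ceil(20 / 8)
print(desired_replicas(100, 8, 1, 4))  # 4: capped at max
```

In this model, halving the concurrency target doubles the replica count requested for the same load, which is why lower values favor latency and higher values favor throughput.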

Scaling priority (`priority`) determines which deployments scale first when multiple deployments contend for the same GPU resources. A deployment with `priority: 1000` scales before one with `priority: 100`.
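
The effect of priority under contention can be pictured with a small Python sketch. It assumes a greedy allocator that serves deployments in descending priority order and that each replica needs one GPU instance. Both assumptions, and the `allocate_by_priority` helper itself, are hypothetical and serve only to illustrate the ordering described above.

```python
def allocate_by_priority(requests: list[tuple[str, int, int]],
                         free_instances: int) -> dict[str, int]:
    """Grant replicas to deployments in descending priority order.

    requests: (deployment name, priority, additional replicas wanted).
    Returns the number of replicas granted per deployment.
    """
    granted = {}
    for name, _priority, wanted in sorted(requests, key=lambda r: r[1], reverse=True):
        take = min(wanted, free_instances)
        granted[name] = take
        free_instances -= take
    return granted


# Two deployments compete for 4 free instances.
requests = [("batch-summarizer", 100, 3), ("chat-frontend", 1000, 2)]
print(allocate_by_priority(requests, 4))
# {'chat-frontend': 2, 'batch-summarizer': 2}
```

The higher-priority deployment receives everything it asked for, and the lower-priority one receives whatever remains.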

### Best practices

Follow these guidelines to optimize autoscaling for your workloads:

* **Set `min` based on latency requirements.** A higher minimum avoids cold-start delays when new requests arrive during low-traffic periods. Each replica must load model weights before it can serve requests.
* **Set `max` to control cost.** Each running replica consumes GPU resources that you are billed for. Set the maximum to the highest replica count your budget allows.
* **Match GPU type to model size.** Choose an instance type with enough GPU memory to fit your model weights and the inference runtime's working memory. Over-provisioning wastes resources, and under-provisioning causes out-of-memory failures.
* **Use `concurrency` to tune latency.** For latency-sensitive workloads, set a lower concurrency target so the autoscaler adds replicas sooner. For throughput-oriented workloads, set a higher value to maximize GPU utilization per replica.
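
For the guideline on matching GPU type to model size, a rough first check is to compare the size of the model weights with the GPU memory available. The Python sketch below does that arithmetic. The 20% headroom for the runtime's working memory is an assumed placeholder, not a CoreWeave recommendation, and the function names are hypothetical. Actual requirements depend on the runtime, batch size, and context length.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone (2 bytes per parameter for 16-bit weights)."""
    return num_params * bytes_per_param / 2**30


def fits(num_params: float, gpu_memory_gib: float,
         bytes_per_param: int = 2, headroom: float = 0.2) -> bool:
    """Check whether the weights fit after reserving `headroom` for working memory.

    `headroom` is an assumed fraction of GPU memory kept free for the
    inference runtime; tune it for your workload.
    """
    usable = gpu_memory_gib * (1 - headroom)
    return weight_memory_gib(num_params, bytes_per_param) <= usable


print(round(weight_memory_gib(7e9), 2))  # 13.04
print(fits(7e9, 80))                     # True
print(fits(70e9, 80))                    # False: the weights alone exceed 80 GiB
```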

## Capacity claims

Autoscaling adjusts replica count within available resources, but doesn't guarantee that GPU capacity is available. For workloads that require guaranteed GPU capacity, create capacity claims to reserve hardware resources. Capacity claims ensure that infrastructure is available for your deployments, even during periods of high demand.

### How capacity claims work

A capacity claim reserves a specified number of GPU instances in a given zone. When you create a deployment with `capacityClasses` set to `CAPACITY_CLASS_RESERVED`, CoreWeave schedules the deployment's replicas onto your reserved capacity.

### Capacity claim configuration

The `resources` field on a capacity claim describes the hardware to reserve:

| Field           | Required | Description                                                                                    |
| --------------- | -------- | ---------------------------------------------------------------------------------------------- |
| `instanceId`    | Yes      | The instance type to reserve (case-insensitive). Must be valid in at least one specified zone. |
| `instanceCount` | Yes      | The number of instances to reserve.                                                            |
| `capacityType`  | Yes      | The capacity type: `CAPACITY_TYPE_SERVERLESS` or `CAPACITY_TYPE_CUSTOMER`.                     |
| `zones`         | Yes      | The availability zones for the reservation. The order indicates allocation preference.         |
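
The following Python sketch shows what a `resources` value might look like and checks it against the table above. The instance type and zone names are placeholders, and the `validate_resources` helper is hypothetical. The requirement for at least one zone is inferred from the `instanceId` description. For the exact request schema, see the API reference.

```python
VALID_CAPACITY_TYPES = {"CAPACITY_TYPE_SERVERLESS", "CAPACITY_TYPE_CUSTOMER"}
REQUIRED_FIELDS = ("instanceId", "instanceCount", "capacityType", "zones")


def validate_resources(resources: dict) -> list[str]:
    """Return a list of problems; an empty list means the documented fields look valid."""
    errors = [f"'{key}' is required" for key in REQUIRED_FIELDS if key not in resources]
    if errors:
        return errors

    if resources["capacityType"] not in VALID_CAPACITY_TYPES:
        errors.append(f"unknown capacity type: {resources['capacityType']}")
    if not resources["zones"]:
        # Inferred: the instance type must be valid in at least one listed zone.
        errors.append("'zones' must list at least one zone")

    return errors


# Placeholder values; substitute a real instance type and real zones.
resources = {
    "instanceId": "example-gpu-instance",
    "instanceCount": 2,
    "capacityType": "CAPACITY_TYPE_CUSTOMER",
    "zones": ["EXAMPLE-ZONE-A", "EXAMPLE-ZONE-B"],  # most preferred zone first
}
print(validate_resources(resources))  # []
```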

### Capacity claim status

After creating a capacity claim, check its status to see how many instances are allocated:

| Field                | Description                                                            |
| -------------------- | ---------------------------------------------------------------------- |
| `allocatedInstances` | The number of instances currently allocated and available.             |
| `pendingInstances`   | The number of instances being provisioned (nodes joining the cluster). |
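
One way to interpret these two fields is to compare them with the `instanceCount` you requested. The Python sketch below assumes the status is available as a dictionary with the field names from the table. The `claim_progress` helper and its output keys are hypothetical.

```python
def claim_progress(instance_count: int, status: dict) -> dict:
    """Summarize how much of a capacity claim has been fulfilled."""
    allocated = status.get("allocatedInstances", 0)
    pending = status.get("pendingInstances", 0)
    return {
        "ready": allocated >= instance_count,  # all requested instances are usable
        "allocated": allocated,
        "pending": pending,
        # Requested instances that are neither allocated nor being provisioned yet.
        "unaccounted": max(0, instance_count - allocated - pending),
    }


# A claim for 4 instances: 2 allocated, 1 still joining the cluster.
print(claim_progress(4, {"allocatedInstances": 2, "pendingInstances": 1}))
# {'ready': False, 'allocated': 2, 'pending': 1, 'unaccounted': 1}
```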

### Manage capacity claims

Capacity claims are managed through the [CoreWeave Inference API](/products/inference/reference/api-overview), not through the deployment configuration. Use the API to create, update, and delete claims. For request and response schemas, see the [CapacityClaimService](/products/inference/reference/capacityclaimservice/list-capacity-claims) pages in the API reference. The parameters endpoint returns the available instance types per zone under `zoneInstanceTypes`.
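
Before creating a claim, you might use `zoneInstanceTypes` to find the zones that offer a given instance type. The Python sketch below assumes the value has already been fetched and parsed into a mapping from zone name to a list of instance type IDs. That shape is an assumption, so check the API reference for the actual response schema. The `zones_offering` helper and the sample data are hypothetical.

```python
def zones_offering(zone_instance_types: dict[str, list[str]],
                   instance_id: str) -> list[str]:
    """Return the zones whose instance types include `instance_id`.

    The comparison ignores case because `instanceId` is case-insensitive.
    """
    wanted = instance_id.lower()
    return [
        zone
        for zone, instance_types in zone_instance_types.items()
        if wanted in (t.lower() for t in instance_types)
    ]


# Sample data in the assumed shape; real zone and instance names will differ.
zone_instance_types = {
    "EXAMPLE-ZONE-A": ["example-gpu-large", "example-gpu-small"],
    "EXAMPLE-ZONE-B": ["example-gpu-small"],
}
print(zones_offering(zone_instance_types, "Example-GPU-Large"))  # ['EXAMPLE-ZONE-A']
```

The resulting list can be reordered to express allocation preference before it is passed as `zones` in a capacity claim.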
