# Scaling

> Configure autoscaling and reserve capacity for inference deployments

CoreWeave Inference supports autoscaling to match your workload demands. This page explains how to configure autoscaling and reserve GPU capacity so that your deployments respond to changing traffic while the resources they need stay available. For pricing details, see [Billing](/products/inference/billing).

## Autoscaling

Autoscaling adjusts the number of replicas serving a [deployment](/products/inference/models) so that capacity tracks demand without manual intervention. You set the autoscaling parameters on each deployment. CoreWeave runs the autoscaling logic: it monitors request load and adjusts the replica count within your configured bounds.

### Scaling parameters

The `autoscaling` field on a deployment sets the bounds and behavior that the autoscaler applies:

| Field             | Required | Description                                                                                                                                                             |
| ----------------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `min`             | Yes      | Minimum number of replicas that are always running. Must be at least `1`. Scale-to-zero is not supported.                                                               |
| `max`             | Yes      | Maximum number of replicas. Must be greater than or equal to `min`. To disable autoscaling, set `max` equal to `min`.                                                   |
| `priority`        | No       | Scaling priority relative to other deployments, from 0 to 1000. Higher values receive scaling preference during resource contention.                                    |
| `concurrency`     | No       | Target concurrent requests per replica. Controls the latency-throughput tradeoff: lower values reduce latency, higher values increase throughput. Must be at least `1`. |
| `capacityClasses` | No       | The capacity types to use for scaling. Options: `CAPACITY_CLASS_RESERVED`, `CAPACITY_CLASS_ON_DEMAND`.                                                                  |
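
You can check the constraints in this table on the client before you send a deployment request. The following Python sketch is illustrative only: `validate_autoscaling` is a hypothetical helper, not part of any CoreWeave SDK, and it assumes that `capacityClasses` is passed as a list of strings.

```python
from typing import Any

ALLOWED_CAPACITY_CLASSES = {"CAPACITY_CLASS_RESERVED", "CAPACITY_CLASS_ON_DEMAND"}


def validate_autoscaling(cfg: dict[str, Any]) -> list[str]:
    """Return a list of constraint violations (empty if cfg looks valid)."""
    errors: list[str] = []
    min_r, max_r = cfg.get("min"), cfg.get("max")
    if not isinstance(min_r, int) or min_r < 1:
        errors.append("min is required and must be at least 1")
    if not isinstance(max_r, int):
        errors.append("max is required")
    elif isinstance(min_r, int) and max_r < min_r:
        errors.append("max must be greater than or equal to min")
    priority = cfg.get("priority")
    if priority is not None and not 0 <= priority <= 1000:
        errors.append("priority must be between 0 and 1000")
    concurrency = cfg.get("concurrency")
    if concurrency is not None and concurrency < 1:
        errors.append("concurrency must be at least 1")
    # Assumption: capacityClasses is a list of enum names.
    unknown = set(cfg.get("capacityClasses") or []) - ALLOWED_CAPACITY_CLASSES
    if unknown:
        errors.append(f"unknown capacityClasses: {sorted(unknown)}")
    return errors
```

An empty list means the configuration satisfies the documented constraints. The API remains the source of truth and may apply checks that this sketch doesn't cover.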

### How autoscaling works

CoreWeave monitors request queue depth and active request counts across your deployment's replicas to decide when to add or remove replicas. When demand exceeds current capacity, the autoscaler adds replicas up to `max`. When demand drops, the autoscaler removes replicas down to `min`.

Scaling priority (`priority`) determines which deployments scale first when multiple deployments contend for the same GPU resources. A deployment with `priority: 1000` scales before one with `priority: 100`.
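
To build intuition for how these settings interact, the following Python sketch models a simplified autoscaler. It is not CoreWeave's actual algorithm, which isn't documented here. It assumes that the target replica count is the number of in-flight requests (queued plus active) divided by `concurrency`, rounded up and clamped to `min` and `max`. The function names and the `name` key are hypothetical.

```python
import math


def desired_replicas(in_flight: int, concurrency: int, min_r: int, max_r: int) -> int:
    """Replicas needed to keep per-replica load at or below the target,
    clamped to the configured [min_r, max_r] bounds."""
    needed = math.ceil(in_flight / concurrency)
    return max(min_r, min(max_r, needed))


def scale_order(deployments: list[dict]) -> list[str]:
    """Names of deployments ordered by priority, highest first."""
    ranked = sorted(deployments, key=lambda d: d["priority"], reverse=True)
    return [d["name"] for d in ranked]


# 45 in-flight requests at a target of 8 per replica needs 6 replicas.
print(desired_replicas(45, 8, min_r=1, max_r=10))
# Under contention, the priority-1000 deployment is considered first.
print(scale_order([{"name": "batch", "priority": 100}, {"name": "chat", "priority": 1000}]))
```

In this model, lowering `concurrency` raises the replica target for the same load, so the deployment scales out sooner.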

### Best practices

Follow these guidelines to optimize autoscaling for your workloads:

* **Set `min` based on latency requirements.** A higher minimum avoids cold-start delays when new requests arrive during low-traffic periods. Each replica must load model weights before it can serve requests.
* **Set `max` to control cost.** Each running replica consumes billable GPU resources. Set the maximum to the highest replica count your budget allows.
* **Match GPU type to model size.** Choose an instance type with enough GPU memory to fit your model weights and the inference runtime's working memory. Over-provisioning wastes resources, and under-provisioning causes out-of-memory failures.
* **Use `concurrency` to tune latency.** For latency-sensitive workloads, set a lower concurrency target so the autoscaler adds replicas sooner. For throughput-oriented workloads, set a higher value to maximize GPU utilization per replica.
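
As a rough sizing aid for the `max` and GPU-type guidelines above, the following Python sketch shows the arithmetic involved. All numbers are placeholders rather than CoreWeave rates or specifications, both helper names are hypothetical, and the 20% memory headroom for the runtime's working memory is an assumption that you should replace with measurements from your own model.

```python
import math


def max_replicas_for_budget(hourly_budget: float, gpu_hour_rate: float, gpus_per_replica: int) -> int:
    """Largest replica count whose hourly GPU cost stays within the budget."""
    return math.floor(hourly_budget / (gpu_hour_rate * gpus_per_replica))


def fits_in_gpu_memory(
    params_billions: float,
    bytes_per_param: int,
    gpu_memory_gb: float,
    gpus: int,
    headroom: float = 0.2,  # assumed fraction kept free for working memory
) -> bool:
    """Check whether the model weights fit in the memory left after headroom."""
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    weights_gb = params_billions * bytes_per_param
    return weights_gb <= gpu_memory_gb * gpus * (1 - headroom)


# Placeholder inputs: a budget of 100 per hour, a rate of 2.5 per GPU-hour, 8 GPUs per replica.
print(max_replicas_for_budget(100.0, 2.5, 8))
# A 70B-parameter model at 2 bytes per parameter on eight 80 GB GPUs.
print(fits_in_gpu_memory(70, 2, gpu_memory_gb=80, gpus=8))
```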

## Capacity claims

Autoscaling adjusts replica count within available resources, but it doesn't guarantee that GPU capacity is available. For workloads that require guaranteed capacity, create capacity claims to reserve hardware so that it stays available to your deployments, even during periods of high demand.

### How capacity claims work

A capacity claim reserves a specified number of GPU instances in an [Availability Zone](/platform/regions/all-availability-zones). Reserving capacity is separate from using it: the claim only sets hardware aside, while a deployment's [`capacityClasses`](#scaling-parameters) setting controls whether its replicas can schedule onto that reserved capacity.

### Capacity types

The `capacityType` field determines where the reserved instances come from. Regardless of type, you pay for the reserved instances for as long as the claim exists, whether or not a deployment is actively using them.

* **`CAPACITY_TYPE_SERVERLESS`**: Reserves instances from CoreWeave's managed inference capacity pool. These instances are [billed on demand](/products/inference/billing#billing-models) at GPU-hour rates. Availability depends on the capacity CoreWeave has free in the requested zone.
* **`CAPACITY_TYPE_CUSTOMER`**: Reserves capacity against your organization's own [instance quota](/products/cks/clusters/quotas). CoreWeave fulfills the claim by joining Nodes to the inference cluster, which usually takes 10 to 20 minutes and is reflected in the claim's `pendingInstances` count. You must have enough instance quota available in the requested zone, or the claim is rejected until quota frees up. These instances are billed like the rest of your organization's capacity, not at an inference-specific rate: those within your reserved quantity are billed at your reserved rate, and any beyond it are billed on demand. See [Billing](/products/inference/billing#billing-models).
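
The billing split for `CAPACITY_TYPE_CUSTOMER` instances can be illustrated with a small calculation. This Python sketch is a simplified model under stated assumptions, not a billing implementation: `split_customer_billing` is a hypothetical helper, and `reserved_quantity` stands for your organization's reserved instance count.

```python
def split_customer_billing(instances: int, reserved_quantity: int) -> tuple[int, int]:
    """Split an instance count into (billed at reserved rate, billed on demand)."""
    at_reserved_rate = min(instances, reserved_quantity)
    on_demand = instances - at_reserved_rate
    return at_reserved_rate, on_demand


# 10 instances against a reserved quantity of 8:
# 8 at the reserved rate, 2 on demand.
print(split_customer_billing(10, 8))
```

For actual rates and billing rules, see [Billing](/products/inference/billing#billing-models).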

### Capacity classes

A deployment's [`capacityClasses`](#scaling-parameters) setting controls which capacity its replicas can schedule onto. A claim sets capacity aside, but a replica runs on it only when the deployment's `capacityClasses` value allows it:

| `capacityClasses` value    | Where replicas can schedule                                                                             |
| -------------------------- | ------------------------------------------------------------------------------------------------------- |
| `CAPACITY_CLASS_ON_DEMAND` | Unclaimed on-demand capacity only.                                                                      |
| `CAPACITY_CLASS_RESERVED`  | Reserved capacity from `CAPACITY_TYPE_SERVERLESS` and `CAPACITY_TYPE_CUSTOMER` claims.                  |
| Unset                      | All capacity: unclaimed on-demand, plus `CAPACITY_TYPE_SERVERLESS` and `CAPACITY_TYPE_CUSTOMER` claims. |
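
The table above can be expressed as a lookup. The following Python sketch is illustrative only: `schedulable_pools` and the pool labels are hypothetical, and it assumes that `capacityClasses` is a list and that listing both classes yields the union of their pools.

```python
from typing import Optional

ON_DEMAND = "unclaimed on-demand"
SERVERLESS = "CAPACITY_TYPE_SERVERLESS claims"
CUSTOMER = "CAPACITY_TYPE_CUSTOMER claims"


def schedulable_pools(capacity_classes: Optional[list[str]]) -> set[str]:
    """Return the capacity pools that a deployment's replicas can schedule onto."""
    if not capacity_classes:
        # Unset: all capacity is eligible.
        return {ON_DEMAND, SERVERLESS, CUSTOMER}
    pools: set[str] = set()
    if "CAPACITY_CLASS_ON_DEMAND" in capacity_classes:
        pools.add(ON_DEMAND)
    if "CAPACITY_CLASS_RESERVED" in capacity_classes:
        pools |= {SERVERLESS, CUSTOMER}
    return pools


print(schedulable_pools(["CAPACITY_CLASS_RESERVED"]))
```

A deployment set to `CAPACITY_CLASS_ON_DEMAND` never lands on claimed instances, so creating a claim alone doesn't move its replicas onto reserved hardware.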

### Capacity claim configuration

The `resources` field on a capacity claim describes the hardware to reserve:

| Field           | Required | Description                                                                                                       |
| --------------- | -------- | ----------------------------------------------------------------------------------------------------------------- |
| `instanceId`    | Yes      | The instance type to reserve (case-insensitive). Must be valid in the specified zone.                             |
| `instanceCount` | Yes      | The number of instances to reserve.                                                                               |
| `capacityType`  | Yes      | The capacity type, `CAPACITY_TYPE_SERVERLESS` or `CAPACITY_TYPE_CUSTOMER`. See [Capacity types](#capacity-types). |
| `zones`         | Yes      | The Availability Zone to reserve in. A claim can currently specify only one zone.                                 |

The `instanceId` must be an instance type that's available for capacity claims in the zone you request. Query the capacity claim parameters endpoint to see the available instance types per zone, returned under `zoneInstanceTypes`. Replace `[API-TOKEN]` with your [API access token](/security/authn-authz/manage-api-access-tokens):

```bash theme={"system"}
curl "https://api.coreweave.com/v1alpha1/inference/capacityclaims/parameters" \
  -H "Authorization: Bearer [API-TOKEN]"
```

For the full list of GPU instance types and their specifications, see [GPU instances](/platform/instances/gpu-instances).
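
Before you create a claim, you can check the requested instance type against the parameters response. The following Python sketch makes several assumptions that aren't confirmed on this page: that `zoneInstanceTypes` maps each zone name to a list of instance type names, and that `zones` is sent as a one-element list. `build_resources` and the zone and instance names are hypothetical. See the API reference for the real schemas.

```python
CAPACITY_TYPES = {"CAPACITY_TYPE_SERVERLESS", "CAPACITY_TYPE_CUSTOMER"}


def build_resources(
    instance_id: str,
    instance_count: int,
    capacity_type: str,
    zone: str,
    zone_instance_types: dict[str, list[str]],
) -> dict:
    """Build a `resources` value after checking it against the parameters response."""
    if capacity_type not in CAPACITY_TYPES:
        raise ValueError(f"unknown capacityType: {capacity_type}")
    # instanceId is case-insensitive, so compare lowercased names.
    available = {name.lower() for name in zone_instance_types.get(zone, [])}
    if instance_id.lower() not in available:
        raise ValueError(f"{instance_id} is not available for capacity claims in {zone}")
    return {
        "instanceId": instance_id,
        "instanceCount": instance_count,
        "capacityType": capacity_type,
        "zones": [zone],  # a claim can currently specify only one zone
    }


# Hypothetical zone and instance type names, for illustration only.
params = {"ZONE-A": ["example-gpu-8x"], "ZONE-B": []}
resources = build_resources("EXAMPLE-GPU-8X", 2, "CAPACITY_TYPE_SERVERLESS", "ZONE-A", params)
print(resources)
```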

### Capacity claim status

After creating a capacity claim, check its status to see how many instances are allocated and how many are still being provisioned:

| Field                | Description                                                            |
| -------------------- | ---------------------------------------------------------------------- |
| `allocatedInstances` | The number of instances currently allocated and available.             |
| `pendingInstances`   | The number of instances being provisioned (Nodes joining the cluster). |
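
One way to interpret these two fields together is sketched below in Python. The `claim_progress` helper and its return labels are hypothetical, and the sketch assumes that the status is available as a dictionary holding the two counts. The API reference documents the actual response shape.

```python
def claim_progress(status: dict, requested: int) -> str:
    """Summarize a claim's status relative to the requested instance count."""
    allocated = status.get("allocatedInstances", 0)
    pending = status.get("pendingInstances", 0)
    if allocated >= requested:
        return "ready"  # all requested instances are allocated
    if pending > 0:
        return "provisioning"  # Nodes are still joining the cluster
    return "short"  # fewer instances than requested, and none pending


print(claim_progress({"allocatedInstances": 2, "pendingInstances": 2}, requested=4))
```

For `CAPACITY_TYPE_CUSTOMER` claims, a "provisioning" result is expected for the first 10 to 20 minutes while Nodes join the inference cluster.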

### Update and delete behavior

You can change a capacity claim's instance count after it's created.

For `CAPACITY_TYPE_SERVERLESS` claims, a reduction in instance count can take effect after a delay, because CoreWeave limits how often these claims can scale down. Pending changes appear in the claim's status.

Deleting a capacity claim evicts any deployments running on its instances and reschedules them onto other capacity if any is available. For `CAPACITY_TYPE_CUSTOMER` claims, the released instances return to your quota, where you can reallocate them to other workloads such as CKS clusters.

### Manage capacity claims

Create, update, and delete capacity claims through the [CoreWeave Inference API](/products/inference/reference/api-overview) rather than through the deployment configuration. For request and response schemas, see the [CapacityClaimService](/products/inference/reference/api-overview/capacityclaimservice/list-capacity-claims) pages in the API reference. To list the instance types available in each zone, query the parameters endpoint described in [Capacity claim configuration](#capacity-claim-configuration).