A deployment in CoreWeave Inference configures a model serving instance that runs your model on dedicated GPU infrastructure. Each deployment specifies the model weights, inference runtime, GPU type, and scaling parameters.
This page covers how to configure model weights, select an inference runtime, choose GPU resources, and manage traffic across deployments.
Bring Your Own Weights
CoreWeave Inference uses a Bring Your Own Weights (BYOW) model. You provide the model artifacts, and CoreWeave handles the infrastructure required to serve them.
Model weights must be stored in a CoreWeave Object Storage bucket. When you create a deployment, you must specify the bucket name and the path to the model directory within the bucket.
When a deployment starts, CoreWeave loads your model weights onto the selected GPU infrastructure and serves requests through the associated gateway.
CoreWeave Inference doesn’t pull model weights directly from external sources such as Hugging Face. Download your model weights and upload them to a CoreWeave Object Storage bucket before creating a deployment.
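For example, if the model is published on Hugging Face, one way to stage the weights (using the huggingface_hub CLI and the AWS CLI; the repository, bucket, and profile names below are placeholders) is:

# Download the weights locally (requires the huggingface_hub CLI).
huggingface-cli download [ORG]/[MODEL-NAME] --local-dir ./[MODEL-NAME]

# Sync the local copy to your CoreWeave Object Storage bucket.
aws s3 sync ./[MODEL-NAME] s3://[BUCKET-NAME]/models/[MODEL-NAME] \
  --endpoint-url https://cwobject.com \
  --profile [PROFILE-NAME]

The bucket name and the models/[MODEL-NAME] prefix are what you later reference in the deployment's model configuration.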
Grant inference access to your bucket
Dedicated Inference reads your model weights from CoreWeave Object Storage using a dedicated service account. Attach the following bucket policy to your weights bucket so the service can list and read its contents.
Replace [BUCKET-NAME] with the name of the bucket containing your model weights.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowInference",
      "Effect": "Allow",
      "Principal": {
        "CW": [
          "arn:aws:iam::cw4637:coreweave/uvAGGQSxxXeeQBJzcGsD9"
        ]
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::[BUCKET-NAME]",
        "arn:aws:s3:::[BUCKET-NAME]/*"
      ]
    }
  ]
}
If you store models only under a specific prefix, you can narrow object access to that prefix:
"Resource": [
"arn:aws:s3:::[BUCKET-NAME]",
"arn:aws:s3:::[BUCKET-NAME]/path/to/models/*"
]
The bucket-level ARN (arn:aws:s3:::[BUCKET-NAME]) must remain as-is for s3:ListBucket.
Save the policy to a file (for example, inference-bucket-policy.json) and apply it with the AWS CLI. If you have configured a named AWS CLI profile for CoreWeave, pass it with --profile [PROFILE-NAME]. Omit the flag if your CoreWeave credentials are in the default profile.
aws s3api put-bucket-policy \
  --bucket [BUCKET-NAME] \
  --policy file://inference-bucket-policy.json \
  --endpoint-url https://cwobject.com \
  --profile [PROFILE-NAME]
For other tools (s3cmd, Boto3, Terraform) and the full reference, see Manage bucket policies.
The Cloud Console doesn’t support setting bucket policies. Apply the policy with the AWS CLI, an S3 SDK, or Terraform. Organization access policies in the Console are a different mechanism and won’t grant the inference service account access to your bucket.
The Principal value in the preceding policy is the canonical CoreWeave Inference service account and is the same for all customers. If a deployment reaches STATUS_READY but requests fail with a 520 error or time out because the weights can't be loaded, the most common cause is a missing bucket policy.
Once the policy is in place, deployments can read model weights from the bucket when they start.
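To confirm the policy is attached, you can read it back with the AWS CLI (assuming the endpoint supports GetBucketPolicy):

aws s3api get-bucket-policy \
  --bucket [BUCKET-NAME] \
  --endpoint-url https://cwobject.com \
  --profile [PROFILE-NAME]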
Model configuration
The model field in a deployment specifies where to find the model weights and what name to use for routing:
| Field | Required | Description |
|---|---|---|
| name | Yes | The model name used to route inference requests and listed in the /models endpoint. Must be 4-63 characters. |
| bucket | Yes | The CoreWeave Object Storage bucket containing the model weights. |
| path | Yes | The path within the bucket to the model and configuration files. |
The model name is how clients identify which deployment should handle their request. For body-based routing, the model field in the request body must match this name. For path-based routing, the model name is the first segment of the URL path.
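As an illustrative sketch (the name, bucket, and path values are placeholders), the model section of a deployment might look like:

"model": {
  "name": "llama-3-8b-instruct",
  "bucket": "[BUCKET-NAME]",
  "path": "models/llama-3-8b-instruct"
}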
Inference runtimes
CoreWeave manages the inference runtime. The runtime field configures which engine and version to use. You must specify an engine when creating a deployment.
Supported engines
| Engine | Description |
|---|---|
| vllm | Inference engine for large language models. Supports continuous batching, paged attention, and tensor parallelism. |
Runtime configuration
| Field | Required | Description |
|---|---|---|
| engine | Yes | The inference engine. Supported value: vllm. |
| version | No | The engine version in semantic versioning (SemVer) format. Defaults to the latest version if not set. |
| engineConfig | No | A map of engine-specific configuration options as key-value pairs. |
Query the available runtime versions and configuration options from the deployment parameters endpoint:
curl "${CW_BASE_URL}/v1alpha1/inference/deployments/parameters" \
-H "Authorization: Bearer ${CW_API_TOKEN}"
Engine configuration options
The engineConfig field accepts engine-specific key-value pairs that control model serving behavior. Supported keys for vllm:
| Key | Description |
|---|---|
| max-model-len | Maximum sequence length (context window) the model can handle. Reduce this to decrease memory usage for workloads with shorter sequences. |
| reasoning-parser | Parser for structured reasoning output from thinking models. Set to the parser matching your model (for example, qwen3, deepseek_r1). |
| enable-auto-tool-choice | Enables automatic tool selection for tool-calling models. Set the value to an empty string to enable. |
| tool-call-parser | Parser for tool call output format. Set to the parser matching your model (for example, llama3_json, qwen3_coder). |
| max-num-batched-tokens | Maximum number of tokens processed in a single batch. Controls the throughput and latency tradeoff. |
| structured-outputs-config.backend | Backend for structured output generation. Options: guidance, xgrammar. |
The deployment parameters endpoint returns the full list of allowed configuration keys for each engine under runtimeParameters.runtimeConfigOptions.
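As a sketch, a runtime section for a Llama-3-style tool-calling model might look like the following; the values are illustrative, so confirm the allowed keys and parser names against the parameters endpoint:

"runtime": {
  "engine": "vllm",
  "engineConfig": {
    "max-model-len": "16384",
    "enable-auto-tool-choice": "",
    "tool-call-parser": "llama3_json"
  }
}

Because version is omitted, the deployment uses the latest engine version.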
GPU selection
Deployments run on dedicated GPU infrastructure. The resources field configures the hardware allocation.
| Field | Required | Description |
|---|---|---|
| instanceType | Yes | The GPU instance type. Query available types from the deployment parameters endpoint. |
| gpuCount | Yes | The number of GPUs allocated to each replica. Allowed values: 1, 2, 4, 8, 16. CPU and RAM are automatically assigned based on GPU count. |
Query the available instance types from the deployment parameters endpoint:
curl "${CW_BASE_URL}/v1alpha1/inference/deployments/parameters" \
-H "Authorization: Bearer ${CW_API_TOKEN}"
The response includes the available instance types under resourceParameters.instanceTypes.
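For example, to list only the instance types from the response (assuming jq is installed):

curl -s "${CW_BASE_URL}/v1alpha1/inference/deployments/parameters" \
  -H "Authorization: Bearer ${CW_API_TOKEN}" \
  | jq '.resourceParameters.instanceTypes'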
Choose an instance type
Match the instance type to your model’s requirements:
- Model size: Choose a GPU with enough memory to fit your model weights and the inference runtime’s working memory (KV cache).
- Throughput: Higher-end GPUs (H200, B200) provide more compute and memory bandwidth for faster inference.
- Multi-GPU: For models that exceed a single GPU’s memory, increase gpuCount to allocate additional GPUs per replica.
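Putting this together, a resources block for a model sharded across two GPUs might look like this (the instance type is a placeholder; use a value returned by the parameters endpoint):

"resources": {
  "instanceType": "[INSTANCE-TYPE]",
  "gpuCount": 2
}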
Traffic weights
The traffic field controls how traffic is distributed when multiple deployments share the same model name on the same gateway.
| Field | Description |
|---|---|
| weight | An integer from 0 to 1000. Weights across all deployments with the same model name are normalized into percentages. |
Traffic weights enable canary deployments and A/B testing. For example, if deployment A has weight: 900 and deployment B has weight: 100, deployment A receives 90% of traffic and deployment B receives 10%.
Setting weight to 0 stops traffic to a deployment without deleting it.
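In the example above, deployment A's traffic section would be:

"traffic": {
  "weight": 900
}

and deployment B's would set weight to 100. Changing A's weight to 0 later would shift all traffic to B without deleting A.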
Disable a deployment
Set the disabled field to true to stop a deployment from serving traffic without deleting it. A disabled deployment retains its configuration. To re-enable it, set disabled to false.
Deployment lifecycle
Deployments go through the following states:
| Status | Description |
|---|---|
| STATUS_CREATING | The deployment is being provisioned and model weights are loading. |
| STATUS_READY | The deployment is ready and serving requests. |
| STATUS_UPDATING | The deployment configuration is being updated. Existing replicas continue serving traffic during updates. |
| STATUS_DELETING | The deployment is being removed. Existing requests are given a grace period to complete. |
| STATUS_ERROR | The deployment encountered an error. Check status.conditions for details. |
| STATUS_FAILED | The deployment failed to start. Check status.conditions for details. |
Manage deployments
Manage deployments through the CoreWeave Inference API. For per-operation request and response schemas, see the DeploymentService pages in the API reference.
Update requests (PATCH) require the complete deployment specification, not just the fields being changed. Omitted fields revert to their defaults.
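As a rough sketch only, where deployment.json holds the full specification and the exact endpoint path and schema come from the DeploymentService reference:

# Illustrative sketch: confirm the exact path and request schema in the
# DeploymentService API reference. deployment.json must contain the complete
# specification (model, runtime, resources, traffic, and so on), not only the
# fields being changed.
curl -X PATCH "${CW_BASE_URL}/v1alpha1/inference/deployments/[DEPLOYMENT-ID]" \
  -H "Authorization: Bearer ${CW_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d @deployment.json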
For a step-by-step walkthrough, see the Getting started guide.