A deployment in CoreWeave Inference configures a model serving instance that runs your model on dedicated GPU infrastructure. Each deployment specifies the model weights, inference runtime, GPU type, and scaling parameters. This page is for users who create or manage CoreWeave Inference deployments. It covers how to configure model weights, select an inference runtime, choose GPU resources, and manage traffic across deployments so you can serve your models reliably on dedicated infrastructure.

Bring your own weights

CoreWeave Inference uses a Bring Your Own Weights (BYOW) model. You provide the model artifacts, and CoreWeave handles the infrastructure required to serve them. Model weights must be stored in a CoreWeave Object Storage bucket. When you create a deployment, you must specify the bucket name and the path to the model directory within the bucket. When a deployment starts, CoreWeave loads your model weights onto the selected GPU infrastructure and serves requests through the associated gateway.
CoreWeave Inference doesn’t pull model weights directly from external sources such as Hugging Face. Download your model weights and upload them to an Object Storage bucket before creating a deployment.

Grant inference access to your bucket

Inference reads your model weights from Object Storage using a dedicated service account. Attach the following bucket policy to your weights bucket so the service can list and read its contents. Replace [BUCKET-NAME] with the name of the bucket containing your model weights.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowInference",
      "Effect": "Allow",
      "Principal": {
        "CW": [
          "arn:aws:iam::cw4637:coreweave/uvAGGQSxxXeeQBJzcGsD9"
        ]
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::[BUCKET-NAME]",
        "arn:aws:s3:::[BUCKET-NAME]/*"
      ]
    }
  ]
}
If you store models only under a specific prefix, you can narrow object access to that prefix:
"Resource": [
  "arn:aws:s3:::[BUCKET-NAME]",
  "arn:aws:s3:::[BUCKET-NAME]/path/to/models/*"
]
The bucket-level ARN (arn:aws:s3:::[BUCKET-NAME]) must remain as-is for s3:ListBucket.
Save the policy to a file (for example, inference-bucket-policy.json) and apply it with the AWS CLI. If you’ve configured a named AWS CLI profile for CoreWeave, pass it with --profile [PROFILE-NAME]. Omit the flag if your CoreWeave credentials are in the default profile.
aws s3api put-bucket-policy \
  --bucket [BUCKET-NAME] \
  --policy file://inference-bucket-policy.json \
  --endpoint-url https://cwobject.com \
  --profile [PROFILE-NAME]
For other tools (s3cmd, Boto3, Terraform) and the full reference, see Manage bucket policies.
The Cloud Console doesn’t support setting bucket policies. Apply the policy with the AWS CLI, an S3 SDK, or Terraform. Organization access policies in the Console are a different mechanism and won’t grant the inference service account access to your bucket.
The Principal value in the preceding policy is the canonical CoreWeave Inference service account and is the same for all customers. If a deployment fails to load weights with a 520 or timeout despite reaching STATUS_READY, a missing bucket policy is the most common cause.
After the policy is in place, deployments can read model weights from the bucket when they start.
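If you manage several buckets or prefixes, generating the policy file can be less error-prone than editing JSON by hand. The following is a minimal sketch, not an official tool: the function names build_inference_policy and write_policy are hypothetical, and the only values taken from this page are the policy structure and the Principal string. It only writes the JSON file; applying it still uses the put-bucket-policy command shown earlier.

```python
import json

# Principal copied from the policy on this page; it is the same for all customers.
INFERENCE_PRINCIPAL = "arn:aws:iam::cw4637:coreweave/uvAGGQSxxXeeQBJzcGsD9"


def build_inference_policy(bucket, prefix=None):
    """Return the bucket policy as a dict, optionally scoped to a prefix.

    Hypothetical helper: `bucket` and `prefix` are placeholders you supply.
    """
    bucket_arn = f"arn:aws:s3:::{bucket}"
    if prefix:
        object_arn = f"{bucket_arn}/{prefix.strip('/')}/*"
    else:
        object_arn = f"{bucket_arn}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowInference",
                "Effect": "Allow",
                "Principal": {"CW": [INFERENCE_PRINCIPAL]},
                "Action": ["s3:ListBucket", "s3:GetObject"],
                # The bucket-level ARN stays unscoped so s3:ListBucket keeps working.
                "Resource": [bucket_arn, object_arn],
            }
        ],
    }


def write_policy(file_path, bucket, prefix=None):
    """Write the policy to a JSON file suitable for `--policy file://...`."""
    with open(file_path, "w", encoding="utf-8") as handle:
        json.dump(build_inference_policy(bucket, prefix), handle, indent=2)
```

For example, write_policy("inference-bucket-policy.json", "my-weights", "path/to/models") would produce a file that limits object access to that prefix while leaving the bucket-level ARN in place.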

Model configuration

The model field in a deployment specifies where to find the model weights and what name to use for routing:
| Field | Required | Description |
| --- | --- | --- |
| name | Yes | The model name used to route inference requests and listed in the /models endpoint. Must be 4-63 characters. |
| bucket | Yes | The Object Storage bucket containing the model weights. |
| path | Yes | The path within the bucket to the model and configuration files. |
The model name is how clients identify which deployment should handle their request. For body-based routing, the model field in the request body must match this name. For path-based routing, the model name is the first segment of the URL path.
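The model block and the two routing styles can be sketched as follows. This is an illustration under assumptions: the helper names, the gateway URL, and the endpoint path passed to path_routed_url are hypothetical, and the request payload shape beyond the model field is not specified on this page. Only the field names (name, bucket, path) and the 4-63 character limit come from the table above.

```python
def build_model_config(name, bucket, path):
    """Build the `model` block, enforcing the documented name length."""
    if not 4 <= len(name) <= 63:
        raise ValueError("model name must be 4-63 characters")
    if not bucket or not path:
        raise ValueError("bucket and path are required")
    return {"name": name, "bucket": bucket, "path": path}


def with_model_field(model_name, payload):
    """Body-based routing: the request body's `model` field must match the name."""
    return {**payload, "model": model_name}


def path_routed_url(gateway_url, model_name, endpoint_path):
    """Path-based routing: the model name is the first URL path segment.

    `gateway_url` and `endpoint_path` are hypothetical placeholders.
    """
    return f"{gateway_url.rstrip('/')}/{model_name}/{endpoint_path.lstrip('/')}"
```

Validating the name length client-side is a design convenience here: it surfaces a bad name before a create request is sent, rather than after.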

Inference runtimes

The inference runtime is the engine that loads your model weights and serves requests. CoreWeave manages the inference runtime. The runtime field configures which engine and version to use. You must specify an engine when you create a deployment.

Supported engines

| Engine | Description |
| --- | --- |
| vllm | Inference engine for large language models. Supports continuous batching, paged attention, and tensor parallelism. |

Runtime configuration

| Field | Required | Description |
| --- | --- | --- |
| engine | Yes | The inference engine. Supported value: vllm. |
| version | No | The engine version in semantic versioning (SemVer) format. Defaults to the latest version if not set. |
| engineConfig | No | A map of engine-specific configuration options as key-value pairs. |
Query the available runtime versions and configuration options from the deployment parameters endpoint:
curl "${CW_BASE_URL}/v1alpha1/inference/deployments/parameters" \
  -H "Authorization: Bearer ${CW_API_TOKEN}"

Engine configuration options

The engineConfig field accepts engine-specific key-value pairs that control model serving behavior. Supported keys for vllm:
| Key | Description |
| --- | --- |
| max-model-len | Maximum sequence length (context window) the model can handle. Reduce this to decrease memory usage for workloads with shorter sequences. |
| reasoning-parser | Parser for structured reasoning output from thinking models. Set to the parser matching your model (for example, qwen3, deepseek_r1). |
| enable-auto-tool-choice | Enables automatic tool selection for tool-calling models. Set the value to an empty string to enable. |
| tool-call-parser | Parser for tool call output format. Set to the parser matching your model (for example, llama3_json, qwen3_coder). |
| max-num-batched-tokens | Maximum number of tokens processed in a single batch. Controls the throughput and latency tradeoff. |
| structured-outputs-config.backend | Backend for structured output generation. Options: guidance, xgrammar. |
The deployment parameters endpoint returns the full list of allowed configuration keys for each engine under runtimeParameters.runtimeConfigOptions.
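A runtime block with engineConfig might be assembled as in the sketch below. Two things here are assumptions rather than documented behavior: that engineConfig values are sent as strings (suggested by the empty-string convention for enable-auto-tool-choice), and that checking keys against the subset listed on this page is useful. The authoritative key list comes from the deployment parameters endpoint, so treat KNOWN_VLLM_KEYS as a local convenience only. The helper name build_runtime is hypothetical.

```python
# Keys listed on this page; the parameters endpoint may allow more.
KNOWN_VLLM_KEYS = {
    "max-model-len",
    "reasoning-parser",
    "enable-auto-tool-choice",
    "tool-call-parser",
    "max-num-batched-tokens",
    "structured-outputs-config.backend",
}


def build_runtime(engine_config, version=None):
    """Build the `runtime` block for the vllm engine.

    Assumption: values are serialized as strings. `version` is omitted when
    not given, so the documented default (latest) applies.
    """
    unknown = sorted(set(engine_config) - KNOWN_VLLM_KEYS)
    if unknown:
        raise ValueError(f"unrecognized engineConfig keys: {unknown}")
    runtime = {
        "engine": "vllm",
        "engineConfig": {key: str(value) for key, value in engine_config.items()},
    }
    if version is not None:
        runtime["version"] = version
    return runtime
```

For a tool-calling model, a call such as build_runtime({"max-model-len": 8192, "enable-auto-tool-choice": "", "tool-call-parser": "llama3_json"}) would enable automatic tool choice by passing the empty string described in the table.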

GPU selection

Deployments run on dedicated GPU infrastructure. The resources field configures the hardware allocation that each replica uses to serve your model.
| Field | Required | Description |
| --- | --- | --- |
| instanceType | Yes | The GPU instance type. Query available types from the deployment parameters endpoint. |
| gpuCount | Yes | The number of GPUs allocated to each replica. Allowed values: 1, 2, 4, 8, 16. CoreWeave automatically assigns CPU and RAM based on GPU count. |
Query the available instance types from the deployment parameters endpoint:
curl "${CW_BASE_URL}/v1alpha1/inference/deployments/parameters" \
  -H "Authorization: Bearer ${CW_API_TOKEN}"
The response includes the available instance types under resourceParameters.instanceTypes.
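The same query can be made from a script. The sketch below uses only the Python standard library and assumes the response is a JSON object containing the two documented paths, resourceParameters.instanceTypes and runtimeParameters.runtimeConfigOptions. The shape of the values under those keys is not described on this page, so the code returns them unmodified. The function names are hypothetical.

```python
import json
import urllib.request

PARAMETERS_PATH = "/v1alpha1/inference/deployments/parameters"


def fetch_parameters(base_url, token):
    """GET the deployment parameters endpoint and return the parsed JSON."""
    request = urllib.request.Request(
        base_url.rstrip("/") + PARAMETERS_PATH,
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def summarize_parameters(payload):
    """Pull out the two documented sections, tolerating missing keys."""
    resources = payload.get("resourceParameters", {})
    runtime = payload.get("runtimeParameters", {})
    return {
        "instanceTypes": resources.get("instanceTypes", []),
        "runtimeConfigOptions": runtime.get("runtimeConfigOptions", []),
    }
```

A typical use would be summarize_parameters(fetch_parameters(os.environ["CW_BASE_URL"], os.environ["CW_API_TOKEN"])), reusing the environment variables from the curl example (with import os added).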

Choose an instance type

Match the instance type to your model’s requirements:
  • Model size: Choose a GPU with enough memory to fit your model weights and the inference runtime’s working memory (KV cache).
  • Throughput: Higher-end GPUs (H200, B200) provide more compute and memory bandwidth for faster inference.
  • Multi-GPU: For models that exceed a single GPU’s memory, increase gpuCount to allocate additional GPUs per replica.
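As a rough starting point for the sizing guidance above, you can estimate weight memory from parameter count and precision, add headroom for the KV cache and runtime overhead, and pick the smallest allowed gpuCount that fits. This is a back-of-the-envelope sketch, not CoreWeave's sizing method: the 30% default headroom is an arbitrary assumption, real KV cache needs depend on context length and batch size, and the per-GPU memory figure is something you supply for your chosen instance type.

```python
# Allowed gpuCount values from the resources table on this page.
ALLOWED_GPU_COUNTS = (1, 2, 4, 8, 16)


def estimate_gpu_count(param_count, bytes_per_param, gpu_memory_gb, headroom=0.3):
    """Return the smallest allowed gpuCount whose total memory fits the model.

    headroom is an assumed fraction added on top of the weights for KV cache
    and runtime working memory. Returns None if even 16 GPUs are not enough.
    """
    weights_gb = param_count * bytes_per_param / 1e9
    required_gb = weights_gb * (1 + headroom)
    for count in ALLOWED_GPU_COUNTS:
        if count * gpu_memory_gb >= required_gb:
            return count
    return None
```

For example, a 70-billion-parameter model at 2 bytes per parameter needs about 140 GB for weights alone, or about 182 GB with the assumed headroom. On GPUs with 80 GB each, two GPUs (160 GB) fall short, so the estimate lands on a gpuCount of 4.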

Traffic weights

When multiple deployments serve the same model name on the same gateway, you can split inference requests between them to support canary releases and A/B testing. The traffic field controls how CoreWeave distributes requests across those deployments.
| Field | Description |
| --- | --- |
| weight | An integer from 0 to 1000. CoreWeave normalizes weights across all deployments with the same model name into percentages. |
For example, if deployment A has weight: 900 and deployment B has weight: 100, deployment A receives 90% of traffic and deployment B receives 10%. Setting weight to 0 stops traffic to a deployment without deleting it.
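The normalization can be reproduced locally to check a planned split before applying it. The sketch below divides each weight by the sum of all weights for the same model name. The function name traffic_shares is hypothetical, and how the service behaves when every weight is 0 is not documented here, so returning 0% for all deployments in that case is an assumption.

```python
def traffic_shares(weights):
    """Convert integer weights (0-1000) into percentage shares per deployment."""
    for name, weight in weights.items():
        if not isinstance(weight, int) or not 0 <= weight <= 1000:
            raise ValueError(f"weight for {name!r} must be an integer from 0 to 1000")
    total = sum(weights.values())
    if total == 0:
        # Assumption: with no positive weight, no deployment receives traffic.
        return {name: 0.0 for name in weights}
    return {name: 100.0 * weight / total for name, weight in weights.items()}
```

Because only the ratio matters, weights of 9 and 1 give the same 90/10 split as 900 and 100. The wider range allows finer splits, such as 999 to 1 for a 0.1% canary.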

Disable a deployment

Set the disabled field to true to stop a deployment from serving traffic without deleting it. A disabled deployment retains its configuration. To re-enable it, set disabled to false.

Deployment lifecycle

A deployment reports its current state through the status field so you can track provisioning, updates, and failures. Deployments go through the following states:
| Status | Description |
| --- | --- |
| STATUS_CREATING | CoreWeave is provisioning the deployment and loading model weights. |
| STATUS_READY | The deployment is ready and serving requests. |
| STATUS_UPDATING | CoreWeave is updating the deployment configuration. Existing replicas continue serving traffic during updates. |
| STATUS_DELETING | CoreWeave is removing the deployment. Existing requests have a grace period to complete. |
| STATUS_ERROR | The deployment encountered an error. Check status.conditions for details. |
| STATUS_FAILED | The deployment failed to start. Check status.conditions for details. |
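A client that waits for a deployment to become ready can treat these states as in the sketch below. It is deliberately decoupled from the API: get_status is a hypothetical callable you provide that returns the current status string, since the exact request for reading a deployment is covered in the API reference rather than on this page. Treating STATUS_ERROR as a reason to stop waiting is an assumption; the page only says to check status.conditions.

```python
import time

IN_PROGRESS = {"STATUS_CREATING", "STATUS_UPDATING"}


def wait_until_ready(get_status, timeout_s=600.0, interval_s=5.0, sleep=time.sleep):
    """Poll get_status() until STATUS_READY, a non-waiting state, or a timeout.

    get_status: hypothetical zero-argument callable returning a status string.
    sleep: injectable for testing; defaults to time.sleep.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "STATUS_READY":
            return status
        if status not in IN_PROGRESS:
            # STATUS_ERROR, STATUS_FAILED, STATUS_DELETING, or an unknown value.
            raise RuntimeError(f"deployment is {status}; check status.conditions")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"deployment still {status} after {timeout_s} seconds")
        sleep(interval_s)
```

Keeping the status lookup as a parameter makes the loop easy to test with a canned sequence of states and avoids hard-coding any request details.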

Manage deployments

Manage deployments through the CoreWeave Inference API. For per-operation request and response schemas, see the DeploymentService pages in the API reference.
Update requests (PATCH) require the complete deployment specification, not just the fields you’re changing. Omitted fields revert to their defaults.
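Because omitted fields revert to defaults, a safe pattern is read-modify-write: fetch the current specification, merge your change into a copy, and send the whole result. The sketch below shows only the merge step. The helper name merge_spec is hypothetical, and the nesting of the example specification (top-level model, runtime, resources, traffic, and disabled keys) is an assumption based on the field names on this page; confirm the actual schema in the API reference.

```python
import copy


def merge_spec(current, changes):
    """Return a full specification: `current` with `changes` merged in.

    Nested dicts are merged key by key; other values are replaced.
    Neither input is modified.
    """
    merged = copy.deepcopy(current)
    for key, value in changes.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_spec(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

For instance, merging {"traffic": {"weight": 100}} into a fetched specification changes only the traffic weight and carries the model, runtime, and resources blocks through unchanged, so the PATCH body stays complete.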
For a step-by-step walkthrough, see the Getting started guide.
Last modified on June 10, 2026