
A deployment in CoreWeave Inference configures a model serving instance that runs your model on dedicated GPU infrastructure. Each deployment specifies the model weights, inference runtime, GPU type, and scaling parameters. This page covers how to configure model weights, select an inference runtime, choose GPU resources, and manage traffic across deployments.

Bring Your Own Weights

CoreWeave Inference uses a Bring Your Own Weights (BYOW) model. You provide the model artifacts, and CoreWeave handles the infrastructure required to serve them. Model weights must be stored in a CoreWeave Object Storage bucket. When you create a deployment, you must specify the bucket name and the path to the model directory within the bucket. When a deployment starts, CoreWeave loads your model weights onto the selected GPU infrastructure and serves requests through the associated gateway.
CoreWeave Inference doesn’t pull model weights directly from external sources such as Hugging Face. Download your model weights and upload them to a CoreWeave Object Storage bucket before creating a deployment.
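For example, a model directory downloaded locally can be copied into a bucket with the AWS CLI. This is a sketch: [BUCKET-NAME] and [PROFILE-NAME] are placeholders, and the local and remote paths are illustrative.

```shell
# Upload a locally downloaded model directory to CoreWeave Object Storage.
# [BUCKET-NAME], [PROFILE-NAME], and the paths are placeholders.
aws s3 sync ./models/my-model/ \
  s3://[BUCKET-NAME]/models/my-model/ \
  --endpoint-url https://cwobject.com \
  --profile [PROFILE-NAME]
```

`aws s3 sync` uploads only files that are missing or changed at the destination, which is convenient when re-uploading large checkpoints.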

Grant inference access to your bucket

Dedicated Inference reads your model weights from CoreWeave Object Storage using a dedicated service account. Attach the following bucket policy to your weights bucket so the service can list and read its contents. Replace [BUCKET-NAME] with the name of the bucket containing your model weights.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowInference",
      "Effect": "Allow",
      "Principal": {
        "CW": [
          "arn:aws:iam::cw4637:coreweave/uvAGGQSxxXeeQBJzcGsD9"
        ]
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::[BUCKET-NAME]",
        "arn:aws:s3:::[BUCKET-NAME]/*"
      ]
    }
  ]
}
If you store models only under a specific prefix, you can narrow object access to that prefix:
"Resource": [
  "arn:aws:s3:::[BUCKET-NAME]",
  "arn:aws:s3:::[BUCKET-NAME]/path/to/models/*"
]
The bucket-level ARN (arn:aws:s3:::[BUCKET-NAME]) must remain as-is for s3:ListBucket. Save the policy to a file (for example, inference-bucket-policy.json) and apply it with the AWS CLI. If you have configured a named AWS CLI profile for CoreWeave, pass it with --profile [PROFILE-NAME]. Omit the flag if your CoreWeave credentials are in the default profile.
aws s3api put-bucket-policy \
  --bucket [BUCKET-NAME] \
  --policy file://inference-bucket-policy.json \
  --endpoint-url https://cwobject.com \
  --profile [PROFILE-NAME]
For other tools (s3cmd, Boto3, Terraform) and the full reference, see Manage bucket policies.
The Cloud Console doesn’t support setting bucket policies. Apply the policy with the AWS CLI, an S3 SDK, or Terraform. Organization access policies in the Console are a different mechanism and won’t grant the inference service account access to your bucket.
The Principal value in the preceding policy is the canonical CoreWeave Inference service account and is the same for all customers. If a deployment reaches STATUS_READY but fails to serve requests with a 520 error or a timeout while loading weights, a missing bucket policy is the most common cause.
Once the policy is in place, deployments can read model weights from the bucket when they start.
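To confirm the policy was applied, you can read it back with the AWS CLI (same placeholders as above):

```shell
aws s3api get-bucket-policy \
  --bucket [BUCKET-NAME] \
  --endpoint-url https://cwobject.com \
  --profile [PROFILE-NAME]
```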

Model configuration

The model field in a deployment specifies where to find the model weights and what name to use for routing:
| Field | Required | Description |
| --- | --- | --- |
| name | Yes | The model name used to route inference requests and listed in the /models endpoint. Must be 4-63 characters. |
| bucket | Yes | The CoreWeave Object Storage bucket containing the model weights. |
| path | Yes | The path within the bucket to the model and configuration files. |
The model name is how clients identify which deployment should handle their request. For body-based routing, the model field in the request body must match this name. For path-based routing, the model name is the first segment of the URL path.
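As a sketch, a model block might look like the following. The name, bucket, and path here are illustrative, not defaults:

```json
{
  "model": {
    "name": "llama-3-8b-instruct",
    "bucket": "my-weights-bucket",
    "path": "models/llama-3-8b-instruct/"
  }
}
```

A client using body-based routing would then send "model": "llama-3-8b-instruct" in the request body to reach this deployment.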

Inference runtimes

CoreWeave manages the inference runtime. The runtime field configures which engine and version to use. You must specify an engine when creating a deployment.

Supported engines

| Engine | Description |
| --- | --- |
| vllm | Inference engine for large language models. Supports continuous batching, paged attention, and tensor parallelism. |

Runtime configuration

| Field | Required | Description |
| --- | --- | --- |
| engine | Yes | The inference engine. Supported value: vllm. |
| version | No | The engine version in semantic versioning (SemVer) format. Defaults to the latest version if not set. |
| engineConfig | No | A map of engine-specific configuration options as key-value pairs. |
Query the available runtime versions and configuration options from the deployment parameters endpoint:
curl "${CW_BASE_URL}/v1alpha1/inference/deployments/parameters" \
  -H "Authorization: Bearer ${CW_API_TOKEN}"

Engine configuration options

The engineConfig field accepts engine-specific key-value pairs that control model serving behavior. Supported keys for vllm:
| Key | Description |
| --- | --- |
| max-model-len | Maximum sequence length (context window) the model can handle. Reduce this to decrease memory usage for workloads with shorter sequences. |
| reasoning-parser | Parser for structured reasoning output from thinking models. Set to the parser matching your model (for example, qwen3, deepseek_r1). |
| enable-auto-tool-choice | Enables automatic tool selection for tool-calling models. Set the value to an empty string to enable. |
| tool-call-parser | Parser for tool call output format. Set to the parser matching your model (for example, llama3_json, qwen3_coder). |
| max-num-batched-tokens | Maximum number of tokens processed in a single batch. Controls the throughput and latency tradeoff. |
| structured-outputs-config.backend | Backend for structured output generation. Options: guidance, xgrammar. |
The deployment parameters endpoint returns the full list of allowed configuration keys for each engine under runtimeParameters.runtimeConfigOptions.
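Putting the runtime fields together, a runtime block might look like the following sketch. The engineConfig values are illustrative choices for a hypothetical tool-calling Llama model, not defaults; confirm allowed keys against the parameters endpoint.

```json
{
  "runtime": {
    "engine": "vllm",
    "engineConfig": {
      "max-model-len": "8192",
      "enable-auto-tool-choice": "",
      "tool-call-parser": "llama3_json"
    }
  }
}
```

Note that enable-auto-tool-choice is set to an empty string, which is how flag-style options are enabled.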

GPU selection

Deployments run on dedicated GPU infrastructure. The resources field configures the hardware allocation.
| Field | Required | Description |
| --- | --- | --- |
| instanceType | Yes | The GPU instance type. Query available types from the deployment parameters endpoint. |
| gpuCount | Yes | The number of GPUs allocated to each replica. Allowed values: 1, 2, 4, 8, 16. CPU and RAM are automatically assigned based on GPU count. |
Query the available instance types from the deployment parameters endpoint:
curl "${CW_BASE_URL}/v1alpha1/inference/deployments/parameters" \
  -H "Authorization: Bearer ${CW_API_TOKEN}"
The response includes the available instance types under resourceParameters.instanceTypes.
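A resources block is then a sketch like the following; replace [INSTANCE-TYPE] with a type returned by the parameters endpoint:

```json
{
  "resources": {
    "instanceType": "[INSTANCE-TYPE]",
    "gpuCount": 2
  }
}
```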

Choose an instance type

Match the instance type to your model’s requirements:
  • Model size: Choose a GPU with enough memory to fit your model weights and the inference runtime’s working memory (KV cache).
  • Throughput: Higher-end GPUs (H200, B200) provide more compute and memory bandwidth for faster inference.
  • Multi-GPU: For models that exceed a single GPU’s memory, increase gpuCount to allocate additional GPUs per replica.
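A back-of-the-envelope check can help with the model-size bullet above. This is a sketch, not an official CoreWeave formula: it estimates only the memory needed to hold the weights, ignoring the KV cache and runtime overhead, which also need headroom.

```python
# Rough GPU-memory estimate for the weights alone.
# Assumption: bf16/fp16 weights at 2 bytes per parameter.

def weight_memory_gib(num_params_billion: float, bytes_per_param: int = 2) -> float:
    """GiB needed just to hold the weights."""
    return num_params_billion * 1e9 * bytes_per_param / 2**30

# A 70B-parameter model in bf16 needs roughly 130 GiB for weights alone,
# so it cannot fit on a single 80 GiB GPU; set gpuCount to 2 or more.
print(round(weight_memory_gib(70), 1))  # → 130.4
```

By the same estimate, a 7B model needs about 13 GiB and fits comfortably on one GPU.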

Traffic weights

The traffic field controls how traffic is distributed when multiple deployments share the same model name on the same gateway.
| Field | Description |
| --- | --- |
| weight | An integer from 0 to 1000. Weights across all deployments with the same model name are normalized into percentages. |
Traffic weights enable canary deployments and A/B testing. For example, if deployment A has weight: 900 and deployment B has weight: 100, deployment A receives 90% of traffic and deployment B receives 10%. Setting weight to 0 stops traffic to a deployment without deleting it.
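The normalization described above can be sketched as a simple proportional split (an assumption that matches the 900/100 example, not a statement of the gateway's internals):

```python
# Sketch: convert per-deployment weights (0-1000) into traffic percentages
# for deployments that share a model name on the same gateway.

def traffic_split(weights: dict[str, int]) -> dict[str, float]:
    """Normalize weights into percentages that sum to 100."""
    total = sum(weights.values())
    return {name: 100 * w / total for name, w in weights.items()}

print(traffic_split({"deployment-a": 900, "deployment-b": 100}))
# → {'deployment-a': 90.0, 'deployment-b': 10.0}
```

Because only the ratios matter, weights of 9 and 1 would produce the same 90/10 split.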

Disable a deployment

Set the disabled field to true to stop a deployment from serving traffic without deleting it. A disabled deployment retains its configuration. To re-enable it, set disabled to false.

Deployment lifecycle

Deployments go through the following states:
| Status | Description |
| --- | --- |
| STATUS_CREATING | The deployment is being provisioned and model weights are loading. |
| STATUS_READY | The deployment is ready and serving requests. |
| STATUS_UPDATING | The deployment configuration is being updated. Existing replicas continue serving traffic during updates. |
| STATUS_DELETING | The deployment is being removed. Existing requests are given a grace period to complete. |
| STATUS_ERROR | The deployment encountered an error. Check status.conditions for details. |
| STATUS_FAILED | The deployment failed to start. Check status.conditions for details. |

Manage deployments

Manage deployments through the CoreWeave Inference API. For per-operation request and response schemas, see the DeploymentService pages in the API reference.
Update requests (PATCH) require the complete deployment specification, not just the fields being changed. Omitted fields revert to their defaults.
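Because a PATCH carries the complete specification, an update body includes every field, even unchanged ones. The following is a sketch assembled from the fields on this page; the names and values are illustrative, and the exact schema is defined in the DeploymentService API reference.

```json
{
  "model": {
    "name": "llama-3-8b-instruct",
    "bucket": "my-weights-bucket",
    "path": "models/llama-3-8b-instruct/"
  },
  "runtime": {
    "engine": "vllm"
  },
  "resources": {
    "instanceType": "[INSTANCE-TYPE]",
    "gpuCount": 2
  },
  "traffic": {
    "weight": 1000
  },
  "disabled": false
}
```

Sending only the field you want to change (for example, just traffic) would reset the omitted fields to their defaults.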
For a step-by-step walkthrough, see the Getting started guide.
Last modified on May 6, 2026