This guide walks you through deploying your first model on CoreWeave Dedicated Inference, the Bring Your Own Weights (BYOW) option for serving models on dedicated GPU infrastructure. It targets developers and ML engineers who want to serve their own model weights on managed GPU instances without operating the underlying serving stack. By the end, you will have a running inference endpoint that responds to prompts through the OpenAI-compatible API. You can then integrate the endpoint into applications that already target OpenAI-style chat completions. For an overview of CoreWeave’s inference offerings, see the Inference introduction.
CoreWeave Dedicated Inference is available as a private preview. To request access, contact your CoreWeave representative.

Prerequisites

Before you begin, verify that you have the following:
  • A CoreWeave account with Inference access enabled.
  • A CoreWeave API access token with the Inference Admin role.
  • Model weights uploaded to a CoreWeave Object Storage bucket. Dedicated Inference uses a bring-your-own-weights (BYOW) model. Download model weights from your model provider and upload them to Object Storage before starting.
  • curl or another HTTP client for making API requests.
Want to interact with the API programmatically? See the Inference API reference for the REST, gRPC, and Connect interfaces, or install a generated client from the Inference SDKs.

Set your API token

Set your API token as an environment variable so that subsequent commands can authenticate with the CoreWeave API. Replace [API-TOKEN] with your token. For details about creating a token, see Manage API access tokens.
export CW_API_TOKEN="[API-TOKEN]"
export CW_BASE_URL="https://api.coreweave.com"
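If you script against the API from Python instead of the shell, the same two environment variables can be read at startup. The following is a minimal sketch using only the standard library. The helper name build_request is hypothetical and is not part of any CoreWeave SDK. It constructs the request without sending it.

```python
import json
import os
import urllib.request


def build_request(path, method="GET", body=None):
    """Build (but do not send) an authenticated request to the CoreWeave API.

    Hypothetical helper: reads CW_API_TOKEN and CW_BASE_URL, the same
    variables exported in the shell commands above.
    """
    token = os.environ["CW_API_TOKEN"]  # raises KeyError if the token is unset
    base_url = os.environ.get("CW_BASE_URL", "https://api.coreweave.com")
    headers = {"Authorization": f"Bearer {token}"}
    data = None
    if body is not None:
        # JSON-encode the body and declare its content type.
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=data, headers=headers, method=method
    )
```

Passing the returned object to urllib.request.urlopen sends the request. Failing fast with a KeyError when the token is missing is a deliberate choice in this sketch, so that a misconfigured environment is caught before any call is made.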

Grant inference access to your bucket

Dedicated Inference reads model weights from CoreWeave Object Storage using a dedicated service account. Attach the following bucket policy to your weights bucket so the service can list and read its contents. Replace [BUCKET-NAME] with the name of the bucket containing your model weights.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowInference",
      "Effect": "Allow",
      "Principal": {
        "CW": [
          "arn:aws:iam::cw4637:coreweave/uvAGGQSxxXeeQBJzcGsD9"
        ]
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::[BUCKET-NAME]",
        "arn:aws:s3:::[BUCKET-NAME]/*"
      ]
    }
  ]
}
Save the policy to a file (for example, inference-bucket-policy.json) and apply it with the AWS CLI. If you have configured a named AWS CLI profile for CoreWeave (for example, cw), pass it with --profile [PROFILE-NAME]. Omit --profile if your CoreWeave credentials are in the default profile.
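The put-bucket-policy command replaces the bucket’s existing policy. It does not merge. If the bucket already has a policy, applying this one on its own overrides the permissions that the existing policy granted, so add the AllowInference statement to your existing policy and apply the combined document instead.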
aws s3api put-bucket-policy \
  --bucket [BUCKET-NAME] \
  --policy file://inference-bucket-policy.json \
  --endpoint-url https://cwobject.com \
  --profile [PROFILE-NAME]
The Principal value is the canonical CoreWeave Inference service account and is the same for all customers. A missing bucket policy is the most common cause of deployments that fail to load weights. For tooling alternatives (s3cmd, Boto3, Terraform), how to scope access to a specific path prefix, and notes on the Cloud Console, see Grant inference access to your bucket.
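Because put-bucket-policy replaces rather than merges, it can help to build the combined policy document before applying it. The sketch below is one possible way to do that in Python. The function names are hypothetical, the Principal is copied from the policy shown above, and retrieving the bucket's current policy is left to whichever tool you already use.

```python
import copy

# Principal copied from the policy shown in this guide.
INFERENCE_PRINCIPAL = "arn:aws:iam::cw4637:coreweave/uvAGGQSxxXeeQBJzcGsD9"


def inference_statement(bucket):
    """Return the AllowInference statement for the given bucket name."""
    return {
        "Sid": "AllowInference",
        "Effect": "Allow",
        "Principal": {"CW": [INFERENCE_PRINCIPAL]},
        "Action": ["s3:ListBucket", "s3:GetObject"],
        "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
    }


def merge_inference_policy(bucket, existing_policy=None):
    """Return a policy that keeps existing statements and adds AllowInference.

    An earlier statement with the same Sid is replaced, so the merge is
    safe to repeat. The input policy is not modified.
    """
    if existing_policy:
        policy = copy.deepcopy(existing_policy)
    else:
        policy = {"Version": "2012-10-17", "Statement": []}
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a lone statement may appear without a list
        statements = [statements]
    kept = [s for s in statements if s.get("Sid") != "AllowInference"]
    policy["Statement"] = kept + [inference_statement(bucket)]
    return policy
```

Writing the result with json.dump to inference-bucket-policy.json produces a file that can be passed to the put-bucket-policy command without dropping statements the bucket already relied on.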

Create a gateway

With the bucket policy in place, provision the gateway. A gateway provides the external endpoint that routes traffic to your model deployments. It handles authentication and load balancing. First, query the available zones so you can place the gateway in a zone where Dedicated Inference capacity is offered:
curl "${CW_BASE_URL}/v1alpha1/inference/gateways/parameters" \
  -H "Authorization: Bearer ${CW_API_TOKEN}"
If the request returns {"code":7, "message":"organization is not allowed to perform this operation"}, your organization isn’t enabled for Dedicated Inference yet. Contact your CoreWeave representative or CoreWeave support with your organization ID to request access.
The response lists all available zones:
{"zones": ["RNO2A", "US-EAST-01A", "US-EAST-02A", "US-EAST-04A", "US-EAST-04B", "US-EAST-06A", "US-EAST-08A", "US-EAST-13A", "US-EAST-14A", "US-WEST-01A", "US-WEST-04A", "US-WEST-09B"]}
Then create a gateway in one of the zones returned by the previous request, replacing [ZONE-NAME] with your chosen zone. The gateway name must be a valid hostname label: letters, digits, and hyphens only, starting and ending with a letter or digit, and no more than 63 characters. Dots are not allowed. If validation fails, you receive validation error: name: must be a valid hostname label. This example creates a gateway with CoreWeave IAM authentication and body-based routing, which routes requests based on the model field in the request body. Body-based routing is the default and follows OpenAI API conventions.
curl -X POST "${CW_BASE_URL}/v1alpha1/inference/gateways" \
  -H "Authorization: Bearer ${CW_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-first-gateway",
    "zones": ["[ZONE-NAME]"],
    "coreWeaveAuth": {},
    "bodyBasedRouting": {
      "apiType": "API_TYPE_OPENAI"
    }
  }'
The response includes the gateway specification and creation timestamps. The status.status and endpoints fields appear on subsequent GET requests once the gateway is provisioned.
{
  "gateway": {
    "spec": {
      "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "name": "my-first-gateway",
      "organizationId": "[ORG-UID]",
      "zones": ["US-EAST-04A"],
      "coreWeaveAuth": {},
      "bodyBasedRouting": {
        "apiType": "API_TYPE_OPENAI"
      }
    },
    "status": {
      "createdAt": "2026-04-14T12:00:00Z",
      "updatedAt": "2026-04-14T12:00:00Z"
    }
  }
}
Save the gateway ID from gateway.spec.id for the next step. You need this ID to associate deployments with the gateway.
export CW_GATEWAY_ID="[GATEWAY-ID]"
You now have a gateway that can accept inference traffic but doesn’t yet route to any model. The next section attaches a deployment to it.
The gateway may take a few moments to become ready. You can check its status with a GET request to /v1alpha1/inference/gateways/{id}.
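The hostname-label rule for gateway names can be checked locally before calling the API. The following is a small sketch of that rule as a regular expression. The function name is hypothetical, and the server may apply stricter checks (for example, on letter case) than the rule as worded in this guide.

```python
import re

# Letters, digits, and hyphens; starts and ends with a letter or digit;
# 1 to 63 characters in total. Dots fail because they are not in the set.
_LABEL_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_valid_hostname_label(name):
    """Return True if name satisfies the hostname-label rule described above."""
    return _LABEL_RE.fullmatch(name) is not None


print(is_valid_hostname_label("my-first-gateway"))  # True
print(is_valid_hostname_label("my.first.gateway"))  # False
```

The optional group in the pattern is what lets a single-character name pass while still forbidding a leading or trailing hyphen on longer names.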

View all gateways

To confirm that your gateway was created correctly, list all existing gateways. The first command prints the raw response. The second pipes it through jq so the JSON is easier to read.
curl "${CW_BASE_URL}/v1alpha1/inference/gateways" -H "Authorization: Bearer ${CW_API_TOKEN}"

curl "${CW_BASE_URL}/v1alpha1/inference/gateways" -H "Authorization: Bearer ${CW_API_TOKEN}" | jq .

Create a deployment

A deployment configures a model serving instance with your chosen runtime, GPU type, and model weights. Attaching the deployment to the gateway you just created makes the model reachable through the gateway endpoint. First, query the available instance types and runtime versions so you can pick a GPU type and runtime that match your model:
curl "${CW_BASE_URL}/v1alpha1/inference/deployments/parameters" \
  -H "Authorization: Bearer ${CW_API_TOKEN}"
Then create a deployment that references your gateway. Replace the placeholder values:
  • [GATEWAY-ID]: The gateway ID you saved after creating the gateway. The inference/gateways list command also returns it.
  • [INSTANCE-TYPE]: An instance type from the inference/deployments/parameters response in the previous step. For details on each type, see GPU instances.
  • [MODEL-NAME]: A name for your model (from 4 to 63 characters). The gateway uses this name to route inference requests to this deployment.
  • [BUCKET-NAME]: The CoreWeave Object Storage bucket containing the model weights.
  • [MODEL-PATH]: The path within the bucket to the model directory.
  • [ENGINE-VERSION]: The vLLM runtime version to serve your model with. The inference/deployments/parameters response (above) returns the available versions.
The deployment’s top-level name field ("my-first-deployment" in the example) must be a valid hostname label: letters, digits, and hyphens only, starting and ending with a letter or digit, and no more than 63 characters. Dots are not allowed. The model’s name field ([MODEL-NAME]) follows the same hostname label rule. If the deployment name fails validation, you receive validation error: name: must be a valid hostname label.
An S3 path breaks down into [BUCKET-NAME] and [MODEL-PATH]. For example, s3://test-bucket/raw/Qwen/Qwen3.5-0.8B/2fc06364715b967f1860aea9cf38778875588b17 breaks down into:
[BUCKET-NAME]  test-bucket
[MODEL-PATH]   raw/Qwen/Qwen3.5-0.8B/2fc06364715b967f1860aea9cf38778875588b17
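The same breakdown can be done programmatically. This is a minimal sketch with a hypothetical helper name. It assumes the URI has the s3://bucket/path form shown above and leaves the path otherwise untouched.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/path/to/model' into (bucket, path)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an s3:// URI: {uri!r}")
    # Everything before the first slash is the bucket; the rest is the path.
    bucket, _, path = uri[len(prefix):].partition("/")
    if not bucket or not path:
        raise ValueError(f"URI needs both a bucket and a path: {uri!r}")
    return bucket, path


bucket, path = split_s3_uri(
    "s3://test-bucket/raw/Qwen/Qwen3.5-0.8B/2fc06364715b967f1860aea9cf38778875588b17"
)
print(bucket)  # test-bucket
print(path)    # raw/Qwen/Qwen3.5-0.8B/2fc06364715b967f1860aea9cf38778875588b17
```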
If you receive {"code":3, "message":"model path has no objects: path ..."}, the model path is invalid or inaccessible. If you receive {"code":3, "message":"model bucket is not accessible..."}, the bucket is invalid or inaccessible.
curl -X POST "${CW_BASE_URL}/v1alpha1/inference/deployments" \
  -H "Authorization: Bearer ${CW_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-first-deployment",
    "gatewayIds": ["[GATEWAY-ID]"],
    "runtime": {
      "engine": "vllm",
      "version": "[ENGINE-VERSION]"
    },
    "resources": {
      "instanceType": "[INSTANCE-TYPE]",
      "gpuCount": 1
    },
    "model": {
      "name": "[MODEL-NAME]",
      "bucket": "[BUCKET-NAME]",
      "path": "[MODEL-PATH]"
    },
    "autoscaling": {
      "min": 1,
      "max": 1
    },
    "traffic": {
      "weight": 100
    }
  }'
The response includes the deployment ID and creation timestamps. The status.status field appears when you poll the deployment with a GET request.
{
  "deployment": {
    "spec": {
      "id": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
      "name": "my-first-deployment",
      "organizationId": "[ORG-UID]",
      "gatewayIds": ["a1b2c3d4-e5f6-7890-abcd-ef1234567890"],
      "runtime": { "engine": "vllm", "version": "[ENGINE-VERSION]" },
      "resources": { "instanceType": "gd-8xh100ib-i128", "gpuCount": 1 },
      "model": { "name": "my-model", "bucket": "my-bucket", "path": "models/my-model/" },
      "autoscaling": { "min": 1, "max": 1 },
      "traffic": { "weight": 100 }
    },
    "status": {
      "createdAt": "2026-04-14T12:10:00Z",
      "updatedAt": "2026-04-14T12:10:00Z"
    }
  }
}
Save the deployment ID from deployment.spec.id so subsequent commands can reference the deployment:
export CW_DEPLOYMENT_ID="[DEPLOYMENT-ID]"
The deployment is registered, but the inference engine isn’t yet serving traffic. The next section covers monitoring it until it becomes ready.
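The error responses quoted in this guide can be mapped to troubleshooting hints in a script. The sketch below is illustrative only. The function name is hypothetical, the matching relies on the message prefixes quoted in this guide, and reading the numeric codes as gRPC status codes (3 as INVALID_ARGUMENT, 7 as PERMISSION_DENIED) is an assumption about this API.

```python
def explain_error(body):
    """Map an error response body (a dict) to a short troubleshooting hint."""
    code = body.get("code")
    message = body.get("message", "")
    if code == 7 and "not allowed" in message:
        return "Organization is not enabled for Dedicated Inference; request access."
    if code == 3 and message.startswith("model path has no objects"):
        return "Check [MODEL-PATH]: the model path is invalid or inaccessible."
    if code == 3 and message.startswith("model bucket is not accessible"):
        return "Check [BUCKET-NAME] and the bucket policy: the bucket is invalid or inaccessible."
    if "must be a valid hostname label" in message:
        return "Rename the resource: letters, digits, and hyphens only, at most 63 characters."
    # Anything else is passed through unchanged for manual inspection.
    return f"Unrecognized error (code={code}): {message}"
```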

View all deployments

To confirm that your deployment was created correctly, list all existing deployments. The first command prints the raw response. The second pipes it through jq so the JSON is easier to read.
curl "${CW_BASE_URL}/v1alpha1/inference/deployments" -H "Authorization: Bearer ${CW_API_TOKEN}"

curl "${CW_BASE_URL}/v1alpha1/inference/deployments" -H "Authorization: Bearer ${CW_API_TOKEN}" | jq .

Wait for the deployment to start

After creation, the deployment loads model weights and starts the inference engine. Poll the deployment status until it reaches STATUS_READY. This typically takes several minutes.
curl "${CW_BASE_URL}/v1alpha1/inference/deployments/${CW_DEPLOYMENT_ID}" \
  -H "Authorization: Bearer ${CW_API_TOKEN}"
Check deployment.status.status in the response. Continue polling until you see STATUS_READY:
{
  "deployment": {
    "status": {
      "status": "STATUS_READY"
    }
  }
}
To poll automatically, use this loop. It exits when the deployment reaches STATUS_READY, fails with STATUS_ERROR or STATUS_FAILED, or reaches the 15-minute timeout:
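deadline=$(( $(date +%s) + 900 ))  # 15-minute timeout
status_deployment="UNKNOWN"
while [ "$(date +%s)" -lt "$deadline" ]; do
  # Fetch the current status; fall back to UNKNOWN if the field is absent.
  status_deployment=$(curl -sS "${CW_BASE_URL}/v1alpha1/inference/deployments/${CW_DEPLOYMENT_ID}" \
    -H "Authorization: Bearer ${CW_API_TOKEN}" \
    | jq -r '.deployment.status.status // "UNKNOWN"')
  echo "Status: $status_deployment"
  case "$status_deployment" in
    STATUS_READY) echo "Deployment is ready."; break ;;
    STATUS_ERROR|STATUS_FAILED) echo "Deployment failed."; exit 1 ;;
  esac
  sleep 30
done
# Reaching this point without STATUS_READY means the deadline passed.
[ "$status_deployment" = "STATUS_READY" ] || echo "Timed out waiting for the deployment (last status: $status_deployment)."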
This loop uses jq to parse the response. Install jq with your package manager (for example, brew install jq on macOS or apt install jq on Debian or Ubuntu).
Once the deployment is running, retrieve the gateway endpoint URL:
curl "${CW_BASE_URL}/v1alpha1/inference/gateways/${CW_GATEWAY_ID}" \
  -H "Authorization: Bearer ${CW_API_TOKEN}"
The gateway.status.endpoints field contains an array of endpoint URLs for inference requests. The first entry is the primary endpoint:
{
  "gateway": {
    "status": {
      "status": "STATUS_READY",
      "endpoints": ["https://my-first-gateway.[ORG-UID].gw.cwinference.com"]
    }
  }
}
Export the endpoint for the next step:
export CW_GATEWAY_ENDPOINT="[GATEWAY-ENDPOINT]"
The gateway’s public DNS record and TLS certificate provision asynchronously after the deployment reaches STATUS_READY and can take several minutes to resolve. If your first inference request fails with an SSL handshake error or DNS resolution failure, wait a few minutes and retry.
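The polling pattern used in this section can also be written in Python. The following sketch separates the waiting logic from the HTTP call so it can be tested without a network. The function name and the fetch_status parameter are hypothetical. The status names and the 15-minute and 30-second defaults come from the shell loop in this guide.

```python
import time

FAILED = {"STATUS_ERROR", "STATUS_FAILED"}


def wait_until_ready(fetch_status, timeout_s=900, interval_s=30,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll fetch_status() until it returns STATUS_READY.

    fetch_status is any zero-argument callable returning the current
    status string. Raises RuntimeError on a failed status and
    TimeoutError when the next wait would pass the deadline.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status == "STATUS_READY":
            return status
        if status in FAILED:
            raise RuntimeError(f"deployment failed: {status}")
        if clock() + interval_s > deadline:
            raise TimeoutError(f"still {status} after {timeout_s} seconds")
        sleep(interval_s)
```

In practice, fetch_status would issue the GET request shown earlier and return deployment.status.status from the parsed JSON. Raising exceptions instead of printing lets a calling script decide how to react to a failure or a timeout.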

Send an inference request

With the deployment ready and the gateway endpoint exported, you can now send your first inference request to verify the end-to-end path. The gateway exposes an OpenAI-compatible API. With body-based routing, the gateway routes requests based on the model field in the request body. Send a chat completion request using the model name from your deployment:
curl -X POST "${CW_GATEWAY_ENDPOINT}/v1/chat/completions" \
  -H "Authorization: Bearer ${CW_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "[MODEL-NAME]",
    "messages": [
      {
        "role": "user",
        "content": "What is CoreWeave?"
      }
    ],
    "max_tokens": 256
  }'
The same CoreWeave API access token used for the management API authenticates inference requests when the gateway uses coreWeaveAuth. The response is an OpenAI-compatible chat completion. Beyond the standard fields, vLLM also returns several engine-specific fields (token_ids, prompt_logprobs, kv_transfer_params, and others). These are typically null for normal requests, so you can ignore them.
{
  "id": "chatcmpl-bcbcbae71847bf87",
  "object": "chat.completion",
  "created": 1777477750,
  "model": "[MODEL-NAME]",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "CoreWeave is a cloud infrastructure provider...",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": null
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 24,
    "total_tokens": 104,
    "completion_tokens": 80,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
A successful response confirms that your gateway and deployment are working. Alternatively, use the Python openai library:
from openai import OpenAI

client = OpenAI(
    base_url="[GATEWAY-ENDPOINT]/v1/",
    api_key="[API-TOKEN]",
)

response = client.chat.completions.create(
    model="[MODEL-NAME]",
    messages=[
        {"role": "user", "content": "What is CoreWeave?"},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
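When consuming responses in an application, it is often useful to check why generation stopped. The sketch below works on the response as a plain dictionary, such as the parsed JSON from the curl example. The function name is hypothetical. In OpenAI-style responses, a finish_reason of "length" indicates that output stopped at a token limit rather than at a natural end.

```python
def summarize_completion(response):
    """Extract the reply text and flag truncation from a chat completion dict."""
    choice = response["choices"][0]
    usage = response.get("usage") or {}
    finish_reason = choice.get("finish_reason")
    return {
        "content": choice["message"]["content"],
        "finish_reason": finish_reason,
        # "length" means generation stopped at a token limit.
        "truncated": finish_reason == "length",
        "total_tokens": usage.get("total_tokens"),
    }
```

If the truncated flag is set, raising max_tokens in the request is one option to consider. With the openai library, calling model_dump() on the response object should yield a dictionary of the same shape.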

Update a deployment

Over time, you may need to adjust capacity, swap the GPU type, or point the deployment at new model weights. To change a deployment’s configuration, send a PATCH request with the complete deployment specification. All fields are required, not just the ones you change. Omitting a field either fails validation (for required fields) or reverts it to its default, so start from the command you used to create the deployment and change only the fields you want to update.
When you update a deployment, any patch that changes routing (for example, model.name) forces the route to be recreated. The old route becomes unavailable immediately and stays down until the new route is ready.
This example increases the autoscaling maximum from 1 to 4:
curl -X PATCH "${CW_BASE_URL}/v1alpha1/inference/deployments/${CW_DEPLOYMENT_ID}" \
  -H "Authorization: Bearer ${CW_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "id": "[DEPLOYMENT-ID]",
    "name": "my-first-deployment",
    "gatewayIds": ["[GATEWAY-ID]"],
    "runtime": {
      "engine": "vllm",
      "version": "[ENGINE-VERSION]"
    },
    "resources": {
      "instanceType": "[INSTANCE-TYPE]",
      "gpuCount": 1
    },
    "model": {
      "name": "[MODEL-NAME]",
      "bucket": "[BUCKET-NAME]",
      "path": "[MODEL-PATH]"
    },
    "autoscaling": {
      "min": 1,
      "max": 4
    },
    "traffic": {
      "weight": 100
    }
  }'
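Because the PATCH body must carry every field, one way to avoid accidentally dropping a field is to derive the body from the specification used at creation time. The following is a sketch of that idea. The function name is hypothetical, and it merges nested objects only one level deep, which is enough for the fields shown in this guide.

```python
import copy


def build_patch_body(create_spec, deployment_id, changes):
    """Return a complete PATCH body: the original spec plus selected changes.

    changes maps top-level field names to new values. Dict values are merged
    one level deep, so {"autoscaling": {"max": 4}} keeps "min" intact.
    """
    body = copy.deepcopy(create_spec)  # never mutate the caller's spec
    body["id"] = deployment_id
    for key, value in changes.items():
        if isinstance(value, dict) and isinstance(body.get(key), dict):
            body[key].update(copy.deepcopy(value))
        else:
            body[key] = copy.deepcopy(value)
    return body
```

For the example above, build_patch_body(spec, deployment_id, {"autoscaling": {"max": 4}}) would return the full specification with only the autoscaling maximum changed, where spec is the dictionary form of the body you sent when creating the deployment.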

Observability

To view logs from your deployment:
  1. Go to console.coreweave.com and open Grafana under the Observability section.
  2. In Grafana, click Explore in the left nav, then select CoreWeave Logs from the data source dropdown.
  3. In the query builder, click Code on the right, then enter the following query:
    {cluster="cwinference", container="vllm-server"}
    
  4. Click Run query in the top-right corner to see the logs.

Clean up

When you no longer need the inference endpoint, delete the resources to stop incurring charges. You must delete deployments before their associated gateway.
curl -X DELETE "${CW_BASE_URL}/v1alpha1/inference/deployments/${CW_DEPLOYMENT_ID}" \
  -H "Authorization: Bearer ${CW_API_TOKEN}"
curl -X DELETE "${CW_BASE_URL}/v1alpha1/inference/gateways/${CW_GATEWAY_ID}" \
  -H "Authorization: Bearer ${CW_API_TOKEN}"
After both delete requests succeed, CoreWeave removes all inference resources from this guide and no further charges accrue.
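If you created several deployments behind one gateway, the ordering requirement can be encoded once and reused. The sketch below only computes the request paths in a safe order and does not send anything. The function name is hypothetical, and the paths mirror the DELETE commands above.

```python
def deletion_paths(deployment_ids, gateway_id):
    """Return DELETE paths in a safe order: every deployment, then the gateway."""
    base = "/v1alpha1/inference"
    paths = [f"{base}/deployments/{d}" for d in deployment_ids]
    # The gateway goes last because its deployments must be removed first.
    paths.append(f"{base}/gateways/{gateway_id}")
    return paths
```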

Next steps

You now have a working baseline: a gateway, a deployment serving a model, and a successful inference response. Explore these resources to learn more about CoreWeave Inference and to take the deployment beyond a single-replica baseline.
  • Gateways: Configure authentication, routing strategies, and traffic splitting.
  • Models and deployments: Learn about runtime configuration, GPU selection, and deployment options.
  • Scaling: Configure autoscaling and reserve GPU capacity.
  • Billing: Understand pricing and optimize inference costs.
  • Inference API reference: Explore the full API surface.
Last modified on June 18, 2026