This guide walks you through deploying your first model on CoreWeave Dedicated Inference, the Bring Your Own Weights (BYOW) option for serving models on dedicated GPU infrastructure. By the end, you’ll have a running inference endpoint that responds to prompts using the OpenAI-compatible API. For an overview of CoreWeave’s inference offerings, see the Inference introduction.
Prerequisites
Before you begin, verify that you have the following:
- A CoreWeave account with Inference access enabled.
- A CoreWeave API access token with the Inference Admin role.
- Model weights uploaded to a CoreWeave Object Storage bucket. Dedicated Inference uses a bring-your-own-weights (BYOW) model. Download model weights from your model provider and upload them to Object Storage before starting.
- curl or another HTTP client for making API requests.
Set your API token
Set your API token as an environment variable so that subsequent commands can authenticate with the CoreWeave API. Replace [API-TOKEN] with your token. For details on creating a token, see Manage API access tokens.
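A minimal example, assuming a bash shell; the variable name CW_API_TOKEN is a convention used by the sketches in this guide, not a name the API requires.

```bash
# Replace [API-TOKEN] with your CoreWeave API access token.
export CW_API_TOKEN="[API-TOKEN]"
```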
Grant inference access to your bucket
Dedicated Inference reads model weights from CoreWeave Object Storage using a dedicated service account. Attach the following bucket policy to your weights bucket so the service can list and read its contents. Replace [BUCKET-NAME] with the name of the bucket containing your model weights.
Save the policy to a file (for example, inference-bucket-policy.json) and apply it with the AWS CLI. If you have configured a named AWS CLI profile for CoreWeave (for example, cw), pass it with --profile [PROFILE-NAME]. Omit --profile if your CoreWeave credentials are in the default profile.
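The sketch below writes an illustrative policy and applies it. The Action list, resource ARNs, and the [INFERENCE-SERVICE-ACCOUNT] Principal are placeholders, not the canonical values; take the exact policy from Grant inference access to your bucket.

```bash
# Write an illustrative bucket policy; substitute the canonical Principal and
# actions from the linked documentation before applying it.
cat > inference-bucket-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "[INFERENCE-SERVICE-ACCOUNT]" },
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::[BUCKET-NAME]",
        "arn:aws:s3:::[BUCKET-NAME]/*"
      ]
    }
  ]
}
EOF

# Apply the policy; add --profile [PROFILE-NAME] for a named CoreWeave profile
# and --endpoint-url if your configuration needs an explicit Object Storage endpoint.
aws s3api put-bucket-policy \
  --bucket [BUCKET-NAME] \
  --policy file://inference-bucket-policy.json
```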
The Principal value is the canonical CoreWeave Inference service account and is the same for all customers. A missing bucket policy is the most common cause of deployments that fail to load weights. For tooling alternatives (s3cmd, Boto3, Terraform), how to scope access to a specific path prefix, and notes on the Cloud Console, see Grant inference access to your bucket.
Create a gateway
A gateway provides the external endpoint that routes traffic to your model deployments. It handles authentication and load balancing. First, query the available zones:
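A hedged sketch of the zones query; this guide only names the /v1alpha1/inference/gateways/{id} resource, so the zones path, the [API-HOST] hostname, and the bearer-token header are assumptions to confirm against the Inference API reference.

```bash
# List available zones (path is an assumption; see the Inference API reference).
curl "https://[API-HOST]/v1alpha1/inference/zones" \
  -H "Authorization: Bearer $CW_API_TOKEN"
```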
{"code":7, "message":"organization is not allowed to perform this operation"}, your organization isn’t enabled for Dedicated Inference yet. Contact your CoreWeave representative or CoreWeave support with your organization ID to request access.[ZONE-NAME] with a zone from the response.
This example creates a gateway with CoreWeave IAM authentication and body-based routing, which routes requests based on the model field in the request body. Body-based routing is the default and is compatible with OpenAI API conventions.
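A hedged sketch of the create request; the collection path mirrors the documented /v1alpha1/inference/gateways/{id} resource, but [API-HOST], the bearer-token header, and the request body field names are illustrative. Check the Inference API reference for the exact schema.

```bash
# Create a gateway with CoreWeave IAM auth and body-based routing.
# Body field names are illustrative; consult the Inference API reference.
curl -X POST "https://[API-HOST]/v1alpha1/inference/gateways" \
  -H "Authorization: Bearer $CW_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "gateway": {
      "spec": {
        "zone": "[ZONE-NAME]",
        "authentication": "coreWeaveAuth",
        "routing": "bodyBased"
      }
    }
  }'
```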
The status.status and endpoints fields appear on subsequent GET requests once the gateway is provisioned.
Note the gateway.spec.id for the next step. You need this ID to associate deployments with the gateway.
The gateway may take a few moments to become ready. You can check its status with a GET request to /v1alpha1/inference/gateways/{id}:
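For example (the resource path comes from this guide; [API-HOST] and the bearer-token header are assumptions):

```bash
# Check gateway status; status.status and endpoints appear once provisioned.
curl "https://[API-HOST]/v1alpha1/inference/gateways/[GATEWAY-ID]" \
  -H "Authorization: Bearer $CW_API_TOKEN"
```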
Create a deployment
A deployment configures a model serving instance with your chosen runtime, GPU type, and model weights. First, query the available instance types and runtime versions. Then create the deployment, replacing the following placeholders:
- [GATEWAY-ID]: The gateway ID from the previous step.
- [INSTANCE-TYPE]: An instance type from the parameters response.
- [MODEL-NAME]: A name for your model (4-63 characters). The gateway uses this name to route inference requests to this deployment.
- [BUCKET-NAME]: The CoreWeave Object Storage bucket containing the model weights.
- [MODEL-PATH]: The path within the bucket to the model directory.
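A hedged sketch of both requests; the deployments paths follow the pattern of the documented gateways resource, and the body field names are illustrative, so verify them against the Inference API reference.

```bash
# Query available instance types and runtime versions (the "parameters"
# response); the exact path is an assumption.
curl "https://[API-HOST]/v1alpha1/inference/deployments/parameters" \
  -H "Authorization: Bearer $CW_API_TOKEN"

# Create the deployment; field names are illustrative.
curl -X POST "https://[API-HOST]/v1alpha1/inference/deployments" \
  -H "Authorization: Bearer $CW_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "deployment": {
      "spec": {
        "gatewayId": "[GATEWAY-ID]",
        "instanceType": "[INSTANCE-TYPE]",
        "modelName": "[MODEL-NAME]",
        "weights": {
          "bucket": "[BUCKET-NAME]",
          "path": "[MODEL-PATH]"
        },
        "autoscaling": { "minReplicas": 1, "maxReplicas": 1 }
      }
    }
  }'
```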
The status.status field appears when you poll the deployment with a GET request.
Note the deployment.spec.id from the response; you need it to poll the deployment status in the next step.
Wait for the deployment to start
After creation, the deployment loads model weights and starts the inference engine. Poll the deployment status until it reaches STATUS_READY. This typically takes several minutes.
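A single status check looks like this (the deployments path mirrors the gateways resource documented above; [API-HOST] and the bearer-token header are assumptions):

```bash
# Fetch the deployment and inspect deployment.status.status in the response.
curl -s "https://[API-HOST]/v1alpha1/inference/deployments/[DEPLOYMENT-ID]" \
  -H "Authorization: Bearer $CW_API_TOKEN"
```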
Check deployment.status.status in the response. Continue polling until you see STATUS_READY.
Alternatively, run a loop that polls until the deployment reaches STATUS_READY, fails with STATUS_ERROR or STATUS_FAILED, or 15 minutes elapse:
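A sketch of such a loop, assuming bash and the same illustrative host and path as the previous request; the jq path .deployment.status.status follows the field name used in this guide.

```bash
DEPLOYMENT_ID="[DEPLOYMENT-ID]"   # the deployment.spec.id noted earlier
DEADLINE=$((SECONDS + 900))       # stop after 15 minutes

while [ "$SECONDS" -lt "$DEADLINE" ]; do
  STATUS=$(curl -s "https://[API-HOST]/v1alpha1/inference/deployments/${DEPLOYMENT_ID}" \
    -H "Authorization: Bearer $CW_API_TOKEN" | jq -r '.deployment.status.status')
  echo "$(date +%T) deployment status: $STATUS"
  case "$STATUS" in
    STATUS_READY|STATUS_ERROR|STATUS_FAILED) break ;;
  esac
  sleep 15
done
```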
The loop uses jq to parse the response. Install jq with your package manager (for example, brew install jq on macOS or apt install jq on Debian/Ubuntu).
Once the deployment is running, retrieve the gateway endpoint URL by fetching the gateway resource again.
The gateway.status.endpoints field contains an array of endpoint URLs for inference requests. The first entry is the primary endpoint:
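For example (the jq path assumes the response nests the object under gateway, matching the field reference above):

```bash
# Fetch the gateway and extract the primary endpoint URL.
GATEWAY_ENDPOINT=$(curl -s "https://[API-HOST]/v1alpha1/inference/gateways/[GATEWAY-ID]" \
  -H "Authorization: Bearer $CW_API_TOKEN" | jq -r '.gateway.status.endpoints[0]')
echo "$GATEWAY_ENDPOINT"
```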
The gateway’s public DNS record and TLS certificate provision asynchronously after the deployment reaches STATUS_READY and can take several minutes to fully resolve. If your first inference request fails with an SSL handshake error or DNS resolution failure, wait a few minutes and retry.
Send an inference request
The gateway exposes an OpenAI-compatible API. With body-based routing, requests are routed based on the model field in the request body. Send a chat completion request using the model name from your deployment.
Because the gateway was created with coreWeaveAuth, authenticate the request with your CoreWeave API token.
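A hedged example; it assumes the gateway serves the standard OpenAI /v1/chat/completions path and accepts the CoreWeave token as a bearer token. Replace [GATEWAY-ENDPOINT] with the endpoint URL retrieved above.

```bash
curl "[GATEWAY-ENDPOINT]/v1/chat/completions" \
  -H "Authorization: Bearer $CW_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "[MODEL-NAME]",
    "messages": [
      { "role": "user", "content": "Write a haiku about GPUs." }
    ]
  }'
```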
The response is an OpenAI-compatible chat completion. Beyond the standard fields, vLLM also returns several engine-specific fields (token_ids, prompt_logprobs, kv_transfer_params, and others). These are typically null for normal requests and can be ignored.
Because the API is OpenAI-compatible, you can also send requests with the official openai library:
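A sketch using the openai Python package; the /v1 base-path suffix and passing the CoreWeave token as the API key are assumptions to verify against your gateway configuration.

```python
import os
from openai import OpenAI

# [GATEWAY-ENDPOINT] is the endpoint URL from gateway.status.endpoints.
client = OpenAI(
    base_url="[GATEWAY-ENDPOINT]/v1",
    api_key=os.environ["CW_API_TOKEN"],
)

response = client.chat.completions.create(
    model="[MODEL-NAME]",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
)
print(response.choices[0].message.content)
```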
Update a deployment
To change a deployment’s configuration, send a PATCH request with the complete deployment specification. All fields are required, not just the ones being changed, because omitted fields revert to their defaults.
This example increases the autoscaling maximum from 1 to 4:
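A hedged sketch that reuses the illustrative field names from the create request above, with only the autoscaling maximum changed; verify the exact schema in the Inference API reference.

```bash
curl -X PATCH "https://[API-HOST]/v1alpha1/inference/deployments/[DEPLOYMENT-ID]" \
  -H "Authorization: Bearer $CW_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "deployment": {
      "spec": {
        "gatewayId": "[GATEWAY-ID]",
        "instanceType": "[INSTANCE-TYPE]",
        "modelName": "[MODEL-NAME]",
        "weights": {
          "bucket": "[BUCKET-NAME]",
          "path": "[MODEL-PATH]"
        },
        "autoscaling": { "minReplicas": 1, "maxReplicas": 4 }
      }
    }
  }'
```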
Clean up
When you no longer need the inference endpoint, delete the resources to stop incurring charges. You must delete deployments before their associated gateway.
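A sketch of the cleanup calls; using the DELETE method on these resource paths is an assumption consistent with the URLs used above.

```bash
# Delete the deployment first, then its gateway.
curl -X DELETE "https://[API-HOST]/v1alpha1/inference/deployments/[DEPLOYMENT-ID]" \
  -H "Authorization: Bearer $CW_API_TOKEN"

curl -X DELETE "https://[API-HOST]/v1alpha1/inference/gateways/[GATEWAY-ID]" \
  -H "Authorization: Bearer $CW_API_TOKEN"
```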
Next steps
Explore these resources to learn more about CoreWeave Inference.
- Gateways: Configure authentication, routing strategies, and traffic splitting.
- Models and deployments: Learn about runtime configuration, GPU selection, and deployment options.
- Scaling: Configure autoscaling and reserve GPU capacity.
- Billing: Understand pricing and optimize inference costs.
- Inference API reference: Explore the full API surface.