About Dedicated Inference - CoreWeave Docs

Dedicated Inference is the bring-your-own-weights option in the CoreWeave Inference family. You upload your model artifacts to CoreWeave Object Storage, choose a GPU type and inference runtime, and CoreWeave handles cluster operations, deployment, scaling, routing, and lifecycle management. Unlike Serverless Inference, which serves models from a CoreWeave-managed catalog, Dedicated Inference deploys the weights you provide on dedicated GPU resources. To deploy your first model, see Getting started with Inference.

Dedicated Inference key capabilities

On top of the capabilities common to all CoreWeave Inference options, Dedicated Inference gives you:

Bring Your Own Weights (BYOW): Upload custom model weights to CoreWeave Object Storage and deploy them without operating the underlying clusters.
Inference runtime selection: Choose a supported inference runtime (vllm) and pin its version.
Gateways with traffic management: Configure authentication, body-based, header-based, or path-based routing, and load balancing across multiple deployments behind a single endpoint.
Capacity claims: Reserve GPU capacity for your inference workloads.
Zone selection: Deploy gateways into specific CoreWeave Availability Zones to optimize for latency, data locality, or compliance requirements.
Autoscaling: Configure minimum and maximum replicas with concurrency targets.
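
To make the autoscaling capability concrete, here is a minimal sketch of how a concurrency target can translate into a replica count. It assumes a common rule (in-flight requests divided by the per-replica concurrency target, rounded up, then clamped to the configured minimum and maximum). The function name and parameters are hypothetical, and the scaling algorithm CoreWeave actually uses is not described here.

```python
import math


def desired_replicas(in_flight_requests: int, target_concurrency: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Illustrative only: assumed concurrency-based scaling rule, not CoreWeave's."""
    if target_concurrency <= 0:
        raise ValueError("target_concurrency must be positive")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    # Replicas needed so that each one handles at most target_concurrency requests.
    needed = math.ceil(in_flight_requests / target_concurrency)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    print(desired_replicas(45, 8, min_replicas=1, max_replicas=10))
```

Under this assumed rule, idle traffic holds the deployment at the minimum replica count, and a burst never scales it past the maximum.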

Core resources

Dedicated Inference is built around three resource types you create and manage through the Inference API:

Gateways provide routable endpoints that handle authentication, load balancing, and traffic routing to your model deployments. When you create a gateway, you select a CoreWeave Availability Zone and a routing mode (body-based, header-based, or path-based). Each gateway exposes an external-facing API that your applications use to access your models.
Deployments configure model serving instances, including the inference runtime, GPU type, model weights location in Object Storage, and autoscaling parameters. Each deployment runs your model on dedicated GPU infrastructure and attaches to one or more gateways.
Capacity claims reserve GPU capacity for your inference workloads, independent of any single deployment.
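
The relationship between the three resource types can be sketched as plain data structures. Everything below is an assumption made for illustration: the class names, field names, the `x-model` header, and the `model` body field are hypothetical and do not reflect the Inference API schema. The `routing_key` helper shows one plausible reading of what each routing mode inspects.

```python
import json
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class RoutingMode(Enum):
    BODY = "body"
    HEADER = "header"
    PATH = "path"


@dataclass
class Gateway:
    name: str
    zone: str                   # CoreWeave Availability Zone (value is a placeholder)
    routing_mode: RoutingMode


@dataclass
class Deployment:
    name: str
    runtime: str                # for example "vllm"
    runtime_version: str        # pinned runtime version
    gpu_type: str
    weights_uri: str            # location of the weights in Object Storage
    min_replicas: int
    max_replicas: int
    gateways: list[str] = field(default_factory=list)  # attaches to one or more gateways


@dataclass
class CapacityClaim:
    name: str
    gpu_type: str
    gpu_count: int              # reserved independently of any single deployment


def routing_key(mode: RoutingMode, path: str, headers: dict, body: bytes) -> Optional[str]:
    """Return the value a gateway might route on (hypothetical conventions)."""
    if mode is RoutingMode.BODY:
        return json.loads(body).get("model")      # assumed JSON field name
    if mode is RoutingMode.HEADER:
        return headers.get("x-model")             # assumed header name
    return path.strip("/").split("/")[0] or None  # assumed: first path segment
```

For example, under these assumptions a body-based gateway receiving `{"model": "my-llm"}` would route on `"my-llm"`, while a path-based gateway receiving `/my-llm/v1/chat` would route on the first path segment.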

Manage endpoints

After you create your initial resources, you can manage the full lifecycle of your inference endpoints:

Create: Deploy a model endpoint by creating a gateway and one or more deployments. Select your GPU type, inference runtime, model weights location, and scaling parameters.
Update: Modify deployment configuration (scaling parameters, GPU type, or model weights) by sending a PATCH request that contains the full updated specification. CoreWeave applies updates as a rolling rollout where possible.
Delete: Remove deployments and gateways to stop serving and release the associated resources. Delete deployments before their parent gateway.
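
The update and delete rules above can be sketched with the Python standard library. This is a rough illustration under stated assumptions: the base URL, the `/deployments/{name}` path, the bearer-token header, and the spec field names are all hypothetical placeholders, not the real Inference API. The code only builds the request object and never sends it.

```python
import json
from urllib.request import Request

BASE_URL = "https://inference.example.invalid"  # hypothetical placeholder host


def build_update_request(name: str, current_spec: dict, changes: dict, token: str) -> Request:
    """Build (but do not send) a PATCH carrying the full updated specification."""
    # The update is described as requiring the full spec, so merge the
    # changes into the current spec instead of sending only the changed fields.
    full_spec = {**current_spec, **changes}
    return Request(
        url=f"{BASE_URL}/deployments/{name}",   # hypothetical path
        data=json.dumps(full_spec).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


def deletion_order(gateway: str, deployments: list[str]) -> list[tuple[str, str]]:
    """Deployments are deleted before their parent gateway."""
    return [("deployment", d) for d in deployments] + [("gateway", gateway)]


if __name__ == "__main__":
    spec = {"gpuType": "gpu-x", "minReplicas": 1, "maxReplicas": 4}  # hypothetical fields
    req = build_update_request("my-deployment", spec, {"maxReplicas": 8}, token="TOKEN")
    print(req.get_method(), req.full_url)
    print(deletion_order("my-gateway", ["my-deployment"]))
```

Merging before sending keeps unchanged fields in the request body, which matters if the API treats the PATCH body as the complete desired state.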

Monitoring

CoreWeave provides a Grafana dashboard to monitor inference usage and operational metrics. Use it to track request throughput, latency, GPU utilization, and endpoint health across your deployments. For access instructions, see Introduction to CoreWeave Grafana.

Pricing

Dedicated Inference bills either by node usage or by GPU usage:

Node-based billing: If you have existing reserved nodes, you can redirect that reserved capacity to the inference platform at your existing rates.
GPU-based billing: For on-demand workloads, you pay per GPU-hour. Choose the GPU type and view current rates on the CoreWeave pricing page.
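
For GPU-based billing, a back-of-the-envelope estimate multiplies GPU-hours by the hourly rate. The sketch below assumes that each running replica is billed for all of its GPUs for the whole period. The function names are hypothetical and the rate is a made-up placeholder, so take real rates from the CoreWeave pricing page.

```python
def gpu_hours(replicas: int, gpus_per_replica: int, hours: float) -> float:
    """Total GPU-hours consumed, assuming every replica runs for the full period."""
    return replicas * gpus_per_replica * hours


def on_demand_cost(replicas: int, gpus_per_replica: int, hours: float,
                   rate_per_gpu_hour: float) -> float:
    """Estimated on-demand cost: GPU-hours multiplied by the per-GPU-hour rate."""
    return gpu_hours(replicas, gpus_per_replica, hours) * rate_per_gpu_hour


if __name__ == "__main__":
    # Placeholder rate of 2.50 per GPU-hour; not a real CoreWeave price.
    print(on_demand_cost(replicas=2, gpus_per_replica=8, hours=24, rate_per_gpu_hour=2.50))
```

With autoscaling enabled, the replica count varies over time, so a closer estimate would sum GPU-hours per interval at the replica count observed in that interval.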

For details on autoscaling, see Scaling. For pricing and cost-control patterns, see Billing.
