Dedicated Inference is the bring-your-own-weights option in the CoreWeave Inference family. You upload your model artifacts to CoreWeave Object Storage, choose a GPU type and inference runtime, and CoreWeave handles cluster operations, deployment, scaling, routing, and lifecycle management. Unlike Serverless Inference, which serves models from a CoreWeave-managed catalog, Dedicated Inference deploys your weights on dedicated GPU resources. To deploy your first model, see Getting started with Inference.
Dedicated Inference key capabilities
On top of the capabilities common to all CoreWeave Inference options, Dedicated Inference gives you:
- Bring Your Own Weights (BYOW): Upload custom model weights to CoreWeave Object Storage and deploy them without operating the underlying clusters.
- Inference runtime selection: Choose a supported inference runtime (currently vllm) with version pinning.
- Gateways with traffic management: Configure authentication, body-based or path-based routing, and load balancing across multiple deployments behind a single endpoint.
- Capacity claims: Reserve guaranteed GPU capacity for predictable performance and cost.
- Zone selection: Deploy gateways into specific CoreWeave Availability Zones to optimize for latency, data locality, or compliance requirements.
- Autoscaling: Configure minimum and maximum replicas with concurrency targets.
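To make the autoscaling parameters concrete, the sketch below shows one common way a concurrency target can translate into a replica count: divide in-flight requests by the target, round up, and clamp to the configured bounds. This is an illustrative assumption, not CoreWeave's documented scaling algorithm, and the function and parameter names are hypothetical.

```python
import math


def desired_replicas(in_flight_requests: int, concurrency_target: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Return ceil(in_flight / target), clamped to [min_replicas, max_replicas].

    Hypothetical helper: a generic concurrency-based scaling rule, not the
    platform's actual autoscaler.
    """
    if concurrency_target <= 0:
        raise ValueError("concurrency_target must be positive")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    wanted = math.ceil(in_flight_requests / concurrency_target)
    return max(min_replicas, min(max_replicas, wanted))


# 45 concurrent requests at a target of 8 per replica needs 6 replicas.
print(desired_replicas(45, 8, min_replicas=1, max_replicas=10))   # 6
# Idle traffic falls back to the minimum; a burst is capped at the maximum.
print(desired_replicas(0, 8, min_replicas=1, max_replicas=10))    # 1
print(desired_replicas(500, 8, min_replicas=1, max_replicas=10))  # 10
```

Under this rule, the minimum sets the baseline you always pay for, and the maximum bounds both cost and the load a deployment can absorb.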
Core resources
Dedicated Inference is built around three resource types you create and manage through the Inference API:
- Gateways provide routable endpoints that handle authentication, load balancing, and traffic routing to your model deployments. When you create a gateway, you select a CoreWeave Availability Zone and a routing mode (body-based, header-based, or path-based). Each gateway exposes an external-facing API that your applications use to access your models.
- Deployments configure model serving instances, including the inference runtime, GPU type, model weights location in Object Storage, and autoscaling parameters. Each deployment runs your model on dedicated GPU infrastructure and is associated with one or more gateways.
- Capacity claims manage hardware resource reservations, providing guaranteed GPU capacity for your inference workloads independent of any single deployment.
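The relationships between the three resource types can be sketched as plain data classes. This is only a mental model: every field name, the routing-mode strings, and the placeholder values (zone, GPU type, bucket path) are assumptions for illustration and do not reflect the actual Inference API schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical labels for the three routing modes named in the text.
ROUTING_MODES = {"body", "header", "path"}


@dataclass
class CapacityClaim:
    """Reserved GPU capacity, independent of any single deployment."""
    name: str
    gpu_type: str
    gpu_count: int


@dataclass
class Gateway:
    """Routable endpoint placed in one Availability Zone."""
    name: str
    zone: str
    routing_mode: str


@dataclass
class Deployment:
    """Model serving configuration attached to one or more gateways."""
    name: str
    runtime: str
    runtime_version: str
    gpu_type: str
    weights_location: str
    min_replicas: int
    max_replicas: int
    concurrency_target: int
    gateways: List[str] = field(default_factory=list)


def validate(deployment: Deployment, gateways: Dict[str, Gateway]) -> List[str]:
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []
    if not deployment.gateways:
        problems.append("deployment is not associated with any gateway")
    for name in deployment.gateways:
        gateway = gateways.get(name)
        if gateway is None:
            problems.append(f"unknown gateway: {name}")
        elif gateway.routing_mode not in ROUTING_MODES:
            problems.append(f"unsupported routing mode: {gateway.routing_mode}")
    if deployment.min_replicas > deployment.max_replicas:
        problems.append("min_replicas exceeds max_replicas")
    return problems


gateways = {"public": Gateway("public", zone="EXAMPLE-ZONE-A", routing_mode="path")}
deployment = Deployment(
    name="my-model",
    runtime="vllm",
    runtime_version="EXAMPLE-VERSION",               # placeholder pin
    gpu_type="EXAMPLE-GPU",                          # placeholder GPU type
    weights_location="s3://example-bucket/my-model/",  # placeholder path
    min_replicas=1,
    max_replicas=4,
    concurrency_target=8,
    gateways=["public"],
)
print(validate(deployment, gateways))  # []
```

Modeling a capacity claim as a separate object mirrors the point above that reservations exist independently of any one deployment.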
Manage endpoints
After you create your initial resources, you can manage the full lifecycle of your inference endpoints:
- Create: Deploy a model endpoint by creating a gateway and one or more deployments. Select your GPU type, inference runtime, model weights location, and scaling parameters.
- Update: Modify deployment configuration (scaling parameters, GPU type, or model weights) by sending a PATCH with the full updated specification. Updates are applied as a rolling rollout where possible.
- Delete: Remove deployments and gateways to stop serving and release the associated resources. Delete deployments before their parent gateway.
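Two details in the list above are easy to get wrong: the update sends the full specification rather than only the changed fields, and deployments are deleted before their gateway. The sketch below illustrates both using only the Python standard library. The base URL, resource path, and spec field names are hypothetical placeholders, not the real Inference API, and the request is built but never sent.

```python
import copy
import json
from urllib.request import Request

# Hypothetical base URL; consult the Inference API reference for the real one.
BASE_URL = "https://inference.example.invalid/v1"


def build_update_request(deployment_name: str, current_spec: dict,
                         changes: dict) -> Request:
    """Build (but do not send) a PATCH carrying the full updated spec."""
    full_spec = copy.deepcopy(current_spec)
    full_spec.update(changes)  # shallow merge of top-level fields
    return Request(
        url=f"{BASE_URL}/deployments/{deployment_name}",
        data=json.dumps(full_spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


def deletion_order(gateway: str, deployments: list) -> list:
    """List resources in a safe deletion order: deployments, then gateway."""
    return [("deployment", name) for name in deployments] + [("gateway", gateway)]


current = {"gpu_type": "EXAMPLE-GPU", "min_replicas": 1, "max_replicas": 2}
request = build_update_request("my-model", current, {"max_replicas": 4})
print(request.get_method())      # PATCH
print(json.loads(request.data))  # full spec, including unchanged fields
print(deletion_order("public", ["my-model", "my-model-canary"]))
```

Copying the current spec before merging keeps the caller's dictionary unchanged, so a failed update does not leave local state out of sync with what is deployed.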
Monitoring
CoreWeave provides a Grafana dashboard for monitoring inference usage and operational metrics. Use it to track request throughput, latency, GPU utilization, and endpoint health across your deployments. For access instructions, see Introduction to CoreWeave Grafana.
Pricing
Dedicated Inference incurs no additional platform fees. You pay only for the underlying GPU or node usage consumed by your deployments:
- Node-based billing: If you have existing reserved nodes, you can redirect reserved capacity to the inference platform at your existing rates.
- GPU-based billing: For on-demand workloads, you pay per GPU-hour. Choose the GPU type and view current rates on the CoreWeave pricing page.
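For a rough sense of GPU-based billing, the arithmetic can be sketched as GPUs per replica times replica-hours times the hourly rate. The rate below is a made-up placeholder, not a CoreWeave price, and the sketch assumes cost scales linearly with GPU-hours. Check the pricing page for actual rates and billing granularity.

```python
def estimate_gpu_cost(gpus_per_replica: int, replica_hours: float,
                      rate_per_gpu_hour: float) -> float:
    """Estimate on-demand cost as GPU-hours multiplied by an hourly rate.

    Hypothetical helper; assumes linear per-GPU-hour billing.
    """
    gpu_hours = gpus_per_replica * replica_hours
    return gpu_hours * rate_per_gpu_hour


# Two replicas running for 24 hours is 48 replica-hours.
# With 4 GPUs per replica at a placeholder rate of 2.50 per GPU-hour:
print(estimate_gpu_cost(4, 48, 2.50))  # 480.0
```

Because autoscaling changes the replica count over time, replica-hours (rather than a fixed replica count) is the more useful input for an estimate.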