Dedicated Inference key capabilities
On top of the capabilities common to all CoreWeave Inference options, Dedicated Inference gives you:
- Bring Your Own Weights (BYOW): Upload custom model weights to CoreWeave Object Storage and deploy them without operating the underlying clusters.
- Inference runtime selection: Choose a supported inference runtime (vllm) with version pinning.
- Gateways with traffic management: Configure authentication, body-based or path-based routing, and load balancing across multiple deployments behind a single endpoint.
- Capacity claims: Reserve GPU capacity for your inference workloads.
- Zone selection: Deploy gateways into specific CoreWeave Availability Zones to optimize for latency, data locality, or compliance requirements.
- Autoscaling: Configure minimum and maximum replicas with concurrency targets.
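To make the autoscaling capability concrete, here is a minimal sketch of how a minimum, a maximum, and a concurrency target can combine into a replica count. It uses a common concurrency-based rule (in-flight requests divided by the per-replica target, rounded up, then clamped). This is an assumption for illustration, not a description of CoreWeave's actual scaling algorithm, and the function and parameter names are hypothetical.

```python
import math


def desired_replicas(in_flight: int, target_concurrency: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Illustrative rule only: size the fleet so that each replica handles
    at most `target_concurrency` in-flight requests, within the bounds."""
    if target_concurrency <= 0:
        raise ValueError("target_concurrency must be positive")
    if not 0 <= min_replicas <= max_replicas:
        raise ValueError("require 0 <= min_replicas <= max_replicas")
    needed = math.ceil(in_flight / target_concurrency)
    # Clamp the computed value to the configured minimum and maximum.
    return max(min_replicas, min(max_replicas, needed))


# Hypothetical load levels against a target of 8 concurrent requests per replica.
for load in (0, 20, 45):
    print(load, desired_replicas(load, target_concurrency=8,
                                 min_replicas=1, max_replicas=4))
```

Under this rule, the minimum keeps capacity warm when traffic is idle, and the maximum caps GPU spend when traffic spikes.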
Core resources
Dedicated Inference is built around three resource types you create and manage through the Inference API:
- Gateways provide routable endpoints that handle authentication, load balancing, and traffic routing to your model deployments. When you create a gateway, you select a CoreWeave Availability Zone and a routing mode (body-based, header-based, or path-based). Each gateway exposes an external-facing API that your applications use to access your models.
- Deployments configure model serving instances, including the inference runtime, GPU type, model weights location in Object Storage, and autoscaling parameters. Each deployment runs your model on dedicated GPU infrastructure and attaches to one or more gateways.
- Capacity claims manage hardware resource reservations and provide reserved GPU capacity for your inference workloads independent of any single deployment.
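The relationship between the three resource types can be sketched as plain data classes. Every field name, zone name, GPU type, and URI below is a hypothetical placeholder chosen for illustration. None of it is the Inference API schema, so consult the API reference for the real field names.

```python
from dataclasses import dataclass, field

# Routing modes as listed in the gateway description above (labels are illustrative).
ROUTING_MODES = {"body", "header", "path"}


@dataclass
class Gateway:
    name: str
    zone: str            # a CoreWeave Availability Zone (placeholder value below)
    routing_mode: str

    def __post_init__(self):
        if self.routing_mode not in ROUTING_MODES:
            raise ValueError(f"unknown routing mode: {self.routing_mode!r}")


@dataclass
class CapacityClaim:
    # Reserved GPU capacity, independent of any single deployment.
    name: str
    gpu_type: str
    gpu_count: int


@dataclass
class Deployment:
    name: str
    runtime: str          # for example "vllm"
    runtime_version: str  # pinned version (placeholder value below)
    gpu_type: str
    weights_uri: str      # location of the model weights in Object Storage
    min_replicas: int
    max_replicas: int
    target_concurrency: int
    gateways: list[str] = field(default_factory=list)

    def __post_init__(self):
        # A deployment attaches to one or more gateways.
        if not self.gateways:
            raise ValueError("a deployment must attach to at least one gateway")
        if not 0 <= self.min_replicas <= self.max_replicas:
            raise ValueError("require 0 <= min_replicas <= max_replicas")


gateway = Gateway(name="example-gateway", zone="EXAMPLE-ZONE-A", routing_mode="path")
claim = CapacityClaim(name="example-claim", gpu_type="example-gpu", gpu_count=8)
deployment = Deployment(
    name="example-deployment", runtime="vllm", runtime_version="x.y.z",
    gpu_type="example-gpu", weights_uri="s3://example-bucket/models/example/",
    min_replicas=1, max_replicas=4, target_concurrency=8,
    gateways=[gateway.name],
)
print(deployment)
```

Keeping the capacity claim as a separate object mirrors the description above: reserved capacity outlives any individual deployment that draws on it.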
Manage endpoints
After you create your initial resources, you can manage the full lifecycle of your inference endpoints:
- Create: Deploy a model endpoint by creating a gateway and one or more deployments. Select your GPU type, inference runtime, model weights location, and scaling parameters.
- Update: Modify deployment configuration (scaling parameters, GPU type, or model weights) by sending a PATCH with the full updated specification. CoreWeave applies updates with a rolling rollout where possible.
- Delete: Remove deployments and gateways to stop serving and release the associated resources. Delete deployments before their parent gateway.
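Two lifecycle rules above lend themselves to a small sketch: an update carries the full updated specification rather than only the changed fields, and deployments are deleted before their parent gateway. The helpers below only build the request body and the deletion order. They make no network calls, and the spec keys and resource names are hypothetical rather than taken from the Inference API.

```python
import copy


def full_updated_spec(current: dict, changes: dict) -> dict:
    """Return a complete spec with `changes` applied on top of `current`.

    Nested dictionaries are merged recursively; inputs are not mutated.
    """
    merged = copy.deepcopy(current)
    for key, value in changes.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = full_updated_spec(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def deletion_order(gateway: str, deployments: list[str]) -> list[tuple[str, str]]:
    """List resources in a safe deletion order: deployments, then the gateway."""
    steps = [("deployment", name) for name in deployments]
    steps.append(("gateway", gateway))
    return steps


# Hypothetical current spec; the keys are placeholders, not the real schema.
current = {
    "gpuType": "example-gpu",
    "scaling": {"minReplicas": 1, "maxReplicas": 4, "targetConcurrency": 8},
}
patch_body = full_updated_spec(current, {"scaling": {"maxReplicas": 6}})
print(patch_body)
print(deletion_order("example-gateway", ["deploy-a", "deploy-b"]))
```

Building the body from the current spec avoids accidentally dropping unchanged fields when the update expects a complete specification.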
Monitoring
CoreWeave provides a Grafana dashboard to monitor inference usage and operational metrics. Use it to track request throughput, latency, GPU utilization, and endpoint health across your deployments. For access instructions, see Introduction to CoreWeave Grafana.
Pricing
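Dedicated Inference incurs no additional platform fees. You pay only for the underlying GPU or node usage that your deployments consume:
- Node-based billing: For customers with existing reserved nodes, you can redirect reserved capacity to the inference platform at your existing rates.
- GPU-based billing: For on-demand workloads, you pay per GPU-hour. Choose the GPU type and view current rates on the CoreWeave pricing page.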
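For GPU-based billing, a rough estimate follows from simple arithmetic: GPU-hours consumed multiplied by the hourly rate, with no platform fee added. The sketch below assumes a constant replica count over the period, and the rate is a made-up placeholder, not a CoreWeave price. Take real rates from the CoreWeave pricing page.

```python
def gpu_hours(gpus_per_replica: int, replicas: int, hours: float) -> float:
    """GPU-hours consumed, assuming a constant replica count for the period."""
    return gpus_per_replica * replicas * hours


def estimated_cost(total_gpu_hours: float, rate_per_gpu_hour: float) -> float:
    """Estimated on-demand cost: usage times rate, with no platform fee."""
    return total_gpu_hours * rate_per_gpu_hour


# Placeholder inputs for illustration only.
usage = gpu_hours(gpus_per_replica=8, replicas=2, hours=24)
print(usage)
print(estimated_cost(usage, rate_per_gpu_hour=2.50))
```

With autoscaling enabled, the replica count varies over time, so a real estimate would sum usage over the intervals at each replica count.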