Choose a deployment option
Serverless Inference
Run inference without managing infrastructure. CoreWeave handles provisioning, scaling, and operations automatically. Choose from a CoreWeave-managed catalog of models.
Dedicated Inference
Bring your own model weights and deploy on dedicated GPU infrastructure. Full control over gateways, scaling, and capacity reservations, while CoreWeave manages the underlying clusters.
Inference on CKS
Run inference workloads on CoreWeave Kubernetes Service with complete control over your deployment stack, runtimes, and networking.
Key capabilities
CoreWeave Inference provides the following capabilities across all options:- NVIDIA GPU access: Run on H100, B200, A100, and other NVIDIA GPU types.
- OpenAI-compatible endpoints: Connect with existing OpenAI client libraries, agents, and tooling.
- CoreWeave platform integration: Use the same accounts, Identity and Access Management (IAM), observability, and billing as the rest of CoreWeave.
Access the API
You configure Dedicated Inference through the CoreWeave Inference management API, available over:- REST/JSON: HTTP/1.1 with JSON, available at
api.coreweave.com. - gRPC: Protocol buffers over HTTP/2.
- Terraform: Manage inference resources through the CoreWeave Terraform Provider.
Pricing
CoreWeave Inference has no additional platform fees. You are billed only for the underlying GPU or node usage consumed by your workloads. The exact billing model depends on the deployment option you choose:- Serverless Inference: pay-per-token billing for catalog models.
- Dedicated Inference: GPU-hour or node-based billing, depending on your contract. For details, see Dedicated Inference pricing.
- Inference on CKS: standard CKS billing (GPU-hour or reserved node).