CoreWeave Inference is a family of model-serving products that run on CoreWeave GPU infrastructure. Choose the deployment option that matches how much control you want over the serving stack: from fully managed (Serverless) to fully self-managed (CKS), with a managed-but-configurable middle ground (Dedicated). All three options expose OpenAI API-compatible endpoints, so existing tools and applications that work with the OpenAI API can connect with minimal changes.Documentation Index
Fetch the complete documentation index at: https://docs.coreweave.com/llms.txt
Use this file to discover all available pages before exploring further.
Choose a deployment option
Serverless Inference
Run inference without managing infrastructure. CoreWeave handles provisioning, scaling, and operations automatically. Choose from a CoreWeave-managed catalog of models.
Dedicated Inference
Bring your own model weights and deploy on dedicated GPU infrastructure. Full control over gateways, scaling, and capacity reservations, while CoreWeave manages the underlying clusters.
Inference on CKS
Run inference workloads on CoreWeave Kubernetes Service with complete control over your deployment stack, runtimes, and networking.
Key capabilities
CoreWeave Inference provides the following capabilities across all options:- NVIDIA GPU access: Run on H100, B200, A100, and other NVIDIA GPU types.
- OpenAI-compatible endpoints: Connect with existing OpenAI client libraries, agents, and tooling.
- CoreWeave platform integration: Use the same accounts, IAM, observability, and billing as the rest of CoreWeave.
Access the API
Dedicated Inference is configured through the CoreWeave Inference management API, available over:- REST/JSON: HTTP/1.1 with JSON, available at
api.coreweave.com. - gRPC: Protocol buffers over HTTP/2.
- Terraform: Manage inference resources through the CoreWeave Terraform Provider.
Pricing
CoreWeave Inference has no additional platform fees. You are billed only for the underlying GPU or node usage consumed by your workloads. The exact billing model depends on the deployment option you choose:- Serverless Inference: Pay-per-token billing for catalog models.
- Dedicated Inference: GPU-hour or node-based billing, depending on your contract. See Dedicated Inference pricing for details.
- Inference on CKS: Standard CKS billing (GPU-hour or reserved node).