

CoreWeave Inference incurs no additional platform fees. You are billed only for the underlying compute resources consumed by your deployments.

Billing models

Node-based billing is for customers who already have reserved node capacity. You can redirect existing reserved nodes from training or other workloads to the inference platform at your existing rates; there are no charges beyond the existing reservation cost.

GPU-based billing is for on-demand workloads. You pay per GPU-hour based on the instance type. Inference compute is metered in GPU-hours at the deployment level, and current rates are listed on the CoreWeave pricing page.
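To make the GPU-hour model concrete, here is a minimal sketch of the billing arithmetic in Python. The instance names and rates are placeholders, not CoreWeave's actual prices; the point is that cost is GPU count × hours × the published per-GPU-hour rate, aggregated per deployment.

```python
# Minimal sketch of GPU-hour billing arithmetic.
# Rates and instance names below are hypothetical placeholders;
# use the current per-GPU-hour rates from the CoreWeave pricing page.

HOURS_PER_MONTH = 730  # average hours in a calendar month

RATES_USD_PER_GPU_HOUR = {
    "gpu-small": 2.50,  # placeholder rate
    "gpu-large": 6.00,  # placeholder rate
}

def monthly_cost(instance_type: str, gpus_per_replica: int, avg_replicas: float) -> float:
    """Deployment-level cost: GPUs per replica x average replicas x hours x rate."""
    rate = RATES_USD_PER_GPU_HOUR[instance_type]
    return gpus_per_replica * avg_replicas * HOURS_PER_MONTH * rate

# Example: an 8-GPU instance type averaging 2.5 replicas over the month.
print(f"${monthly_cost('gpu-large', 8, 2.5):,.2f} per month")
```

Averaging replicas over the month captures the effect of autoscaling: fewer replica-hours means fewer billed GPU-hours.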

Cost optimization

Follow these guidelines to reduce inference costs.
  • Right-size your GPU selection. Choose the smallest instance type that meets your model’s memory and throughput requirements (see the sketch after this list).
  • Use autoscaling to match demand. Scaling down during low-traffic periods reduces costs. Set the minimum replica count to the lowest value that still meets your latency requirements.
  • Consider reserved capacity for steady-state workloads. Capacity claims with reserved nodes offer predictable pricing for workloads with consistent demand.
  • Monitor replica utilization. If replicas are consistently underutilized, consider reducing the maximum replica count or switching to a smaller instance type.
  • Use scaling priority. When multiple deployments share reserved capacity, set priorities so that higher-value workloads scale first.
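For the right-sizing guideline above, the following is a minimal, hypothetical sketch. The instance catalog, memory figures, and rates are illustrative assumptions, not CoreWeave's actual offerings; the idea is to pick the cheapest instance whose total GPU memory covers the model plus some headroom.

```python
# Hypothetical right-sizing helper. The catalog below is illustrative,
# not CoreWeave's real instance lineup; substitute actual specs and rates.

from dataclasses import dataclass

@dataclass
class InstanceType:
    name: str
    gpu_memory_gib: int      # memory per GPU
    gpus: int                # GPUs per instance
    usd_per_gpu_hour: float  # placeholder rate

CATALOG = [
    InstanceType("gpu-small", 24, 1, 1.50),
    InstanceType("gpu-medium", 48, 4, 3.00),
    InstanceType("gpu-large", 80, 8, 6.00),
]

def right_size(model_memory_gib: float, headroom: float = 1.2) -> InstanceType:
    """Return the cheapest instance whose total GPU memory covers the
    model weights plus headroom for KV cache and activations."""
    needed = model_memory_gib * headroom
    candidates = [i for i in CATALOG if i.gpu_memory_gib * i.gpus >= needed]
    if not candidates:
        raise ValueError("No instance type fits this model")
    return min(candidates, key=lambda i: i.usd_per_gpu_hour * i.gpus)

# Example: a 70 GiB model (84 GiB with headroom) selects gpu-medium here.
print(right_size(70.0).name)
```

The headroom factor is an assumption; in practice, size it for your serving stack's KV cache, activation memory, and batch size, then validate against observed replica utilization.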