Deploy vLLM for Inference

Deploy and scale vLLM inference workloads on CoreWeave Kubernetes Service (CKS)

Outline

This long-form tutorial consists of the pages under this section. They are designed to be followed in the order they are numbered.

In this tutorial, you will:

  1. Set up infrastructure dependencies.
  2. Configure monitoring and observability.
  3. Deploy the vLLM inference service.
  4. Monitor performance and test autoscaling.

Architecture overview

The complete vLLM inference solution consists of several components working together:

  • vLLM service: The main inference engine running your language model
  • Traefik ingress: Handles external traffic routing and TLS termination
  • cert-manager: Manages automatic SSL certificate generation and renewal
  • Prometheus: Collects metrics from vLLM and other components
  • Grafana: Provides dashboards for monitoring inference performance
  • KEDA: Enables autoscaling based on custom metrics like request queue depth
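
If you want a quick picture of which of these components are already running in your cluster (they are installed in later pages of this tutorial), you can check their workloads namespace by namespace. The namespace names below (traefik, cert-manager, monitoring, keda, and vllm) are assumptions for illustration; substitute the namespaces you actually use.

    $ kubectl get pods -n traefik        # Traefik ingress controller
    $ kubectl get pods -n cert-manager   # cert-manager controllers and webhook
    $ kubectl get pods -n monitoring     # Prometheus and Grafana
    $ kubectl get pods -n keda           # KEDA operator and metrics server
    $ kubectl get pods -n vllm           # vLLM inference Deployment and Service

Each command should list one or more pods in the Running state once the corresponding component has been deployed.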

Prerequisites

Verify the following:

  • You can access your cluster using kubectl.

    For example, run the following command:

    $ kubectl cluster-info

    You should see something similar to the following:

    Kubernetes control plane is running at...
    CoreDNS is running at...
    node-local-dns is running at...
  • Your cluster has at least one CPU node.

    For example, run the following command:

    $ kubectl get nodes -o=custom-columns="NAME:metadata.name,CLASS:metadata.labels['node\.coreweave\.cloud\/class']"

    You should see something similar to the following:

    NAME      CLASS
    g137a10   gpu
    g5424e0   cpu
    g77575e   cpu
    gd926d4   gpu
  • Your CKS cluster has GPU nodes with at least 16 GB of GPU memory, which is required by the Llama 3.1 8B Instruct model used in this tutorial.

    For example, run the following command:

    $ kubectl get nodes -o=custom-columns="NAME:metadata.name,IP:status.addresses[?(@.type=='InternalIP')].address,TYPE:metadata.labels['node\.coreweave\.cloud\/type'],RESERVED:metadata.labels['node\.coreweave\.cloud\/reserved'],NODEPOOL:metadata.labels['compute\.coreweave\.com\/node-pool'],READY:status.conditions[?(@.type=='Ready')].status,GPU:metadata.labels['gpu\.nvidia\.com/model'],VRAM:metadata.labels['gpu\.nvidia\.com/vram']"

    You should see something similar to the following:

    NAME      IP               TYPE         RESERVED   NODEPOOL    READY   GPU           VRAM
    g80eac0   10.176.212.195   gd-1xgh200   cw9a2f     infer-gpu   True    GH200_480GB   97
    gf2809a   10.176.244.33    turin-gp-l   cw9a2f     infer-cpu   True

    Under VRAM, the number should be 16 or greater. To filter for qualifying nodes programmatically, see the sketch after this list.

    Tip

    To further debug and diagnose cluster problems, use kubectl cluster-info dump.
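
If your cluster mixes several GPU node types, you can filter for nodes with enough GPU memory instead of scanning the table by eye. The following is a minimal sketch that assumes every GPU node carries the gpu.nvidia.com/vram label (in GB, as shown in the output above) and that jq is installed on your workstation:

    $ kubectl get nodes -o json | jq -r \
        '.items[] | select((.metadata.labels["gpu.nvidia.com/vram"] // "0") | tonumber >= 16) | .metadata.name'

This prints only the names of nodes reporting 16 GB of GPU memory or more; an empty result means no node currently meets the requirement.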

Additional resources and information

The following tools are preinstalled on CKS worker nodes:

  • Docker: Container runtime for running vLLM inference pods
  • NVIDIA drivers: GPU drivers for CUDA acceleration
  • CoreWeave CSI drivers: Storage drivers for persistent volumes
  • CoreWeave CNI: Network plugins for pod communication
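
As a quick sanity check that the GPU drivers and device plugin are working end to end, you can confirm that each GPU node advertises allocatable nvidia.com/gpu resources to the scheduler. This is a minimal sketch that assumes the standard NVIDIA resource name and reuses the custom-columns style from the commands above:

    $ kubectl get nodes -o=custom-columns="NAME:metadata.name,GPUS:status.allocatable['nvidia\.com/gpu']"

GPU nodes should report a non-zero count; CPU-only nodes show no value.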

To learn more about vLLM and inference on CoreWeave, check out the following resources: