# 3. Deploy vLLM inference service

> Deploy and configure your vLLM inference service with model caching and autoscaling

## Overview

Now that your infrastructure and monitoring are set up, you'll deploy the vLLM inference service. This step covers configuring your deployment, deploying the service, and verifying it's working correctly.

## Step 1: Configure your deployment

Choose from the example configurations in the `hack/` directory, or customize `values.yaml` for your specific model and requirements.

For this guide, we'll deploy Llama 3.1 8B Instruct.

From the `inference/basic` directory, copy the example values file to a working copy:

```bash theme={"system"}
cp hack/values-llama-small.yaml my-values.yaml
```

## Step 2: Update cluster-specific settings

Edit `my-values.yaml` and update the following required fields:

```yaml theme={"system"}
ingress:
  clusterName: "your-cluster-name"    # Replace with your CKS cluster name
  orgID: "your-org-id"                # Replace with your organization ID
```

* `orgID`: Find your organization ID on the [CoreWeave Console settings page](https://console.coreweave.com/account/settings).
* `clusterName`: Find your cluster name on the [CoreWeave Console Cluster page](https://console.coreweave.com/clusters).
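
Before deploying, it can help to confirm that no placeholder values remain in the file. The following is a minimal sketch, not part of the chart: `check_placeholders` is a hypothetical helper, and it assumes the placeholder strings are exactly the ones shown in the example above.

```bash
# Hypothetical helper: fail if the values file still contains the
# placeholder strings from the example configuration.
check_placeholders() {
  local file="$1"
  # grep -n prints each offending line with its line number.
  if grep -nE 'your-cluster-name|your-org-id' "$file"; then
    echo "Placeholders still present in $file" >&2
    return 1
  fi
  echo "No placeholders found in $file"
}
```

For example, run `check_placeholders my-values.yaml` and fix any lines it prints before continuing.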

## Step 3: Deploy the vLLM service

Install the vLLM inference chart:

```bash theme={"system"}
helm install basic-inference ./ \
  --namespace inference \
  --create-namespace \
  -f my-values.yaml
```

You should see output similar to the following:

```text theme={"system"}
NAME: basic-inference
LAST DEPLOYED: Mon Aug 18:10:27
NAMESPACE: inference
STATUS: deployed
REVISION: 1
TEST SUITE: None
```
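
`helm install` fails if a release named `basic-inference` already exists, which can happen when you re-run this step after editing `my-values.yaml`. One way to make the step repeatable is `helm upgrade --install`, which installs the release if it is missing and upgrades it otherwise. The wrapper below is a sketch under that assumption; `deploy_chart` is a hypothetical name, and the release name, namespace, and chart path are taken from the command above.

```bash
# Hypothetical wrapper: install the release if it is missing, or upgrade it
# in place if it already exists. Run it from the inference/basic directory.
deploy_chart() {
  local values_file="${1:-my-values.yaml}"
  helm upgrade --install basic-inference ./ \
    --namespace inference \
    --create-namespace \
    -f "$values_file"
}
```

Calling `deploy_chart` with no arguments uses `my-values.yaml`. Pass a different file name to deploy another configuration.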

## Step 4: Monitor deployment progress

Watch the pod status as the deployment rolls out:

```bash theme={"system"}
kubectl get pods -n inference -w
```

The initial deployment can take 10 to 30 minutes while the model weights are downloaded and cached.

You should see output similar to:

```text theme={"system"}
NAME                               READY   STATUS    RESTARTS   AGE
basic-inference-577c5675c8-vm2nc   1/1     Running   0          13m
```

Once the pod shows `1/1` under `READY` and `Running` under `STATUS`, press Ctrl+C to stop watching and go to the next step.
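
If you are scripting the deployment rather than watching it interactively, `kubectl wait` can block until the pods are ready. The following is a sketch only: `wait_for_pods` is a hypothetical helper, and the 30-minute default timeout is an assumption chosen to cover a slow model download.

```bash
# Hypothetical helper: block until every pod in the namespace reports the
# Ready condition, or give up after the timeout.
# Arguments: namespace (default: inference), timeout (default: 30m).
wait_for_pods() {
  local namespace="${1:-inference}"
  local timeout="${2:-30m}"
  kubectl wait pod --all \
    --namespace "$namespace" \
    --for=condition=Ready \
    --timeout="$timeout"
}
```

Run `wait_for_pods` after the Helm release is created. If no pods exist yet in the namespace, `kubectl wait` may return an error immediately, so you might need to retry after a few seconds.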

<Note>
  **Debugging tip**

  Model download failures:

  * Ensure internet connectivity from worker nodes
  * Check Hugging Face token for gated models
  * Verify sufficient storage in model cache PVC

  Pods stuck in pending state:

  * Check GPU node availability: `kubectl get nodes -l node-role.kubernetes.io/worker=true`
  * Verify resource requests don't exceed node capacity
</Note>

## Step 5: Check service and ingress

Verify that the service and ingress are properly configured:

```bash theme={"system"}
kubectl get svc,ingress -n inference
```

You should see output similar to:

```text theme={"system"}
NAME                         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/basic-inference-vllm ClusterIP   10.96.123.45    <none>        8000/TCP  5m

NAME                                      CLASS    HOSTS                                         ADDRESS      PORTS     AGE
ingress.networking.k8s.io/basic-inference traefik  basic-inference.your-cluster.coreweave.cloud  ***.**       80, 443   5m
```

<Note>
  **Debugging tip**

  Ingress not accessible:

  * Confirm Traefik is running: `kubectl get pods -n traefik`
  * Check cert-manager for certificate issues: `kubectl get certificates -n inference`
  * Verify DNS resolution to your cluster's load balancer
</Note>

## Step 6: Access and test your inference service

### Get the service endpoint

Retrieve the external URL for your vLLM service:

```bash theme={"system"}
export VLLM_ENDPOINT="$(kubectl get ingress basic-inference -n inference -o=jsonpath='{.spec.rules[0].host}')"
echo "Your vLLM endpoint: https://$VLLM_ENDPOINT"
```

### Test service health

Verify the service is responding:

```bash theme={"system"}
curl -s -o /dev/null -w "%{http_code}\n" "https://$VLLM_ENDPOINT/health"
```

You should see `200`.
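
The endpoint may not return `200` right away, for example while the model is still loading. A retry loop is one way to handle this. The sketch below assumes a hypothetical helper named `wait_for_health`; the default of 30 attempts with a 10-second delay is an arbitrary choice, not a recommendation from the chart.

```bash
# Hypothetical helper: poll a URL until it returns HTTP 200.
# Arguments: URL, max attempts (default 30), delay in seconds (default 10).
wait_for_health() {
  local url="$1" attempts="${2:-30}" delay="${3:-10}"
  local i code
  for ((i = 1; i <= attempts; i++)); do
    # If the connection fails, curl exits non-zero; "|| true" keeps the
    # loop going so the next attempt can run.
    code="$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)"
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "not healthy after $attempts attempts (last status: $code)" >&2
  return 1
}
```

For example: `wait_for_health "https://$VLLM_ENDPOINT/health"`.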

### Get available models

Query the models endpoint and store the ID of the first loaded model:

```bash theme={"system"}
export VLLM_MODEL="$(curl -s "https://$VLLM_ENDPOINT/v1/models" | jq -r '.data[0].id')"
echo "Available model: $VLLM_MODEL"
```

You should see the following output:

```text theme={"system"}
Available model: meta-llama/Llama-3.1-8B-Instruct
```

### Run inference

Test the model with a simple chat completion:

```bash theme={"system"}
curl -X POST "https://$VLLM_ENDPOINT/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "'"$VLLM_MODEL"'",
        "messages": [
          { "role": "system", "content": "You are a helpful AI assistant." },
          { "role": "user", "content": "Explain the benefits of GPU acceleration for LLM inference." }
        ],
        "max_tokens": 500,
        "temperature": 0.7
      }'
```

The response is returned as JSON, with the generated text in `choices[0].message.content`.
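
To read only the generated text, you can filter the response with `jq`, which this guide already uses. The sketch below assumes the OpenAI-compatible response shape that vLLM returns for chat completions; `extract_reply` is a hypothetical helper name.

```bash
# Hypothetical helper: read a chat completion response on stdin and print
# the assistant's reply, followed by the reason generation stopped.
# A finish_reason of "length" means the reply hit the max_tokens limit.
extract_reply() {
  jq -r '.choices[0] | .message.content, "finish_reason: \(.finish_reason)"'
}
```

To use it, add `-s` to the `curl` command above to hide the progress meter, then pipe the output into `extract_reply`.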

## What's next

Your vLLM inference service is now deployed and running! In the next step, you'll [monitor performance and test autoscaling](/products/cks/tutorials/deploy-vllm-inference/4-monitor-and-test).

The initial model download can take 10-30 minutes depending on the model size and network conditions. You can monitor the download progress in the pod logs:

```bash theme={"system"}
kubectl logs -n inference -l app=basic-inference -f
```

General debugging commands for this guide:

```bash theme={"system"}
# Check vLLM pod logs
kubectl logs -n inference -l app=basic-inference

# Describe pod for events
kubectl describe pod -n inference -l app=basic-inference

# Check ingress status
kubectl describe ingress basic-inference -n inference

# Monitor resource usage
kubectl top pods -n inference
```
