Deploy vLLM inference service
Deploy and configure your vLLM inference service with model caching and autoscaling
Overview
Now that your infrastructure and monitoring are set up, you'll deploy the vLLM inference service. This step covers configuring your deployment, deploying the service, and verifying it's working correctly.
Step 1: Configure your deployment
Choose from the example configurations in the hack/ directory, or customize values.yaml for your specific model and requirements.
For this guide, we'll deploy Llama 3.1 8B Instruct.
Navigate to the inference/basic directory and run the following command:
$ cp hack/values-llama-small.yaml my-values.yaml
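If you want to see the other example configurations, or inspect the chart's default values before overriding them, the following commands may help (run from the inference/basic directory):
# List the example value files shipped with the chart
$ ls hack/
# Preview the chart's default values
$ helm show values ./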
Step 2: Update cluster-specific settings
Edit my-values.yaml and update the following required fields:
ingress:
  clusterName: "your-cluster-name"   # Replace with your CKS cluster name
  orgID: "your-org-id"               # Replace with your organization ID
- orgID: You can get your orgID on the CoreWeave Console settings page.
- clusterName: You can get your cluster name on the CoreWeave Console Cluster page.
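Before continuing, it can be worth confirming that kubectl is pointed at the intended CKS cluster. Note that the context name shown here may not match clusterName exactly, so treat this as a sanity check rather than the authoritative value:
# Confirm kubectl is targeting the cluster you plan to deploy to
$ kubectl config current-context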
Step 3: Deploy the vLLM service
Install the vLLM inference chart:
$ helm install basic-inference ./ \
    --namespace inference \
    --create-namespace \
    -f my-values.yaml
You should see output similar to the following:
NAME: basic-inference
LAST DEPLOYED: Mon Aug 18:10:27
NAMESPACE: inference
STATUS: deployed
REVISION: 1
TEST SUITE: None
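You can also confirm the release state at any time with Helm itself, and re-apply any later changes to your values file with an upgrade:
# Show release status and list releases in the namespace
$ helm status basic-inference -n inference
$ helm list -n inference
# Re-apply changes after editing my-values.yaml
$ helm upgrade basic-inference ./ -n inference -f my-values.yaml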
Step 4: Monitor deployment progress
Watch the pod status by running the following command:
$ kubectl get pods -n inference -w
The initial deployment may take several minutes as the model weights are downloaded and cached.
You should see output similar to:
NAME                               READY   STATUS    RESTARTS   AGE
basic-inference-577c5675c8-vm2nc   1/1     Running   0          13m
Once the pod status shows Running, press Ctrl+C to exit the watch and go to the next step.
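If you prefer a command that blocks until the pod is ready (for example, in a script), kubectl wait can replace the interactive watch. The label selector below matches the one used by the log commands later in this guide, and the long timeout allows for the initial model download; the pod must already exist before the command is run:
# Block until the vLLM pod reports Ready
$ kubectl wait pod -l app=basic-inference -n inference --for=condition=Ready --timeout=30m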
Model download failures:
- Ensure internet connectivity from worker nodes
- Check Hugging Face token for gated models
- Verify sufficient storage in model cache PVC
Pods stuck in Pending state:
- Check GPU node availability:
kubectl get nodes -l node-role.kubernetes.io/worker=true
- Verify resource requests don't exceed node capacity
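The commands below cover the checks above; secret and resource names depend on your values file, so treat them as starting points:
# List PVCs in the namespace and their capacity (look for the model cache volume)
$ kubectl get pvc -n inference
# List secrets and confirm your Hugging Face token secret exists (name depends on your values file)
$ kubectl get secrets -n inference
# Compare requested vs. allocatable resources on GPU worker nodes
$ kubectl describe nodes -l node-role.kubernetes.io/worker=true | grep -A 8 "Allocated resources"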
Step 5: Check service and ingress
Verify that the service and ingress are properly configured:
$ kubectl get svc,ingress -n inference
You should see output similar to:
NAME                           TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/basic-inference-vllm   ClusterIP   10.96.123.45   <none>        8000/TCP   5m

NAME                                        CLASS     HOSTS                                           ADDRESS   PORTS     AGE
ingress.networking.k8s.io/basic-inference   traefik   basic-inference.your-cluster.coreweave.cloud    ***.**    80, 443   5m
Ingress not accessible:
- Confirm Traefik is running:
kubectl get pods -n traefik
- Check cert-manager for certificate issues:
kubectl get certificates -n inference
- Verify DNS resolution to your cluster's load balancer
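To check DNS resolution for the ingress hostname and find the load balancer it should resolve to, something like the following works; the Traefik service name and namespace may differ in your installation:
# Resolve the ingress hostname
$ nslookup "$(kubectl get ingress basic-inference -n inference -o=jsonpath='{.spec.rules[0].host}')"
# Find the external IP of the Traefik load balancer service
$ kubectl get svc -n traefik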
Step 6: Access and test your inference service
Get the service endpoint
Retrieve the external URL for your vLLM service:
$ export VLLM_ENDPOINT="$(kubectl get ingress basic-inference -n inference -o=jsonpath='{.spec.rules[0].host}')"
$ echo "Your vLLM endpoint: https://$VLLM_ENDPOINT"
Test service health
Verify the service is responding:
$ curl -s -o /dev/null -w "%{http_code}" https://$VLLM_ENDPOINT/health
You should see 200.
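If you're scripting this check, a small wrapper that surfaces a non-200 status can be handy:
# Capture the status code and report anything other than 200
$ STATUS="$(curl -s -o /dev/null -w '%{http_code}' https://$VLLM_ENDPOINT/health)"
$ [ "$STATUS" = "200" ] && echo "Service healthy" || echo "Unexpected status: $STATUS"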
Get available models
List the loaded models:
$ export VLLM_MODEL="$(curl -s https://$VLLM_ENDPOINT/v1/models | jq -r '.data[].id')"
$ echo "Available model: $VLLM_MODEL"
You should see the following output:
Available model: meta-llama/Llama-3.1-8B-Instruct
Run inference
Test the model with a simple chat completion:
$ curl -X POST "https://$VLLM_ENDPOINT/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "'"$VLLM_MODEL"'",
      "messages": [
        { "role": "system", "content": "You are a helpful AI assistant." },
        { "role": "user", "content": "Explain the benefits of GPU acceleration for LLM inference." }
      ],
      "max_tokens": 500,
      "temperature": 0.7
    }'
You should see the model's response returned as JSON.
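The response follows the OpenAI-compatible chat completion format that vLLM exposes. If you only want the generated text, you can pipe a request through jq, for example:
# Extract just the assistant's reply from the JSON response
$ curl -s -X POST "https://$VLLM_ENDPOINT/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "'"$VLLM_MODEL"'",
      "messages": [{ "role": "user", "content": "Say hello in one sentence." }],
      "max_tokens": 50
    }' | jq -r '.choices[0].message.content'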
What's next
Your vLLM inference service is now deployed and running! In the next step, you'll monitor performance and test autoscaling.
The initial model download can take 10-30 minutes depending on the model size and network conditions. You can monitor the download progress in the pod logs:
$ kubectl logs -n inference -l app=basic-inference -f
General debugging commands for this guide:
# Check vLLM pod logs
kubectl logs -n inference -l app=basic-inference

# Describe pod for events
kubectl describe pod -n inference -l app=basic-inference

# Check ingress status
kubectl describe ingress basic-inference -n inference

# Monitor resource usage
kubectl top pods -n inference