Deploy vLLM inference service
Deploy and configure your vLLM inference service with model caching and autoscaling
Overview
Now that your infrastructure and monitoring are set up, you'll deploy the vLLM inference service. This step covers configuring your deployment, deploying the service, and verifying it's working correctly.
Step 1: Configure your deployment
Choose from the example configurations in the hack/ directory, or customize values.yaml for your specific model and requirements.
For this guide, we'll deploy Llama 3.1 8B Instruct.
Navigate to the inference/basic directory and run the following command:
$ cp hack/values-llama-small.yaml my-values.yaml
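If you want to see the other example configurations, or inspect the chart's default values before overriding them, the following commands may help (run from the inference/basic directory):
# List the example value files shipped with the chart
$ ls hack/
# Preview the chart's default values
$ helm show values ./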
Step 2: Update cluster-specific settings
Edit my-values.yaml and update the following required fields:
ingress:
  clusterName: "your-cluster-name"   # Replace with your CKS cluster name
  orgID: "your-org-id"               # Replace with your organization ID
- orgID: You can get your orgID on the CoreWeave Console settings page.
- clusterName: You can get your cluster name on the CoreWeave Console Cluster page.
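Before continuing, it can be worth confirming that kubectl is pointed at the intended CKS cluster. Note that the context name shown here may not match clusterName exactly, so treat this as a sanity check rather than the authoritative value:
# Confirm kubectl is targeting the cluster you plan to deploy to
$ kubectl config current-context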
Step 3: Deploy the vLLM service
Install the vLLM inference chart:
$ helm install basic-inference ./ \
    --namespace inference \
    --create-namespace \
    -f my-values.yaml
You should see output similar to the following:
NAME: basic-inference
LAST DEPLOYED: Mon Aug 18:10:27
NAMESPACE: inference
STATUS: deployed
REVISION: 1
TEST SUITE: None
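You can also confirm the release state at any time with Helm itself, and re-apply any later changes to your values file with an upgrade:
# Show release status and list releases in the namespace
$ helm status basic-inference -n inference
$ helm list -n inference
# Re-apply changes after editing my-values.yaml
$ helm upgrade basic-inference ./ -n inference -f my-values.yaml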
Step 4: Monitor deployment progress
Watch the pod status by running the following command:
$ kubectl get pods -n inference -w
The initial deployment may take several minutes as the model weights are downloaded and cached.
You should see output similar to:
NAME                               READY   STATUS    RESTARTS   AGE
basic-inference-577c5675c8-vm2nc   1/1     Running   0          13m
Once the pod status shows Running, press Ctrl+C to exit the watch and go to the next step.
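If you prefer a command that blocks until the pod is ready (for example, in a script), kubectl wait can replace the interactive watch. The label selector below matches the one used by the log commands later in this guide, and the long timeout allows for the initial model download; the pod must already exist before the command is run:
# Block until the vLLM pod reports Ready
$ kubectl wait pod -l app=basic-inference -n inference --for=condition=Ready --timeout=30m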
Model download failures:
- Ensure internet connectivity from worker nodes
- Check Hugging Face token for gated models
- Verify sufficient storage in model cache PVC
Pods stuck in Pending state:
- Check GPU node availability:
kubectl get nodes -l node-role.kubernetes.io/worker=true
- Verify resource requests don't exceed node capacity
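The commands below cover the checks above; secret and resource names depend on your values file, so treat them as starting points:
# List PVCs in the namespace and their capacity (look for the model cache volume)
$ kubectl get pvc -n inference
# List secrets and confirm your Hugging Face token secret exists (name depends on your values file)
$ kubectl get secrets -n inference
# Compare requested vs. allocatable resources on GPU worker nodes
$ kubectl describe nodes -l node-role.kubernetes.io/worker=true | grep -A 8 "Allocated resources"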
Step 5: Check service and ingress
Verify that the service and ingress are properly configured:
$ kubectl get svc,ingress -n inference
You should see output similar to:
NAME                           TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/basic-inference-vllm   ClusterIP   10.96.123.45   <none>        8000/TCP   5m

NAME                                        CLASS     HOSTS                                           ADDRESS   PORTS     AGE
ingress.networking.k8s.io/basic-inference   traefik   basic-inference.your-cluster.coreweave.cloud    ***.**    80, 443   5m
Ingress not accessible:
- Confirm Traefik is running:
kubectl get pods -n traefik
- Check cert-manager for certificate issues:
kubectl get certificates -n inference
- Verify DNS resolution to your cluster's load balancer
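To check DNS resolution for the ingress hostname and find the load balancer it should resolve to, something like the following works; the Traefik service name and namespace may differ in your installation:
# Resolve the ingress hostname
$ nslookup "$(kubectl get ingress basic-inference -n inference -o=jsonpath='{.spec.rules[0].host}')"
# Find the external IP of the Traefik load balancer service
$ kubectl get svc -n traefik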
Step 6: Access and test your inference service
Get the service endpoint
Retrieve the external URL for your vLLM service:
$ export VLLM_ENDPOINT="$(kubectl get ingress basic-inference -n inference -o=jsonpath='{.spec.rules[0].host}')"
$ echo "Your vLLM endpoint: https://$VLLM_ENDPOINT"
Test service health
Verify the service is responding:
$ curl -s -o /dev/null -w "%{http_code}" https://$VLLM_ENDPOINT/health
You should see 200.
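If you're scripting this check, a small wrapper that surfaces a non-200 status can be handy:
# Capture the status code and report anything other than 200
$ STATUS="$(curl -s -o /dev/null -w '%{http_code}' https://$VLLM_ENDPOINT/health)"
$ [ "$STATUS" = "200" ] && echo "Service healthy" || echo "Unexpected status: $STATUS"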
Get available models
List the loaded models:
$ export VLLM_MODEL="$(curl -s https://$VLLM_ENDPOINT/v1/models | jq -r '.data[].id')"
$ echo "Available model: $VLLM_MODEL"
You should see the following output:
Available model: meta-llama/Llama-3.1-8B-Instruct
Run inference
Test the model with a simple chat completion:
$ curl -X POST "https://$VLLM_ENDPOINT/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "'"$VLLM_MODEL"'",
      "messages": [
        { "role": "system", "content": "You are a helpful AI assistant." },
        { "role": "user", "content": "Explain the benefits of GPU acceleration for LLM inference." }
      ],
      "max_tokens": 500,
      "temperature": 0.7
    }'
You should see the model's response returned as JSON.
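The response follows the OpenAI-compatible chat completion format that vLLM exposes. If you only want the generated text, you can pipe a request through jq, for example:
# Extract just the assistant's reply from the JSON response
$ curl -s -X POST "https://$VLLM_ENDPOINT/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "'"$VLLM_MODEL"'",
      "messages": [{ "role": "user", "content": "Say hello in one sentence." }],
      "max_tokens": 50
    }' | jq -r '.choices[0].message.content'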
What's next
Your vLLM inference service is now deployed and running! In the next step, you'll monitor performance and test autoscaling.
The initial model download can take 10-30 minutes depending on the model size and network conditions. You can monitor the download progress in the pod logs:
$ kubectl logs -n inference -l app=basic-inference -f
General debugging commands for this guide:
# Check vLLM pod logs
kubectl logs -n inference -l app=basic-inference

# Describe pod for events
kubectl describe pod -n inference -l app=basic-inference

# Check ingress status
kubectl describe ingress basic-inference -n inference

# Monitor resource usage
kubectl top pods -n inference