Overview
Now that your infrastructure and monitoring are set up, you’ll deploy the vLLM inference service. This step covers configuring your deployment, deploying the service, and verifying it’s working correctly.Step 1: Configure your deployment
Choose from the example configurations in thehack/ directory, or customize values.yaml for your specific model and requirements.
For this guide, we’ll deploy Llama 3.1 8B Instruct.
Navigate to the inference/basic directory, and run the following command:
Step 2: Update cluster-specific settings
Editmy-values.yaml and update the following required fields:
orgID: You can get yourorgIDon the CoreWeave Console setting page.clusterName: You can get your cluster name on the CoreWeave Console Cluster page.
Step 3: Deploy the vLLM service
Install the vLLM inference chart:Step 4: Monitor deployment progress
Watch the deployment status by running the following command to check pod status:Debugging tipModel download failures:
- Ensure internet connectivity from worker nodes
- Check Hugging Face token for gated models
- Verify sufficient storage in model cache PVC
- Check GPU node availability:
kubectl get nodes -l node-role.kubernetes.io/worker=true - Verify resource requests don’t exceed node capacity
Step 5: Check service and ingress
Verify that the service and ingress are properly configured:Debugging tipIngress not accessible:
- Confirm Traefik is running:
kubectl get pods -n traefik - Check cert-manager for certificate issues:
kubectl get certificates -n inference - Verify DNS resolution to your cluster’s load balancer
Step 6: Access and test your inference service
Get the service endpoint
Retrieve the external URL for your vLLM service:Test service health
Verify the service is responding:200.
Get available models
List the loaded models:Run inference
Test the model with a simple chat completion:JSON.