Documentation Index
Fetch the complete documentation index at: https://docs.coreweave.com/llms.txt
Use this file to discover all available pages before exploring further.
Overview
Now that your infrastructure and monitoring are set up, you’ll deploy the vLLM inference service. This step covers configuring your deployment, deploying the service, and verifying it’s working correctly.Step 1: Configure your deployment
Choose from the example configurations in thehack/ directory, or customize values.yaml for your specific model and requirements.
For this guide, we’ll deploy Llama 3.1 8B Instruct.
Navigate to the inference/basic directory, and run the following command:
Step 2: Update cluster-specific settings
Editmy-values.yaml and update the following required fields:
orgID: You can get yourorgIDon the CoreWeave Console setting page.clusterName: You can get your cluster name on the CoreWeave Console Cluster page.
Step 3: Deploy the vLLM service
Install the vLLM inference chart:Step 4: Monitor deployment progress
Watch the deployment status by running the following command to check pod status:Debugging tipModel download failures:
- Ensure internet connectivity from worker nodes
- Check Hugging Face token for gated models
- Verify sufficient storage in model cache PVC
- Check GPU node availability:
kubectl get nodes -l node-role.kubernetes.io/worker=true - Verify resource requests don’t exceed node capacity
Step 5: Check service and ingress
Verify that the service and ingress are properly configured:Debugging tipIngress not accessible:
- Confirm Traefik is running:
kubectl get pods -n traefik - Check cert-manager for certificate issues:
kubectl get certificates -n inference - Verify DNS resolution to your cluster’s load balancer
Step 6: Access and test your inference service
Get the service endpoint
Retrieve the external URL for your vLLM service:Test service health
Verify the service is responding:200.
Get available models
List the loaded models:Run inference
Test the model with a simple chat completion:JSON.