Overview
Now that your infrastructure and monitoring are set up, you’ll deploy the vLLM inference service to serve large language model completions from your CKS cluster. This step covers how to configure your deployment, deploy the service, and verify it’s working correctly.
Step 1: Configure your deployment
Choose from the example configurations in the hack/ directory, or customize values.yaml for your specific model and requirements.
This guide deploys Llama 3.1 8B Instruct.
Navigate to the inference/basic directory, and run the following command to create a working copy of the example values file:
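The exact command depends on which example file you chose, so the following is only a sketch. `copy_values` is a hypothetical helper, not part of the chart, and the path you pass it is a placeholder for your chosen example file. It copies that file to my-values.yaml and refuses to overwrite an existing working copy.

```shell
# Hypothetical helper: copy a chosen example values file to a working copy.
# $1 = path to the example file (placeholder; depends on your choice)
# $2 = destination, defaults to my-values.yaml
copy_values() {
  src="$1"
  dest="${2:-my-values.yaml}"
  if [ ! -f "$src" ]; then
    echo "example file not found: $src" >&2
    return 1
  fi
  if [ -e "$dest" ]; then
    echo "refusing to overwrite existing $dest" >&2
    return 1
  fi
  cp -- "$src" "$dest"
}
```

Call it as `copy_values <path-to-example-values-file>` from the inference/basic directory.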
Step 2: Update cluster-specific settings
These fields tell the chart which CoreWeave cluster and organization the deployment belongs to, so the Ingress hostname is generated correctly. Edit my-values.yaml and update the following required fields:
Replace [CLUSTER-NAME] with your CKS cluster name and [ORG-ID] with your organization ID.
- orgID: You can get your orgID on the CoreWeave Console settings page.
- clusterName: You can get your cluster name on the CoreWeave Console Cluster page.
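Before deploying, it can help to confirm that neither placeholder is left in the file. A minimal sketch, assuming the placeholders appear literally as [CLUSTER-NAME] and [ORG-ID]; `check_placeholders` is a hypothetical helper name.

```shell
# Report any template placeholders still present in the values file.
# Returns 0 when none remain, 1 when one is found or the file is missing.
check_placeholders() {
  file="${1:-my-values.yaml}"
  if [ ! -f "$file" ]; then
    echo "values file not found: $file" >&2
    return 1
  fi
  # -F matches the bracketed strings literally; -n prints line numbers.
  if grep -n -F -e '[CLUSTER-NAME]' -e '[ORG-ID]' "$file"; then
    echo "placeholders still present in $file" >&2
    return 1
  fi
  return 0
}
```

Running `check_placeholders my-values.yaml` prints each offending line with its line number, or prints nothing and succeeds when both fields are filled in.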
Step 3: Deploy the vLLM service
Install the vLLM inference chart:
Step 4: Monitor deployment progress
The Pod must reach Running status before the service can accept requests. To check Pod status, run:
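The status check can be sketched with standard kubectl commands. The namespace and the label selector below are assumptions: substitute the namespace you installed into and the labels your release applies to its Pods. The helper names are hypothetical.

```shell
# Assumption: NAMESPACE is a placeholder for the namespace you deployed into.
NAMESPACE="${NAMESPACE:-default}"

# List Pods with their status and the Node each one is scheduled on.
list_pods() {
  kubectl get pods -n "$NAMESPACE" -o wide
}

# Block until Pods matching a label selector are Ready, or fail on timeout.
# $1 = label selector (placeholder), $2 = timeout, defaults to 15m
wait_for_ready() {
  kubectl wait pod -n "$NAMESPACE" -l "$1" --for=condition=Ready --timeout="${2:-15m}"
}
```

Waiting on the Ready condition is slightly stricter than Running: the Pod must be running and its containers must report ready. Large models can take a while to download, so a generous timeout is a reasonable default.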
Debugging tip
If model downloads fail, check the following:
- Ensure internet connectivity from worker Nodes.
- Check Hugging Face token for gated models.
- Verify sufficient storage in the model cache PersistentVolumeClaim (PVC).
- Check GPU Node availability.
- Verify resource requests don’t exceed Node capacity.
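The checks above map onto a few standard kubectl commands. This is a sketch, not the chart's own tooling: the helper names are hypothetical, NAMESPACE is a placeholder, and the GPU column assumes your Nodes advertise GPUs under the nvidia.com/gpu resource name used by the NVIDIA device plugin.

```shell
# Assumption: NAMESPACE is a placeholder for the namespace you deployed into.
NAMESPACE="${NAMESPACE:-default}"

# Show events for one Pod: scheduling failures, image pulls, volume mounts.
describe_pod() {
  kubectl describe pod "$1" -n "$NAMESPACE"
}

# Show recent container logs, where download or token errors may surface.
# $1 = Pod name, $2 = number of lines, defaults to 100
pod_logs() {
  kubectl logs "$1" -n "$NAMESPACE" --tail="${2:-100}"
}

# Check that the model cache PVC is Bound and see its requested capacity.
list_pvcs() {
  kubectl get pvc -n "$NAMESPACE"
}

# List allocatable GPUs per Node; the dot in nvidia.com is escaped so that
# kubectl treats the whole resource name as a single key.
gpu_capacity() {
  kubectl get nodes -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

If a Pod stays Pending, the Events section of `describe_pod` output usually states whether the cause is insufficient GPUs or an unbound PVC.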
Step 5: Check Service and Ingress
Verify that the Service and Ingress are properly configured:
Debugging tip
If the Ingress is not accessible, check the following:
- Confirm Traefik is running.
- Check cert-manager for certificate issues.
- Verify DNS resolution to your cluster’s load balancer.
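The Service and Ingress check and the three debugging items can be sketched as follows. The helper names are hypothetical and NAMESPACE is a placeholder. The Traefik lookup searches every namespace because the namespace it runs in is not stated here, and the certificate listing assumes the cert-manager CRDs are installed.

```shell
# Assumption: NAMESPACE is a placeholder for the namespace you deployed into.
NAMESPACE="${NAMESPACE:-default}"

# List the Service and Ingress objects in the deployment namespace.
list_service_and_ingress() {
  kubectl get service,ingress -n "$NAMESPACE"
}

# Find Traefik Pods in any namespace; each should show Running.
find_traefik() {
  kubectl get pods --all-namespaces | grep -i traefik
}

# List cert-manager Certificate resources; READY should be True.
list_certificates() {
  kubectl get certificates.cert-manager.io -n "$NAMESPACE"
}

# Resolve the Ingress hostname to confirm DNS points at the load balancer.
# $1 = hostname shown in the Ingress HOSTS column
resolve_host() {
  nslookup "$1"
}
```

`find_traefik` returns a non-zero exit status when no Pod name contains "traefik", which makes it usable in a script as a quick presence check.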
Step 6: Access and test your inference service
With the Service and Ingress confirmed, you can reach the model from outside the cluster and send it a request to confirm end-to-end functionality.
Get the service endpoint
Retrieve the external URL for your vLLM service:
Test service health
Verify the service is responding. A healthy service returns HTTP status 200.
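The endpoint lookup and health check can be sketched as follows. Several details are assumptions: the helper names are hypothetical, NAMESPACE is a placeholder, the lookup takes the first host rule of the first Ingress in that namespace, and the health check assumes the Ingress serves HTTPS and forwards /health to the vLLM server.

```shell
# Assumption: NAMESPACE is a placeholder for the namespace you deployed into.
NAMESPACE="${NAMESPACE:-default}"

# Print the hostname of the first rule on the first Ingress in the namespace.
ingress_host() {
  kubectl get ingress -n "$NAMESPACE" -o jsonpath='{.items[0].spec.rules[0].host}'
}

# Print only the HTTP status code returned by the health endpoint.
# -s silences progress output, -o /dev/null discards the response body.
health_status() {
  curl -s -o /dev/null -w '%{http_code}' "https://$1/health"
}

# Succeed only when the health endpoint returns 200.
is_healthy() {
  [ "$(health_status "$1")" = "200" ]
}
```

A typical use is `is_healthy "$(ingress_host)" && echo ready`. If more than one Ingress exists in the namespace, pass the hostname explicitly instead of relying on `ingress_host`.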