Documentation Index
Fetch the complete documentation index at: https://docs.coreweave.com/llms.txt
Use this file to discover all available pages before exploring further.
Overview
Now that your vLLM inference service is running, you’ll learn how to monitor its performance using Grafana dashboards and test the autoscaling capabilities. This step covers accessing metrics, setting up alerts, and validating autoscaling behavior.Step 1: Access Grafana dashboard
Get your Grafana URL:- Username:
admin - Password: (You generated this in the Get Grafana credentials section.)
Step 2: View vLLM metrics
In Grafana:- Go to Dashboards → Browse
- Select the vLLM Dashboard
- Monitor key metrics like:
- Token throughput
- Cache utilization
- Queue depth
Congratulations
You’ve deployed an open-source LLM on CKS with observability for monitoring. The remainder of the steps on this page are optional. To free up your resources after completing this guide, be sure to complete the Cleanup steps.Step 3: Set up alerts (optional)
Configure Grafana alerts for important metrics like:- High error rates
- Elevated response times
- GPU memory usage
- Service availability
Step 4: Test autoscaling (optional)
If you installed KEDA, test the autoscaling behavior.To scale on different metrics, you can create a custom
ScaledObject. For information about ScaledObject, see the KEDA ScaledObject specification.You can also change the autoscaling threshold in the helm values overrides: vllm.workload.deployment.autoscale.cacheUtilizationThreshold.Generate load
Use the included load testing script. Navigate to theinference/basic/hack/tests directory and run the following command:
Note that it takes a few minutes to run.
results.json, you should see something similar to the following:
Monitor scaling
Watch the Pods scale up based on demand:
Step 5: Using your inference service (optional)
Python OpenAI client
Use the standard OpenAI Python library:base_url and model variables:
base_url=f"https://basic-inference.cw**-infer.coreweave.app/v1"model="meta-llama/Llama-3.1-8B-Instruct"
Streaming responses
For real-time applications, use streaming:Step 6: Advanced configuration (optional)
Multi-node deployment
For larger models, enable multi-node deployment by setting:Custom model configuration
To use a different model, update your values file:Resource tuning
Adjust resource requests and limits based on your model size:Cleanup
To remove the deployment and free up resources:Next steps
Now that you have a working vLLM deployment, consider:- Adding authentication: Implement API key validation or OAuth
- Model routing: Deploy multiple models with intelligent traffic routing
- Cost optimization: Set up Node autoscaling and instance usage
- Advanced monitoring: Configure custom alerts and dashboards
- CI/CD integration: Automate deployments with GitOps workflows