Configure monitoring and observability
Set up Prometheus and Grafana for monitoring your vLLM inference deployment
Overview
Monitoring and observability are crucial for production inference workloads. This step sets up Prometheus for metrics collection and Grafana for visualization, giving you insights into your vLLM deployment's performance.
The monitoring stack provides comprehensive metrics for:
- Request throughput and latency
- GPU utilization and memory usage
- KV cache performance
- Queue depth and autoscaling metrics
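Once the stack is running, you can explore these signals in Grafana or the Prometheus UI with queries like the following sketch (metric names follow vLLM's Prometheus naming and may vary slightly between vLLM versions):

```promql
# Queue depth: requests currently running vs. waiting
sum(vllm:num_requests_running)
sum(vllm:num_requests_waiting)

# KV cache utilization (fraction of GPU cache blocks in use)
avg(vllm:gpu_cache_usage_perc)

# p95 end-to-end request latency over the last 5 minutes
histogram_quantile(0.95, sum(rate(vllm:e2e_request_latency_seconds_bucket[5m])) by (le))
```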
The monitoring stack requires additional cluster resources. Ensure your CKS cluster has at least one CPU node for Prometheus and Grafana to deploy to.
Step 1: Install Prometheus and Grafana
Clone the reference architecture repository:
```bash
$ git clone https://github.com/coreweave/reference-architecture.git
```
Navigate to the `hack` folder:

```bash
$ cd reference-architecture/observability/basic/hack
```
Get your organization ID and cluster name from the CoreWeave Cloud Console, then update the `values.yaml` file, replacing the `orgID`, `clusterName`, and `hosts` values with your own:

- `orgID`: found on the CoreWeave Console settings page.
- `clusterName`: found on the CoreWeave Console Cluster page.

You will need to add your information to the following sections:
```yaml
orgID: cw0000 # REPLACE WITH YOUR ACTUAL ORGID
clusterName: inference # REPLACE WITH YOUR ACTUAL CLUSTER NAME

grafana:
  enabled: true
  grafana:
    ingress:
      hosts: [&host "grafana.cw0000-inference.coreweave.app"] # REPLACE WITH YOUR ACTUAL GRAFANA HOSTNAME, USING YOUR ORGID AND CLUSTERNAME
      tls:
        - secretName: grafana-tls
          hosts:
            - *host
```
For example, if your `orgID` is `cw99` and your cluster name is `my-inference-cluster`, the `values.yaml` would look like the following:

```yaml
orgID: cw99
clusterName: my-inference-cluster

grafana:
  enabled: true
  grafana:
    ingress:
      hosts: [&host "grafana.cw99-my-inference-cluster.coreweave.app"]
      tls:
        - secretName: grafana-tls
          hosts:
            - *host
```
Note that depending on when your cluster was created, you might need to comment out the rest of the file. Since our example cluster was created after 2025-07-04, the `values.yaml` looks like the following:

```yaml
orgID: cw99
clusterName: inference-guide

grafana:
  enabled: true
  grafana:
    ingress:
      hosts: [&host "grafana.cw99-inference-guide.coreweave.app"]
      tls:
        - secretName: grafana-tls
          hosts:
            - *host

# If your cluster was created BEFORE 2025-07-04, you MUST use the values below.
# If your cluster was created AFTER 2025-07-04, you can comment these values out.
#prometheus:
#  prometheusOperator:
#    enabled: false
#  defaultRules:
#    create: false
#  prometheus:
#    # Can also add agent mode if you only want to forward metrics
#    prometheusSpec:
#      # remoteWrite: FILL IF NEEDED
#      image:
#        registry: quay.io
#        repository: prometheus/prometheus
#        tag: v2.54.0
#      version: 2.54.0
```
Run the following `helm` commands:

```bash
$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo add grafana https://grafana.github.io/helm-charts
$ helm repo update
```
You should see output similar to the following:
```
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "coreweave" chart repository
...Successfully got an update from the "grafana" chart repository
...Successfully got an update from the "prometheus-community" chart repository
Update Complete. ⎈Happy Helming!⎈
```
Navigate back to the `observability/basic` directory and build the chart dependencies for the monitoring stack:

```bash
$ helm dependency build
```
You should see something similar to the following:
```
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "coreweave" chart repository
...Successfully got an update from the "grafana" chart repository
...Successfully got an update from the "prometheus-community" chart repository
Update Complete. ⎈Happy Helming!⎈
Saving 2 charts
Downloading grafana from repo https://charts.core-services.ingress.coreweave.com
Downloading kube-prometheus-stack from repo https://prometheus-community.github.io/helm-charts
Deleting outdated charts
```
Run the following `helm` command:

```bash
$ helm install observability ./ \
    --namespace monitoring \
    --create-namespace \
    --values ./hack/values.yaml
```
You should see output similar to the following:
```
NAME: observability
LAST DEPLOYED: Mon Aug 17:46:02
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
```
Step 2: Verify monitoring deployment
Run the following command:
```bash
$ kubectl get pods -n monitoring
```
You should see output similar to:
```
NAME                                                 READY   STATUS    RESTARTS   AGE
observability-grafana-9868b99df-vsfxl                1/2     Running   0          40s
observability-prometheus-operator-65d54479b7-b62zd   1/1     Running   0          40s
prometheus-observability-prometheus-prometheus-0     2/2     Running   0          37s
```
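The Grafana pod may briefly show `1/2` READY while its dashboard sidecar starts. If you want to block until everything is ready, one option is:

```bash
$ kubectl wait --for=condition=Ready pods --all -n monitoring --timeout=300s
```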
Step 3: Get Grafana credentials
Retrieve the auto-generated Grafana admin password:
```bash
$ kubectl get secret observability-grafana -n monitoring \
    -o=jsonpath='{.data.admin-password}' | base64 --decode; echo
```
Save this password for accessing the Grafana dashboard.
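The ingress makes Grafana available at the `grafana.<orgID>-<clusterName>.coreweave.app` hostname you configured. If you want to check the dashboard before DNS resolves, a port-forward works too (assuming the chart's default service name, `observability-grafana`, and the default `admin` username):

```bash
# Forward local port 3000 to the Grafana service inside the cluster
$ kubectl port-forward svc/observability-grafana -n monitoring 3000:80
```

Then browse to http://localhost:3000 and log in as `admin` with the password you just retrieved.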
Step 4: Create model cache storage
Navigate to the `inference/basic` directory:

```bash
$ cd ../../inference/basic
```
Create the inference namespace:
```bash
$ kubectl create namespace inference
```
Create the model cache PVC:

```bash
$ kubectl apply -f hack/huggingface-model-cache.yaml
```
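This manifest provisions a PersistentVolumeClaim that inference pods mount to cache downloaded model weights, so weights only need to be fetched from Hugging Face once. The shape is roughly the following sketch; the actual name, storage class, and size come from `hack/huggingface-model-cache.yaml`:

```yaml
# Illustrative only -- use the values from hack/huggingface-model-cache.yaml.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: huggingface-model-cache   # hypothetical name
  namespace: inference
spec:
  accessModes:
    - ReadWriteMany               # shared by all vLLM replicas
  resources:
    requests:
      storage: 100Gi              # sized to fit your model weights
```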
Step 5: Set up Hugging Face authentication (if needed)
For models that require authentication, such as Llama 3.1 8B Instruct, create a secret with your Hugging Face token:

```bash
$ export HF_TOKEN="your-huggingface-token-here"
$ kubectl create secret generic hf-token \
    -n inference \
    --from-literal=token="$HF_TOKEN"
```
You should see output similar to the following:
```
secret/hf-token created
```
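You can confirm the secret exists before wiring it into the vLLM deployment:

```bash
$ kubectl get secret hf-token -n inference
```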
Step 6: Create Grafana dashboard for vLLM
Add the vLLM monitoring dashboard to Grafana:
```bash
$ kubectl apply -f hack/manifests-grafana.yaml -n inference
```
You should see output similar to the following:
```
configmap/vllm created
```
This creates a ConfigMap that Grafana will automatically detect and load as a dashboard.
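With kube-prometheus-stack, the Grafana sidecar watches for ConfigMaps carrying a `grafana_dashboard` label and loads their JSON automatically. The applied manifest is shaped roughly like this sketch (the real dashboard JSON lives in `hack/manifests-grafana.yaml`):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm
  labels:
    grafana_dashboard: "1"   # label the Grafana sidecar watches for (chart default)
data:
  vllm-dashboard.json: |
    { ... dashboard panels for throughput, latency, and KV cache ... }
```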
Step 7: Install autoscaling support (optional)
For production workloads, install KEDA to enable automatic scaling based on demand by running the following `helm` commands:

```bash
$ helm repo add kedacore https://kedacore.github.io/charts
$ helm repo update
$ helm install keda kedacore/keda \
    --namespace keda \
    --create-namespace
```
Verify KEDA is running:
```bash
$ kubectl get pods -n keda
```
You should see output similar to the following:
```
NAME                                              READY   STATUS    RESTARTS        AGE
keda-admission-webhooks-7fc99cdd4d-vbkx2          1/1     Running   0               6m48s
keda-operator-54ffcbbfd6-fmhxw                    1/1     Running   1 (6m46s ago)   6m48s
keda-operator-metrics-apiserver-c5b6f8b88-pzjjv   1/1     Running   0               6m48s
```
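KEDA does not scale anything until you define a `ScaledObject`. As a sketch of where this is headed, the following hypothetical example would scale a vLLM Deployment named `vllm` (created in the next step) on the queue-depth metric collected by Prometheus; the Deployment name, Prometheus service address, metric name, and threshold are assumptions to adapt to your deployment:

```yaml
# Illustrative only -- adjust names, query, and threshold for your deployment.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
  namespace: inference
spec:
  scaleTargetRef:
    name: vllm                 # hypothetical vLLM Deployment name
  minReplicaCount: 1
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        # prometheus-operated is the headless service the Prometheus Operator creates
        serverAddress: http://prometheus-operated.monitoring.svc:9090
        query: sum(vllm:num_requests_waiting)   # queued requests across replicas
        threshold: "10"                         # target ~10 waiting requests per replica
```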
What's next
Your monitoring and observability stack is now configured! In the next step, you'll deploy the vLLM inference service.