Deploy vLLM for Inference

Deploy and scale vLLM inference workloads on CoreWeave Kubernetes Service (CKS)

Outline

This long-form tutorial consists of the pages under this section. They are designed to be followed in the order they are numbered.

In this tutorial, you will:

  1. Set up infrastructure dependencies.
  2. Configure monitoring and observability.
  3. Deploy the vLLM inference service.
  4. Monitor performance and test autoscaling.

Architecture overview

The complete vLLM inference solution consists of several components working together:

  • vLLM service: The main inference engine running your language model
  • Traefik ingress: Handles external traffic routing and TLS termination
  • cert-manager: Manages automatic SSL certificate generation and renewal
  • Prometheus: Collects metrics from vLLM and other components
  • Grafana: Provides dashboards for monitoring inference performance
  • KEDA: Enables autoscaling based on custom metrics like request queue depth
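
If you want a quick picture of which of these components are already running in your cluster (they are installed in later pages of this tutorial), you can check their workloads namespace by namespace. The namespace names below (traefik, cert-manager, monitoring, keda, and vllm) are assumptions for illustration; substitute the namespaces you actually use.

    $ kubectl get pods -n traefik        # Traefik ingress controller
    $ kubectl get pods -n cert-manager   # cert-manager controllers and webhook
    $ kubectl get pods -n monitoring     # Prometheus and Grafana
    $ kubectl get pods -n keda           # KEDA operator and metrics server
    $ kubectl get pods -n vllm           # vLLM inference Deployment and Service

Each command should list one or more pods in the Running state once the corresponding component has been deployed.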

Prerequisites

Verify the following:

  • You can access your cluster using kubectl.

    For example, run the following command:

    $ kubectl cluster-info

    You should see something similar to the following:

    Kubernetes control plane is running at...
    CoreDNS is running at...
    node-local-dns is running at...
  • Your cluster has at least one CPU node.

    For example, run the following command:

    $ kubectl get nodes -o=custom-columns="NAME:metadata.name,CLASS:metadata.labels['node\.coreweave\.cloud\/class']"

    You should see something similar to the following:

    NAME      CLASS
    g137a10   gpu
    g5424e0   cpu
    g77575e   cpu
    gd926d4   gpu
  • Your CKS cluster has GPU nodes with at least 16 GB of GPU memory, which is required by the Llama 3.1 8B Instruct model used in this tutorial.

    For example, run the following command:

    $ kubectl get nodes -o=custom-columns="NAME:metadata.name,IP:status.addresses[?(@.type=='InternalIP')].address,TYPE:metadata.labels['node\.coreweave\.cloud\/type'],RESERVED:metadata.labels['node\.coreweave\.cloud\/reserved'],NODEPOOL:metadata.labels['compute\.coreweave\.com\/node-pool'],READY:status.conditions[?(@.type=='Ready')].status,GPU:metadata.labels['gpu\.nvidia\.com/model'],VRAM:metadata.labels['gpu\.nvidia\.com/vram']"

    You should see something similar to the following:

    NAME      IP               TYPE         RESERVED   NODEPOOL    READY   GPU           VRAM
    g80eac0   10.176.212.195   gd-1xgh200   cw9a2f     infer-gpu   True    GH200_480GB   97
    gf2809a   10.176.244.33    turin-gp-l   cw9a2f     infer-cpu   True

    Under VRAM, the number should be 16 or greater. To filter for qualifying nodes programmatically, see the sketch after this list.

    Tip

    To further debug and diagnose cluster problems, use kubectl cluster-info dump.
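
If your cluster mixes several GPU node types, you can filter for nodes with enough GPU memory instead of scanning the table by eye. The following is a minimal sketch that assumes every GPU node carries the gpu.nvidia.com/vram label (in GB, as shown in the output above) and that jq is installed on your workstation:

    $ kubectl get nodes -o json | jq -r \
        '.items[] | select((.metadata.labels["gpu.nvidia.com/vram"] // "0") | tonumber >= 16) | .metadata.name'

This prints only the names of nodes reporting 16 GB of GPU memory or more; an empty result means no node currently meets the requirement.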

Additional resources and information

The following tools are preinstalled on CKS worker nodes:

  • Docker: Container runtime for running vLLM inference pods
  • NVIDIA drivers: GPU drivers for CUDA acceleration
  • CoreWeave CSI drivers: Storage drivers for persistent volumes
  • CoreWeave CNI: Network plugins for pod communication
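
As a quick sanity check that the GPU drivers and device plugin are working end to end, you can confirm that each GPU node advertises allocatable nvidia.com/gpu resources to the scheduler. This is a minimal sketch that assumes the standard NVIDIA resource name and reuses the custom-columns style from the commands above:

    $ kubectl get nodes -o=custom-columns="NAME:metadata.name,GPUS:status.allocatable['nvidia\.com/gpu']"

GPU nodes should report a non-zero count; CPU-only nodes show no value.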

To learn more about vLLM and inference on CoreWeave, check out the following resources: