- Deploy the Red Hat AI Inference Stack (cert-manager, Istio, LWS operator, and KServe) on your CKS cluster.
- Create and verify the inference gateway for routing requests to models.
- Deploy a hello-world model (GPT-OSS) and send a chat completion inference request.
What you'll need
Before you start, you must have:
- A Red Hat Registry service account or Red Hat pull secret for registry.redhat.io.
- A CoreWeave Kubernetes Service (CKS) cluster with GPU Nodes.
- kubectl installed and configured to access your cluster.
- KUBECONFIG set to an absolute path (required by the deployment scripts).
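To spot-check the GPU Node requirement, a sketch like the following lists the Nodes that advertise GPUs. It assumes the GPUs are exposed under the nvidia.com/gpu resource name (the NVIDIA device plugin convention), which this guide does not state. The function name gpu_nodes is illustrative only.

```shell
# List Nodes that report allocatable GPUs, one "name count" pair per line.
# Assumption: GPUs are advertised as the "nvidia.com/gpu" resource.
gpu_nodes() {
  kubectl get nodes --no-headers \
    -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' |
    awk '$2 != "<none>" && $2 != "0" { print $1, $2 }'
}
```

Calling gpu_nodes prints nothing when no Node reports GPUs under that resource name, which usually means the node pool is not ready or the assumption does not hold for your cluster.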
What you'll use
You’ll use these tools:
- git: To clone the rhaii-on-xks repository.
- make: To deploy components and run validation.
- jq: To copy the Red Hat pull secret into namespaces (used in later steps).
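Before continuing, you can confirm that these tools are on your PATH. The following is a minimal sketch; the helper name missing_tools is hypothetical and not part of the rhaii-on-xks repository.

```shell
# Print the name of every listed tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Prints nothing when git, make, jq, and kubectl are all available.
missing_tools git make jq kubectl
```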
Prerequisites
Before you start, confirm that you’ve completed the following.
Cluster readiness
Your cluster is ready. To verify the available GPU Nodes, list the cluster's Nodes, for example with kubectl get nodes.
Red Hat access token
You need a Red Hat pull secret so the cluster can pull images from registry.redhat.io. To get a Red Hat service account:
- Go to https://access.redhat.com/terms-based-registry/.
- Click New Service Account.
- Create the account and note the username (for example, 12345678|myserviceaccount).
- On the OpenShift Secret tab, download the service account token.
- Convert the service account token to JSON, replacing PULL-SECRET.yaml with your file name, and save the result as auth.json.
- Create the ~/.config/containers directory and copy auth.json into it.
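The conversion and copy steps can be sketched as follows. This is not the guide's original command. It assumes the downloaded file is a Kubernetes Secret manifest of type kubernetes.io/dockerconfigjson whose .dockerconfigjson value sits on a single line, and that your base64 accepts -d (the GNU spelling; some older macOS versions use -D). The function names are illustrative.

```shell
# Decode the .dockerconfigjson entry of a Secret manifest into ./auth.json.
# Assumption: the entry looks like "  .dockerconfigjson: <base64>" on one line.
secret_to_auth_json() {
  awk '$1 == ".dockerconfigjson:" { print $2 }' "$1" | base64 -d > auth.json
}

# Copy auth.json into ~/.config/containers, creating the directory if needed.
install_auth_json() {
  mkdir -p "$HOME/.config/containers"
  cp auth.json "$HOME/.config/containers/auth.json"
}
```

For example, run secret_to_auth_json PULL-SECRET.yaml followed by install_auth_json. The decoded file is typically a JSON object with an auths entry for registry.redhat.io; if it is empty, the manifest layout probably differs from the assumption above.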
KUBECONFIG
Ensure KUBECONFIG is set to an absolute path, because the deployment scripts require it.
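One way to check this is a small shell test like the sketch below. It treats KUBECONFIG as a single path and does not handle colon-separated lists. The $HOME/.kube/config path in the message is only an example, and the function name is hypothetical.

```shell
# Succeeds only when KUBECONFIG is set and begins with "/".
# A literal "~/..." value that the shell never expanded counts as not absolute.
kubeconfig_is_absolute() {
  case "${KUBECONFIG:-}" in
    /*) return 0 ;;
    *)  return 1 ;;
  esac
}

kubeconfig_is_absolute ||
  echo "KUBECONFIG must be an absolute path, for example: export KUBECONFIG=\"\$HOME/.kube/config\""
```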
Clone the repository
Clone the Red Hat AI Inference Stack repository (rhaii-on-xks) and change into its directory.
Deploy prerequisites
Use make to deploy all stack components that llm-d depends on (cert-manager, Istio, LWS operator, and KServe).
After the deployment finishes, run make status and confirm that every component is in the Running state with readiness checks passing.
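As a complement to make status, a kubectl-only sketch such as the following lists pods that are neither Running nor Completed. It looks at every namespace rather than only the stack's components, and a Running pod can still fail its readiness probe, so treat it as a quick filter. The function name is illustrative.

```shell
# Print "namespace/pod status" for each pod that is neither Running nor Completed.
# With -A and --no-headers the columns are NAMESPACE NAME READY STATUS RESTARTS AGE.
pods_not_running() {
  kubectl get pods -A --no-headers |
    awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}
```

Empty output from pods_not_running means no pod is stuck in a state such as Pending or CrashLoopBackOff.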
Create the inference gateway
Deploy the inference gateway so you can route requests to your models. Then verify that the Gateway inference-gateway exists in the opendatahub namespace with an ADDRESS assigned and PROGRAMMED set to True.
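The PROGRAMMED check can also be scripted. The sketch below reads the Programmed condition of the Gateway directly. It assumes the gateway is a Gateway API resource (gateway.networking.k8s.io), and it uses the fully qualified resource name to avoid ambiguity with Istio's own Gateway type. The function name is hypothetical.

```shell
# Print the status of the Gateway's "Programmed" condition ("True" when ready).
gateway_programmed() {
  kubectl get gateways.gateway.networking.k8s.io inference-gateway -n opendatahub \
    -o jsonpath='{.status.conditions[?(@.type=="Programmed")].status}'
}
```

If gateway_programmed prints anything other than True, the gateway has not finished provisioning or has reported an error in its status conditions.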
Deploy and test a sample model
After the gateway is running, you can deploy a model and send inference requests. This section uses the redhat-inference example from the CoreWeave doc-examples repository.
Set up the namespace
Create a namespace for the deployment. This example uses llm-d-rhaii.
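A minimal sketch of this step follows. It creates the namespace idempotently, so re-running it does not fail if the namespace already exists. The helper name ensure_namespace is hypothetical, and the sketch covers only namespace creation.

```shell
# Create the namespace if it is missing; applying the same manifest again is harmless.
ensure_namespace() {
  kubectl create namespace "$1" --dry-run=client -o yaml | kubectl apply -f -
}
```

For this guide you would call ensure_namespace llm-d-rhaii.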
Download and deploy the model
Clone the CoreWeave doc-examples repository, then deploy the model from its redhat-inference directory.
Send an inference request
In a separate terminal, port-forward the inference gateway. Then send a chat completion request to the forwarded port.
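The request itself can be sketched with jq and curl. The body follows the OpenAI-style chat completion format. The /v1/chat/completions path, the base URL, and the model name you pass in are assumptions that depend on how the example exposes the model, so adjust them to match your deployment. Both function names are illustrative.

```shell
# Build an OpenAI-style chat completion request body; jq handles JSON escaping.
chat_payload() {
  jq -n --arg model "$1" --arg prompt "$2" \
    '{model: $model, messages: [{role: "user", content: $prompt}]}'
}

# POST the body to BASE_URL (first argument).
# Assumption: the endpoint path is /v1/chat/completions under that base URL.
send_chat() {
  chat_payload "$2" "$3" |
    curl -sS -H 'Content-Type: application/json' -d @- "$1/v1/chat/completions"
}
```

A call takes the form send_chat BASE_URL MODEL_NAME "Say hello", where BASE_URL points at the port you forwarded and MODEL_NAME is the name the deployed model is served under.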
Next steps
- Deploy an open source LLM on CKS: A full walkthrough of creating a cluster, node pool, and serving a model with Open WebUI.
- Deploy vLLM for inference: Run another inference stack with monitoring, autoscaling, and Prometheus or Grafana.
- Observability overview: Add monitoring, metrics, and logging for your inference workloads.
- Nodes and node pools: Scale GPU capacity or adjust node pool configuration.
- Secrets: Manage image pull secrets and other sensitive configuration in your cluster.