This tutorial shows you how to deploy the Red Hat AI Inference Stack for Kubernetes on CoreWeave Kubernetes Service (CKS). The stack provides GPU-based LLM inference using llm-d, KServe, Istio, and the Gateway API so you can run and serve models such as GPT-OSS on your CKS cluster. In this tutorial, you will:
  • Deploy the Red Hat AI Inference Stack (cert-manager, Istio, LWS operator, and KServe) on your CKS cluster.
  • Create and verify the inference gateway for routing requests to models.
  • Deploy a hello-world model (GPT-OSS) and send a chat completion inference request.

What you'll need

Before you start, you must have:
  • A Red Hat Registry service account or Red Hat pull secret for registry.redhat.io.
  • A CoreWeave Kubernetes Service (CKS) cluster with GPU nodes.
  • kubectl installed and configured to access your cluster.
  • KUBECONFIG set to an absolute path (required by the deployment scripts).

What you'll use

You’ll use these tools:
  • git: To clone the rhaii-on-xks repository.
  • make: To deploy components and run validation.
  • jq: To copy the Red Hat pull secret into namespaces (used in later steps).
  • yq: To extract the pull secret from the downloaded OpenShift Secret file.

Prerequisites

Before you begin the tutorial, confirm that each of the following prerequisites is in place.

Cluster readiness

Confirm that your cluster has GPU nodes available:
kubectl describe nodes | grep -A5 "nvidia.com/gpu"
You should see nvidia.com/gpu entries with a nonzero count in the describe output for one or more nodes.
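If the grep output is hard to scan, a per-node summary can help. The following is a minimal sketch and not part of the original tutorial: list_gpu_capacity is a hypothetical helper name, and it assumes the GPU resource is exposed as nvidia.com/gpu, as in the command above.

```shell
# Print one row per node with its GPU capacity.
# Nodes that expose no nvidia.com/gpu resource show <none> in the GPUS column.
# The backslash escapes the dots inside the resource name for kubectl.
list_gpu_capacity() {
  kubectl get nodes \
    -o 'custom-columns=NAME:.metadata.name,GPUS:.status.capacity.nvidia\.com/gpu'
}
```

Run list_gpu_capacity after defining it in your shell. At least one row should show a number in the GPUS column.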

Red Hat access token

You need a Red Hat pull secret so the cluster can pull images from registry.redhat.io. Get a Red Hat service account by completing the following:
  1. Go to: https://access.redhat.com/terms-based-registry/
  2. Click “New Service Account”.
  3. Create the account and note the username (for example, 12345678|myserviceaccount).
  4. Download the service account token from the OpenShift Secret tab.
  5. Convert the service account token into JSON:
    yq e '.data.".dockerconfigjson"' PULL-SECRET.yaml | base64 -d > auth.json
    
    Replace PULL-SECRET.yaml with your file name. The auth.json file should look like the following:
    {
      "auths": {
        "registry.redhat.io": {
          "auth": "MjAyOTk4MTd8Y29yZXdlY*******"
        }
      }
    }
    
  6. Create the ~/.config/containers directory and copy auth.json into it. Run the copy from the directory where you created auth.json:
    mkdir -p ~/.config/containers
    cp auth.json ~/.config/containers/auth.json
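To sanity-check the converted file, you can decode the auth value and confirm it belongs to the service account you created. This is a sketch under assumptions: registry_username is a hypothetical helper name, and it assumes the auth field is the base64 encoding of username:token, which is the usual layout for container registry auth files.

```shell
# Print the username stored for registry.redhat.io in a containers auth file.
# The "auth" value is expected to be base64("username:token");
# only the part before the first colon is printed, so the token is not shown.
# Usage: registry_username <path-to-auth.json>
registry_username() {
  jq -r '.auths["registry.redhat.io"].auth' "$1" | base64 -d | cut -d: -f1
}
```

For example, registry_username ~/.config/containers/auth.json should print the username you noted in step 3.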
    

KUBECONFIG

Set KUBECONFIG to an absolute path, which the deployment scripts require:
export KUBECONFIG="$HOME/.kube/config"
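A quick way to confirm the value is absolute before running the deployment scripts is sketched below. kubeconfig_is_absolute is a hypothetical helper name, and the check assumes KUBECONFIG holds a single path rather than a colon-separated list.

```shell
# Succeed only when KUBECONFIG is set and starts with "/".
# An unset or relative value returns a nonzero status.
kubeconfig_is_absolute() {
  case "${KUBECONFIG:-}" in
    /*) return 0 ;;
    *)  return 1 ;;
  esac
}
```

You could use it as a guard, for example: kubeconfig_is_absolute || echo "KUBECONFIG must be an absolute path".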

Clone the repository

Clone the Red Hat AI Inference Stack repository and change into its directory:
git clone https://github.com/opendatahub-io/rhaii-on-xks.git
cd rhaii-on-xks

Deploy prerequisites

Use make to deploy all stack components that llm-d depends on (cert-manager, Istio, LWS operator, and KServe):
make deploy-all
When the deployment finishes, run:
make status
You should see output similar to the following, with components in Running state and readiness checks passing:
Expected output after make status
=== Deployment Status ===
cert-manager-operator:
NAME                                                        READY   STATUS    RESTARTS   AGE
cert-manager-operator-controller-manager-6d46d864cf-6dc7z   1/1     Running   0          2m24s

cert-manager:
NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-699cdbb7db-nc9jv              1/1     Running   0          2m11s
...

=== Readiness Checks ===
-n cert-manager webhook:
Ready

=== API Versions ===
-n InferencePool API:
v1 (inference.networking.k8s.io)
-n Istio version:
v1.27-latest
-n Istio status:
Healthy

-n GatewayClass 'istio':
Available
For full deployment details, see the Red Hat guide on configuring the inference gateway.

Create the inference gateway

Deploy the inference gateway so you can route requests to your models:
./scripts/setup-gateway.sh
Verify the gateway is programmed:
kubectl get gateways -A
You should see inference-gateway in the opendatahub namespace, with an ADDRESS assigned and PROGRAMMED set to True:
Expected gateway output
NAMESPACE     NAME                CLASS   ADDRESS     PROGRAMMED   AGE
opendatahub   inference-gateway   istio   10.**.**    True         18m
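If PROGRAMMED is not yet True, you can block until it is instead of re-running the get command. The sketch below is an illustration only: wait_for_gateway is a hypothetical helper name and the 300s default timeout is an arbitrary choice. It uses the fully qualified Gateway API resource name so that kubectl does not resolve "gateway" to Istio's own Gateway resource.

```shell
# Wait until the inference gateway reports the Programmed condition.
# Usage: wait_for_gateway [timeout]   (timeout defaults to 300s)
wait_for_gateway() {
  kubectl wait --for=condition=Programmed \
    gateways.gateway.networking.k8s.io/inference-gateway \
    -n opendatahub --timeout="${1:-300s}"
}
```

Calling wait_for_gateway returns a zero exit status once the condition is met, and a nonzero status if the timeout expires first.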

Hello, World deployment

After the gateway is running, you can deploy a model and send inference requests. This section uses the redhat-inference example from the CoreWeave doc-examples repo.

Setup

Create a namespace for the deployment. This tutorial uses llm-d-rhaii:
export NAMESPACE=llm-d-rhaii
kubectl create namespace "$NAMESPACE"
Copy the Red Hat pull secret from the istio-system namespace into the new namespace, then configure the default service account to use it:
kubectl get secret redhat-pull-secret -n istio-system -o json | \
  jq --arg ns "$NAMESPACE" 'del(.metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp, .metadata.annotations, .metadata.labels) | .metadata.namespace = $ns' | \
  kubectl create -f -

kubectl patch serviceaccount default -n "$NAMESPACE" \
  -p '{"imagePullSecrets": [{"name": "redhat-pull-secret"}]}'
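To confirm that the patch took effect, you can read the pull secret names back from the service account. This is a minimal sketch, and sa_pull_secrets is a hypothetical helper name rather than something the repository provides.

```shell
# Print the names of the imagePullSecrets attached to the default
# service account in the given namespace.
# Usage: sa_pull_secrets <namespace>
sa_pull_secrets() {
  kubectl get serviceaccount default -n "$1" \
    -o jsonpath='{.imagePullSecrets[*].name}'
}
```

Running sa_pull_secrets "$NAMESPACE" should print redhat-pull-secret if the patch was applied.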

Download model and deploy

Clone the CoreWeave doc-examples repository:
git clone https://github.com/coreweave/doc-examples.git
Navigate to the redhat-inference directory:
cd doc-examples/cks/redhat-inference
Create the storage claim for the model and start the download job:
kubectl apply -f gpt-oss-pvc.yaml
kubectl apply -f download-job.yaml
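The download may take some time, and you can wait on the Job rather than polling it by hand. The sketch below is hedged on two points: wait_for_job is a hypothetical helper name, and the Job's name and namespace are not stated in this tutorial, so look them up in download-job.yaml or with kubectl get jobs -A. The 30m default timeout is an arbitrary choice.

```shell
# Block until the named Job reports the Complete condition, or time out.
# Usage: wait_for_job <job-name> <namespace> [timeout]   (default 30m)
wait_for_job() {
  kubectl wait --for=condition=complete "job/$1" -n "$2" \
    --timeout="${3:-30m}"
}
```

For example, wait_for_job JOB-NAME "$NAMESPACE", replacing JOB-NAME with the name of your download Job. A zero exit status means the Job completed.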
After the download job completes, deploy the model:
kubectl apply -f deploy.yaml

Make inference request

In a separate terminal, port-forward the inference gateway:
kubectl port-forward svc/inference-gateway-istio 8080:80 -n opendatahub
Send a chat completion request to the endpoint:
curl http://localhost:8080/llm-d-rhaii/gpt-oss/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 50
  }'
You should receive a JSON response with the model output similar to the following:
Expected response
{"id":"chatcmpl-d66c36e3-3d96-4aa9-919d-***","object":"chat.completion","created":1773440761,"model":"openai/gpt-oss-120b","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! How can I assist you today?","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":"The user says \"Hello!\". Probably just a greeting. The assistant should respond with a friendly greeting and perhaps ask how can I help."},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":71,"total_tokens":119,"completion_tokens":48,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
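The raw response is dense. If you only want the assistant's reply, you can filter it with jq. This is a small sketch; extract_reply is a hypothetical helper name, and it assumes the response has the choices[0].message.content layout shown above.

```shell
# Read a chat completion response on stdin and print only the
# assistant's reply text from the first choice.
extract_reply() {
  jq -r '.choices[0].message.content'
}
```

Pipe the curl command from the previous step into extract_reply (adding -s to curl to hide the progress meter) to print just the reply text.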
You have now deployed a model and served an inference request with the Red Hat AI Inference Stack on CKS.

Last modified on April 2, 2026