In this tutorial, you deploy an NVIDIA NIM on a CoreWeave Kubernetes Service (CKS) cluster. This tutorial assumes you are familiar with Kubernetes basics and want to run GPU-accelerated large language model (LLM) inference on CoreWeave. By the end, you have a running LLM inference endpoint and have queried it with the OpenAI-compatible chat completions API.

Prerequisites

Before you start, you must have:
  • A CKS cluster with at least one x86 (amd64) GPU node. NVIDIA NIM containers require x86 architecture. This tutorial uses an L40 GPU, but any NVIDIA GPU with 16 GB or more of VRAM works.
  • kubectl installed and connected to your cluster.
  • An NVIDIA NGC account and API key. To generate an API key, sign in to NGC and go to your profile dropdown > Setup > API Keys.

Set your NGC API key

Export your NGC API key as an environment variable. The commands in this tutorial reference this variable when creating Kubernetes Secrets.
export NGC_API_KEY="[YOUR-NGC-API-KEY]"
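Because every later command reads this variable, it can help to confirm that it is set before you continue. The following is a minimal sketch. check_ngc_key is a hypothetical helper name, not part of the NGC or kubectl tooling, and it only checks that the value is non-empty and is not the placeholder text.

```shell
# Hypothetical helper (not part of the original tutorial): fail early if the
# key is missing or still set to the placeholder value.
check_ngc_key() {
  if [ -z "${NGC_API_KEY:-}" ]; then
    echo "NGC_API_KEY is not set" >&2
    return 1
  fi
  case "$NGC_API_KEY" in
    *YOUR-NGC-API-KEY*)
      echo "NGC_API_KEY still contains the placeholder" >&2
      return 1
      ;;
  esac
  # Print only the length so the key itself never reaches the terminal.
  echo "NGC_API_KEY is set (${#NGC_API_KEY} characters)"
}

# Usage: check_ngc_key
```

This check does not validate the key against NGC. An invalid key surfaces later as an image pull or model download error.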

Verify cluster access

Confirm that kubectl can reach your cluster and that GPU nodes are available:
kubectl get nodes -o wide --no-headers | head -5
You should see output similar to the following:
g75d13e   Ready   <none>   209d   v1.32.4   10.176.196.197   <none>   Ubuntu 24.04.3 LTS   6.8.0-1022-nvidia-64k   containerd://1.7.2
Verify that at least one node has an NVIDIA GPU:
kubectl describe nodes | grep -E "^Name:|nvidia.com/gpu:"
You should see output similar to the following, with at least one node showing nvidia.com/gpu in its capacity:
Name:               g75d13e
  nvidia.com/gpu:     1
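As an alternative sketch, kubectl custom columns can show each node's CPU architecture and GPU count on one line, which also helps confirm the x86 (amd64) prerequisite. The list_gpu_nodes function name is hypothetical, and the filter assumes that nodes without GPUs report <none> for nvidia.com/gpu.

```shell
# Hypothetical helper: print name, architecture, and GPU count for GPU nodes only.
list_gpu_nodes() {
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,ARCH:.status.nodeInfo.architecture,GPUS:.status.capacity.nvidia\.com/gpu' \
    | awk '$3 != "<none>" && $3 > 0'   # keep rows whose GPU count is a positive number
}

# Usage: list_gpu_nodes
```

A node suits this tutorial if its ARCH column reads amd64 and its GPUS column is 1 or more.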

Create NGC secrets

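NIM containers are hosted on NVIDIA’s NGC registry (nvcr.io). To authenticate with NGC, you must create two Kubernetes Secrets: one that the cluster uses to pull the container image, and one that the NIM runtime uses to download model weights. Create the image pull Secret first. The username is the literal string $oauthtoken, so keep the single quotes to stop the shell from expanding it: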
kubectl create secret docker-registry ngc-credentials \
    -n default \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password="$NGC_API_KEY"
You should see output similar to the following:
secret/ngc-credentials created
Create a second Secret to provide the NGC API key to the NIM runtime:
kubectl create secret generic ngc-api-key \
    -n default \
    --from-literal=NGC_API_KEY="$NGC_API_KEY"
You should see output similar to the following:
secret/ngc-api-key created
With the Secrets in place, the cluster can authenticate to NGC to pull the NIM container image and download model weights.
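To confirm that both Secrets exist with the expected types, a check along these lines may help. This is a sketch: check_ngc_secrets is a hypothetical helper, and it relies on kubectl creating docker-registry Secrets with type kubernetes.io/dockerconfigjson and generic Secrets with type Opaque.

```shell
# Hypothetical helper: list both Secrets and fail unless each has the expected type.
check_ngc_secrets() {
  kubectl get secret ngc-credentials ngc-api-key -n default --no-headers \
    -o custom-columns='NAME:.metadata.name,TYPE:.type' \
    | awk '
        $1 == "ngc-credentials" && $2 == "kubernetes.io/dockerconfigjson" { pull = 1 }
        $1 == "ngc-api-key"     && $2 == "Opaque"                         { key = 1 }
        { print }                        # echo each row for the reader
        END { exit !(pull && key) }      # non-zero exit if either Secret is missing
      '
}

# Usage: check_ngc_secrets && echo "Secrets look correct"
```

The check reads only Secret names and types, so the API key is never printed.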

Deploy the NIM

With authentication configured, deploy the NIM itself. Save the following manifest as nim-hello-world.yaml. This manifest defines a single-replica Deployment with one GPU and a ClusterIP Service that exposes the inference API on port 8000.
nim-hello-world.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nim-hello-world
  namespace: default
  labels:
    app: nim-hello-world
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nim-hello-world
  template:
    metadata:
      labels:
        app: nim-hello-world
    spec:
      nodeSelector:
        # Update this value to match an available GPU class in your cluster.
        # Run: kubectl get nodes -o jsonpath='{.items[*].metadata.labels.gpu\.nvidia\.com/class}'
        gpu.nvidia.com/class: L40
      imagePullSecrets:
        - name: ngc-credentials
      containers:
        - name: nim
          image: nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: NGC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: ngc-api-key
                  key: NGC_API_KEY
          resources:
            limits:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: "32Gi"
            requests:
              nvidia.com/gpu: 1
              cpu: "2"
              memory: "16Gi"
          volumeMounts:
            - name: nim-cache
              mountPath: /opt/nim/.cache
          readinessProbe:
            httpGet:
              path: /v1/health/ready
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /v1/health/live
              port: 8000
            initialDelaySeconds: 300
            periodSeconds: 15
      volumes:
        - name: nim-cache
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: nim-hello-world
  namespace: default
spec:
  selector:
    app: nim-hello-world
  ports:
    - name: http
      port: 8000
      targetPort: 8000
  type: ClusterIP
If your cluster uses a different GPU type, adjust the nodeSelector to match. Use kubectl get nodes -o jsonpath='{.items[*].metadata.labels.gpu\.nvidia\.com/class}' to see available GPU classes.
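The jsonpath command above prints one value per node on a single line, so a large cluster repeats the same class many times. A possible variation, shown here as a sketch with a hypothetical function name, prints each distinct class once:

```shell
# Hypothetical helper: print each distinct gpu.nvidia.com/class label value once.
list_gpu_classes() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.labels.gpu\.nvidia\.com/class}{"\n"}{end}' \
    | sort -u \
    | sed '/^$/d'   # drop the empty line produced by nodes without the label
}

# Usage: list_gpu_classes
```

Copy one of the printed values into the nodeSelector in place of L40.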
Apply the manifest to create the Deployment and Service:
kubectl apply -f nim-hello-world.yaml
You should see output similar to the following:
deployment.apps/nim-hello-world created
service/nim-hello-world created

Wait for the NIM to become ready

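On first start, the NIM container downloads model weights into the nim-cache volume. This may take several minutes depending on model size and network speed. Because nim-cache is an emptyDir volume, the weights are downloaded again whenever the Pod is recreated. To check Pod status, list the Pods with the nim-hello-world label: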
kubectl get pods -n default -l app=nim-hello-world --no-headers
You should see output similar to the following once the model is loaded:
nim-hello-world-7c66bb6cf-78p4c   1/1   Running   0     75s
If the Pod restarts during startup, the liveness probe is probably failing before the model finishes loading. Increase initialDelaySeconds on the liveness probe (300 in this manifest), or check the logs with kubectl logs -n default -l app=nim-hello-world --tail=50.
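Instead of polling kubectl get pods, you can block until the Deployment reports ready. The sketch below wraps kubectl rollout status. The wait_for_nim name and the 15-minute default timeout are assumptions, not values from this tutorial, so adjust the timeout to your model size and network speed.

```shell
# Hypothetical helper: wait for the Deployment to become ready, or show recent logs.
# $1: optional timeout (default 15m, an assumed value).
wait_for_nim() {
  timeout="${1:-15m}"
  if ! kubectl rollout status deployment/nim-hello-world -n default --timeout="$timeout"; then
    echo "Rollout did not finish within $timeout; recent logs:" >&2
    kubectl logs -n default -l app=nim-hello-world --tail=50
    return 1
  fi
}

# Usage: wait_for_nim        # default timeout
#        wait_for_nim 30m    # longer timeout for a larger model
```

The rollout completes only after the readiness probe on /v1/health/ready succeeds, so a successful return means the NIM is accepting requests.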
Now that the NIM is running and healthy, you can send it inference requests.

Query the NIM

Send a chat completion request to the NIM. The API is compatible with the OpenAI Chat Completions format. Because the Service type is ClusterIP, the endpoint is reachable only from within the cluster. The following command starts a temporary Pod that sends the request and is deleted when the command exits:
kubectl run curl-test --rm -i --restart=Never \
    --image=curlimages/curl -- \
    curl -s http://nim-hello-world.default.svc.cluster.local:8000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "meta/llama3-8b-instruct",
      "messages": [{"role": "user", "content": "What is CoreWeave in one sentence?"}],
      "max_tokens": 128
    }'
You should see a JSON response similar to the following:
{
  "id": "cmpl-89a23df5d11b44f29008e04095f1da0d",
  "object": "chat.completion",
  "model": "meta/llama3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "CoreWeave is a cloud-based HPC (High-Performance Computing) platform that enables researchers, scientists, and innovators to run demanding computational workloads, such as genomics, climate modeling, and artificial intelligence, using advanced technologies like AI, GPU acceleration, and multi-cloud deployment."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 19,
    "total_tokens": 78,
    "completion_tokens": 59
  }
}
At this point, you have a working NIM inference endpoint running on your CKS cluster. You have authenticated to NGC, deployed the model, and confirmed it responds to chat completion requests.
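The model field in the request must match a model name that the NIM serves. One way to check, sketched below, is to call the OpenAI-compatible /v1/models endpoint from a temporary Pod, the same approach the chat completion request uses. The nim_get function and the generated Pod name are hypothetical.

```shell
# Hypothetical helper: send a GET request to the NIM Service from a temporary Pod.
# $1: API path, for example /v1/models or /v1/health/ready
nim_get() {
  # $$ (the shell's process ID) keeps the Pod name from clashing with an
  # earlier run; --quiet suppresses kubectl's own prompt messages.
  kubectl run "nim-get-$$" --rm -i --quiet --restart=Never \
    --image=curlimages/curl -- \
    curl -s "http://nim-hello-world.default.svc.cluster.local:8000$1"
}

# Usage: nim_get /v1/models
```

If you deploy a different NIM image, the listed model names are the values to use in the model field of later requests.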

Clean up

When you’re done, delete the Deployment and Service defined in the manifest, then delete the two NGC Secrets:
kubectl delete -f nim-hello-world.yaml --ignore-not-found
kubectl delete secret ngc-credentials ngc-api-key -n default --ignore-not-found
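To confirm that nothing from this tutorial is left behind, a check like the following may be useful. It is a sketch with a hypothetical function name. It looks for resources carrying the app=nim-hello-world label and for the two Secrets by name.

```shell
# Hypothetical helper: report any tutorial resources that still exist.
verify_cleanup() {
  remaining="$(
    # Deployment, Service, and Pods are matched by the tutorial's label.
    kubectl get deployment,service,pod -n default -l app=nim-hello-world -o name 2>/dev/null
    # Secrets carry no label, so look them up by name.
    kubectl get secret ngc-credentials ngc-api-key -n default -o name --ignore-not-found 2>/dev/null
  )"
  if [ -n "$remaining" ]; then
    echo "Still present:"
    echo "$remaining"
    return 1
  fi
  echo "All tutorial resources are gone."
}

# Usage: verify_cleanup
```

Pods can stay in the Terminating state for a short time after the delete, so rerun the check if a Pod is still listed. You can also run unset NGC_API_KEY to remove the key from your shell session.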

Next steps

Now that you have a working NIM deployment, explore these options:
  • Try a different model. Replace the image and model name in the manifest to deploy any NIM-supported model.
  • Add autoscaling. See the Deploy vLLM for Inference tutorial for patterns using KEDA and Prometheus-based autoscaling.
  • Expose externally. Add an Ingress or LoadBalancer Service to serve inference requests from outside the cluster.
Last modified on April 6, 2026