In this tutorial, you deploy an NVIDIA NIM on a CoreWeave Kubernetes Service (CKS) cluster. This tutorial assumes you are familiar with Kubernetes basics and want to run GPU-accelerated large language model (LLM) inference on CoreWeave. By the end, you have a running LLM inference endpoint and have queried it with the OpenAI-compatible chat completions API.
Prerequisites
Before you start, you must have:
- A CKS cluster with at least one x86 (amd64) GPU node. NVIDIA NIM containers require x86 architecture. This tutorial uses an L40 GPU, but any NVIDIA GPU with 16 GB or more of VRAM works.
- kubectl installed and connected to your cluster.
- An NVIDIA NGC account and API key. To generate an API key, sign in to NGC and go to your profile dropdown > Setup > API Keys.
Set your NGC API key
Export your NGC API key as an environment variable. The commands in this tutorial reference this variable when creating Kubernetes Secrets.
export NGC_API_KEY="[YOUR-NGC-API-KEY]"
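Because the later Secret commands read this variable, an unset or placeholder value can lead to Secrets that do not hold a usable key. The following is a minimal sketch of an early check; the function name check_ngc_key is hypothetical and not part of any NVIDIA or CoreWeave tooling.

```shell
# Hypothetical helper: fail early if NGC_API_KEY is unset, empty,
# or still set to the placeholder text from this tutorial.
check_ngc_key() {
  if [ -z "${NGC_API_KEY:-}" ] || [ "$NGC_API_KEY" = "[YOUR-NGC-API-KEY]" ]; then
    echo "NGC_API_KEY is not set to a real key" >&2
    return 1
  fi
  # Print only the length so the key itself never reaches the terminal.
  echo "NGC_API_KEY is set (${#NGC_API_KEY} characters)"
}
```

Run check_ngc_key after the export and continue only if it returns successfully.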
Verify cluster access
Confirm that kubectl can reach your cluster and that GPU nodes are available:
kubectl get nodes -o wide --no-headers | head -5
You should see output similar to the following:
g75d13e Ready <none> 209d v1.32.4 10.176.196.197 <none> Ubuntu 24.04.3 LTS 6.8.0-1022-nvidia-64k containerd://1.7.2
Verify that at least one node has an NVIDIA GPU:
kubectl describe nodes | grep -E "^Name:|nvidia.com/gpu:"
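You should see output similar to the following, with at least one node showing nvidia.com/gpu. A GPU node may list nvidia.com/gpu more than once, because kubectl describe reports it under both Capacity and Allocatable: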
Name: g75d13e
nvidia.com/gpu: 1
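On clusters with many nodes, the grep output can be hard to scan. As a sketch, the describe output can be condensed to one line per GPU node with awk. The function name summarize_gpus is hypothetical, and the parsing assumes the usual kubectl describe layout, where each node section starts with a Name: line.

```shell
# Hypothetical helper: read `kubectl describe nodes` output on stdin and
# print one "<node> <gpu-count>" line per node that reports nvidia.com/gpu.
summarize_gpus() {
  awk '
    /^Name:/ { name = $2 }                      # remember the current node
    /nvidia\.com\/gpu:/ && !(name in seen) {    # first GPU line for this node
      print name, $2
      seen[name] = 1
    }
  '
}

# Usage (requires cluster access):
#   kubectl describe nodes | summarize_gpus
```

Nodes without GPUs produce no output, so an empty result suggests that no GPU node is visible to kubectl.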
Create NGC secrets
NIM containers are hosted on NVIDIA’s NGC registry (nvcr.io). To authenticate with NGC, you must create two Kubernetes Secrets: one to pull the container image, and one for the NIM runtime to download model weights. Create the image pull Secret first:
kubectl create secret docker-registry ngc-credentials \
  -n default \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"
You should see output similar to the following:
secret/ngc-credentials created
Create a second Secret to provide the NGC API key to the NIM runtime:
kubectl create secret generic ngc-api-key \
  -n default \
  --from-literal=NGC_API_KEY="$NGC_API_KEY"
You should see output similar to the following:
secret/ngc-api-key created
With the Secrets in place, the cluster can authenticate to NGC to pull the NIM container image and download model weights.
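For context, a docker-registry Secret stores a Docker config JSON whose auth field is the Base64 encoding of username:password. The sketch below reproduces that one field for the NGC login used above, which can help when comparing against a decoded Secret during troubleshooting. The function name ngc_auth_field is hypothetical.

```shell
# Hypothetical helper: print the Base64 "auth" value that a Docker config
# holds for the NGC login (literal $oauthtoken username, API key as password).
ngc_auth_field() {
  # Single quotes keep $oauthtoken literal; tr removes base64 line wrapping.
  printf '%s:%s' '$oauthtoken' "$1" | base64 | tr -d '\n'
  echo
}

# Usage:
#   ngc_auth_field "$NGC_API_KEY"
```

The output is equivalent to the credential itself, so handle it with the same care as the API key.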
Deploy the NIM
With authentication configured, deploy the NIM itself. Save the following manifest as nim-hello-world.yaml. This manifest defines a single-replica Deployment with one GPU and a ClusterIP Service that exposes the inference API on port 8000.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nim-hello-world
  namespace: default
  labels:
    app: nim-hello-world
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nim-hello-world
  template:
    metadata:
      labels:
        app: nim-hello-world
    spec:
      nodeSelector:
        # Update this value to match an available GPU class in your cluster.
        # Run: kubectl get nodes -o jsonpath='{.items[*].metadata.labels.gpu\.nvidia\.com/class}'
        gpu.nvidia.com/class: L40
      imagePullSecrets:
        - name: ngc-credentials
      containers:
        - name: nim
          image: nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: NGC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: ngc-api-key
                  key: NGC_API_KEY
          resources:
            limits:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: "32Gi"
            requests:
              nvidia.com/gpu: 1
              cpu: "2"
              memory: "16Gi"
          volumeMounts:
            - name: nim-cache
              mountPath: /opt/nim/.cache
          readinessProbe:
            httpGet:
              path: /v1/health/ready
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /v1/health/live
              port: 8000
            initialDelaySeconds: 300
            periodSeconds: 15
      volumes:
        - name: nim-cache
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: nim-hello-world
  namespace: default
spec:
  selector:
    app: nim-hello-world
  ports:
    - name: http
      port: 8000
      targetPort: 8000
  type: ClusterIP
If your cluster uses a different GPU type, adjust the nodeSelector to match. Use kubectl get nodes -o jsonpath='{.items[*].metadata.labels.gpu\.nvidia\.com/class}' to see available GPU classes.
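The jsonpath query prints one value per labeled node on a single line, so the same class can repeat many times. As a small sketch, the output can be reduced to a sorted list of distinct classes; the function name unique_gpu_classes is hypothetical.

```shell
# Hypothetical helper: turn the space-separated jsonpath output into a
# sorted list of distinct GPU classes, one per line.
unique_gpu_classes() {
  tr -s ' ' '\n' | sed '/^$/d' | sort -u
}

# Usage (requires cluster access):
#   kubectl get nodes -o jsonpath='{.items[*].metadata.labels.gpu\.nvidia\.com/class}' | unique_gpu_classes
```

Any value in the resulting list is a candidate for the gpu.nvidia.com/class nodeSelector.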
Apply the manifest to create the Deployment and Service:
kubectl apply -f nim-hello-world.yaml
You should see output similar to the following:
deployment.apps/nim-hello-world created
service/nim-hello-world created
Wait for the NIM to become ready
On first start, the NIM container downloads model weights. This may take several minutes depending on model size and network speed.
To check Pod status, list the Pods with the nim-hello-world label:
kubectl get pods -n default -l app=nim-hello-world --no-headers
You should see output similar to the following once the model is loaded:
nim-hello-world-7c66bb6cf-78p4c 1/1 Running 0 75s
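Instead of listing Pods repeatedly, kubectl rollout status can block until the Deployment's Pods are ready. A minimal sketch follows; the wrapper name wait_for_nim and the 15m default timeout are assumptions, so choose a timeout that suits your model size and network.

```shell
# Hypothetical wrapper: wait for the nim-hello-world Deployment to finish
# rolling out. The optional first argument overrides the assumed 15m timeout.
wait_for_nim() {
  kubectl rollout status deployment/nim-hello-world -n default --timeout="${1:-15m}"
}

# Usage (requires cluster access):
#   wait_for_nim        # wait up to 15 minutes
#   wait_for_nim 30m    # wait up to 30 minutes
```

The command exits with a nonzero status if the timeout elapses first, which makes it convenient in scripts.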
If the Pod restarts during startup, the liveness probe likely failed before the model finished loading. Check the logs with kubectl logs -n default -l app=nim-hello-world --tail=50, and increase initialDelaySeconds on the liveness probe if the model needs more time to load.
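To reason about how long the container has before the liveness probe can restart it, a rough estimate is the initial delay plus the time needed to reach the failure threshold. The sketch below assumes the Kubernetes default failureThreshold of 3, because the manifest does not set one, and it ignores probe timeouts and kubelet scheduling jitter, so treat the result as approximate. The function name liveness_budget is hypothetical.

```shell
# Hypothetical helper: estimate seconds from container start until the
# liveness probe can trigger a restart.
# Args: initialDelaySeconds, periodSeconds, failureThreshold (defaults to 3).
liveness_budget() {
  delay=$1
  period=$2
  threshold=${3:-3}
  # The first probe runs at about `delay`; each further failure takes one period.
  echo $(( delay + (threshold - 1) * period ))
}

liveness_budget 300 15   # values from the manifest in this tutorial
```

With the manifest values this prints 330, so a model that needs longer than about five and a half minutes to load would call for a larger initialDelaySeconds.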
Now that the NIM is running and healthy, you can send it inference requests.
Query the NIM
Send a chat completion request to the NIM. The API is compatible with the OpenAI Chat Completions format.
Because the Service type is ClusterIP, the endpoint is only reachable from within the cluster. The following command uses a temporary Pod to send the request:
kubectl run curl-test --rm -i --restart=Never \
  --image=curlimages/curl -- \
  curl -s http://nim-hello-world.default.svc.cluster.local:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "messages": [{"role": "user", "content": "What is CoreWeave in one sentence?"}],
    "max_tokens": 128
  }'
You should see a JSON response similar to the following:
{
  "id": "cmpl-89a23df5d11b44f29008e04095f1da0d",
  "object": "chat.completion",
  "model": "meta/llama3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "CoreWeave is a cloud-based HPC (High-Performance Computing) platform that enables researchers, scientists, and innovators to run demanding computational workloads, such as genomics, climate modeling, and artificial intelligence, using advanced technologies like AI, GPU acceleration, and multi-cloud deployment."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 19,
    "total_tokens": 78,
    "completion_tokens": 59
  }
}
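The temporary Pod works for a one-off test. For repeated requests from a workstation, one common alternative is kubectl port-forward, which tunnels a local port to the Service. The sketch below only builds the request body; the helper name chat_payload is hypothetical, and it assumes the prompt contains no double quotes, backslashes, or newlines because it performs no JSON escaping.

```shell
# Hypothetical helper: build a single-turn chat completion request body.
# Args: prompt text, max_tokens (defaults to 128, as in this tutorial).
chat_payload() {
  printf '{"model": "meta/llama3-8b-instruct", "messages": [{"role": "user", "content": "%s"}], "max_tokens": %d}\n' \
    "$1" "${2:-128}"
}

# Usage, assuming a port-forward is running in another terminal:
#   kubectl port-forward -n default svc/nim-hello-world 8000:8000
#   chat_payload "What is a GPU?" | curl -s http://localhost:8000/v1/chat/completions \
#     -H 'Content-Type: application/json' -d @-
```

For prompts with arbitrary text, a JSON-aware tool is a safer way to build the body than printf.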
At this point, you have a working NIM inference endpoint running on your CKS cluster. You have authenticated to NGC, deployed the model, and confirmed it responds to chat completion requests.
Clean up
When you’re done, delete the Deployment and Service defined in the manifest, then delete the two NGC Secrets:
kubectl delete -f nim-hello-world.yaml --ignore-not-found
kubectl delete secret ngc-credentials ngc-api-key -n default --ignore-not-found
Next steps
Now that you have a working NIM deployment, explore these options:
- Try a different model. Replace the image and model name in the manifest to deploy any NIM-supported model.
- Add autoscaling. See the Deploy vLLM for Inference tutorial for patterns using KEDA and Prometheus-based autoscaling.
- Expose externally. Add an Ingress or LoadBalancer Service to serve inference requests from outside the cluster.