- Download the TensorRT-LLM example from the marimo-operator repository.
- Deploy to CKS with a single CLI command or a YAML manifest.
- Run inference against open-weight models including FP8-quantized checkpoints.
What you'll need
Before you start, you must have:
- A CKS cluster with an NVIDIA GPU Node (24 GB VRAM or greater recommended, 48 GB for larger models).
- The marimo operator installed on your cluster.
- kubectl installed and configured to access your cluster.
- kubectl-marimo installed (uv tool install kubectl-marimo).
What you'll use
You use these tools and services:
- marimo-operator: Manages notebook deployments on Kubernetes.
- TensorRT-LLM: NVIDIA’s optimized LLM inference library.
- NVIDIA Triton container: Pre-built NGC image with TensorRT-LLM.
- kubectl-marimo: CLI plugin for running notebooks on Kubernetes.
Container image
The example uses a purpose-built image layered on top of nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3. It’s published at ghcr.io/marimo-team/marimo-operator/tensorrt:latest and rebuilt automatically on every push to main. To adapt it for your own workload, swap the FROM image or add your own packages.
Get the example
The marimo-operator TensorRT example includes both a notebook file and a plain YAML manifest. Both options deploy the same workload, so use whichever fits your workflow. Choose Option A for an interactive notebook-first experience, or Option B if you prefer to manage Kubernetes resources declaratively.
Option A: CLI plugin (interactive)
Download and deploy the notebook in one step. Replace [NAMESPACE] with the Kubernetes namespace you want to deploy into:
Option B: YAML manifest (declarative)
Apply the manifest directly if you prefer managing resources with kubectl:
Edit the manifest to adjust the nodeSelector, storage size, or GPU count before applying.
Select a model and run inference
The notebook opens with a model picker. The included models all fit comfortably within 48 GB of VRAM:

| Model | VRAM (approx.) | Notes |
|---|---|---|
| TinyLlama 1.1B | ~3 GB | Fastest to load, good for testing |
| Phi-3.5-mini | ~8 GB | Strong reasoning, 128K context |
| Mistral 7B | ~14 GB | Solid general-purpose model |
| Llama-3.1 8B FP8 | ~8 GB | FP8-quantized by NVIDIA, ~50% less VRAM than FP16 |
| Minitron 8B | ~16 GB | Mistral-NeMo 12B distilled to 8B |
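
The VRAM figures in the table track a simple rule of thumb: weight memory is roughly the parameter count multiplied by the bytes per parameter (2 for FP16, 1 for FP8). The following is a minimal Python sketch of that arithmetic, not code from the example notebook. The parameter counts are rounded approximations, and the `headroom` factor of 1.2 is an assumed allowance for KV cache and activations rather than a measured value.

```python
# Bytes per parameter for the two weight precisions mentioned in the table.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0}

# Approximate parameter counts in billions (rounded, assumed for illustration).
MODELS = {
    "TinyLlama 1.1B": (1.1, "fp16"),
    "Phi-3.5-mini": (3.8, "fp16"),
    "Mistral 7B": (7.2, "fp16"),
    "Llama-3.1 8B FP8": (8.0, "fp8"),
    "Minitron 8B": (8.0, "fp16"),
}


def weight_gb(params_billions, dtype):
    """Estimate weight memory in GB (1 GB = 1e9 bytes): parameters x bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[dtype]


def models_within(budget_gb, headroom=1.2):
    """Return models whose weights, scaled by an assumed headroom factor, fit the budget."""
    return [
        name
        for name, (params, dtype) in MODELS.items()
        if weight_gb(params, dtype) * headroom <= budget_gb
    ]


if __name__ == "__main__":
    for name, (params, dtype) in MODELS.items():
        print(f"{name}: ~{weight_gb(params, dtype):.1f} GB of weights ({dtype})")
    print("Fits in 24 GB:", models_within(24))
```

By this estimate, an 8B model needs about 16 GB of weights in FP16 and about 8 GB in FP8, which is where the roughly 50% saving comes from. Actual usage also depends on sequence length, batch size, and runtime buffers, so treat the result as a lower bound.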
Clean up
When you’re done, remove the notebook deployment to release the GPU Node and avoid further charges. Press Ctrl-C to stop the port-forward, then delete the notebook:
Additional resources
- TensorRT-LLM documentation: Quick start, LLM Python API reference, supported models.
- FP8 quantization guide: How NVIDIA’s FP8 calibration works and when to use it.
- NVIDIA NGC Triton container: The base container image used in this example.
- marimo-operator TensorRT example: Dockerfile, notebook, YAML manifest, and README.
- marimo notebooks on CKS: General setup guide for the marimo operator and CLI plugin.
- NVIDIA blog: Optimizing LLM inference with TensorRT-LLM: Overview of TensorRT-LLM’s optimization techniques.