TensorRT-LLM is NVIDIA’s open-source library for optimizing and running large language model inference on NVIDIA GPUs. It fuses operations, applies quantization, and compiles models into optimized TensorRT engines, delivering higher throughput and lower latency compared to standard PyTorch inference.

CKS makes it straightforward to run TensorRT-LLM workloads: pull the official NVIDIA Triton NGC container, add whatever tooling your workflow needs, and deploy to a GPU Node. This tutorial demonstrates that pattern through a marimo notebook example, an interactive notebook with a live model picker and prompt selector, but the same container approach works for any TensorRT-LLM workload. By the end, you have a working notebook environment that can load and serve multiple open-weight models for low-latency inference on a CKS GPU Node.

In this tutorial, you:
  1. Download the TensorRT-LLM example from the marimo-operator repository.
  2. Deploy to CKS with a single CLI command or a YAML manifest.
  3. Run inference against open-weight models including FP8-quantized checkpoints.

What you'll need

Before you start, you must have:
  • A CKS cluster with an NVIDIA GPU Node (24 GB VRAM or greater recommended, 48 GB for larger models).
  • The marimo operator installed on your cluster.
  • kubectl installed and configured to access your cluster.
  • kubectl-marimo installed (uv tool install kubectl-marimo).

What you'll use

You use these tools and services:
Container image: The example uses a purpose-built image layered on top of nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3. It’s published at ghcr.io/marimo-team/marimo-operator/tensorrt:latest and rebuilt automatically on every push to main. To adapt it for your own workload, swap the FROM image or add your own packages.

Get the example

The marimo-operator TensorRT example includes both a notebook file and a plain YAML manifest. Use whichever fits your workflow. Both options deploy the same workload. Choose Option A for an interactive notebook-first experience, or Option B if you prefer to manage Kubernetes resources declaratively.

Option A: CLI plugin (interactive)

Download the notebook, then deploy it with the plugin. Replace [NAMESPACE] with the Kubernetes namespace you want to deploy into:
curl -O https://raw.githubusercontent.com/marimo-team/marimo-operator/main/examples/tensorrt/tensorrt.py
kubectl marimo edit tensorrt.py --namespace [NAMESPACE]
The plugin reads the Kubernetes config embedded in the notebook’s PEP 723 header (image, GPU limits, storage size) and generates the manifest for you. When the notebook is ready, the plugin prints output similar to the following:
Waiting for tensorrt to be ready...
Opening http://localhost:2718?access_token=<TOKEN>
Press Ctrl-C to stop port-forward and sync changes

Option B: YAML manifest (declarative)

Apply the manifest directly if you prefer managing resources with kubectl:
kubectl apply -f https://raw.githubusercontent.com/marimo-team/marimo-operator/main/examples/tensorrt/tensorrt.yaml --namespace [NAMESPACE]
To adjust the nodeSelector, storage size, or GPU count, download the manifest first, edit your local copy, and apply that file instead of the URL.

Select a model and run inference

The notebook opens with a model picker. The included models all fit comfortably within 48 GB of VRAM:
Model            | VRAM (approx.) | Notes
TinyLlama 1.1B   | ~3 GB          | Fastest to load, good for testing
Phi-3.5-mini     | ~8 GB          | Strong reasoning, 128K context
Mistral 7B       | ~14 GB         | Solid general-purpose model
Llama-3.1 8B FP8 | ~8 GB          | FP8-quantized by NVIDIA, ~50% less VRAM than FP16
Minitron 8B      | ~16 GB         | Mistral-NeMo 12B distilled to 8B
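
Several of the figures in the table line up with a simple rule of thumb: weight memory is roughly the parameter count multiplied by the bytes stored per parameter. The sketch below applies that rule, assuming 16-bit weights for the rows not marked FP8 (an assumption; the table does not state their precision). It counts weights only, so the KV cache, activations, and runtime overhead come on top of these numbers.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


# Parameter counts are taken from the model names in the table;
# the fp16 precision for non-FP8 rows is an assumption.
for name, params, precision in [
    ("Mistral 7B", 7.0, "fp16"),
    ("Llama-3.1 8B FP8", 8.0, "fp8"),
    ("Minitron 8B", 8.0, "fp16"),
]:
    print(f"{name}: ~{weight_memory_gb(params, precision):.0f} GB")
```

The same arithmetic shows where the "~50% less VRAM than FP16" note comes from: one byte per parameter instead of two halves the weight footprint.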
Select a model, then pick a prompt from the second dropdown. The generation cell is reactive: changing the prompt re-runs inference without reloading the model. You now have a TensorRT-LLM-optimized model loaded in your CKS-hosted notebook and can iterate on prompts interactively against GPU-accelerated inference.

Clean up

When you’re done, remove the notebook deployment to release the GPU Node and avoid further charges. Press Ctrl-C to stop the port-forward, then delete the notebook:
kubectl marimo delete tensorrt --namespace [NAMESPACE]
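If you used the YAML manifest, delete the resources it created. The command below assumes a local copy of tensorrt.yaml; if you applied the manifest directly from the URL, pass that same URL to -f instead: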
kubectl delete -f tensorrt.yaml --namespace [NAMESPACE]

Last modified on June 10, 2026