Run TensorRT-LLM inference with marimo notebooks

TensorRT-LLM is NVIDIA’s open-source library for optimizing and running large language model inference on NVIDIA GPUs. It fuses operations, applies quantization, and compiles models into optimized TensorRT engines, which delivers higher throughput and lower latency than standard PyTorch inference.

CKS makes it straightforward to run TensorRT-LLM workloads: pull the official NVIDIA Triton NGC container, add whatever tooling your workflow needs, and deploy to a GPU node. This tutorial demonstrates that pattern with a marimo notebook that provides a live model picker and prompt selector. The same container approach works for any TensorRT-LLM workload.

In this tutorial, you will:

Download the TensorRT-LLM example from the marimo-operator repository
Deploy it to CKS with a single CLI command or a YAML manifest
Run inference against open-weight models including FP8-quantized checkpoints

What you'll need

Before you start, you must have:

A CKS cluster with an NVIDIA GPU node (≥ 24 GB VRAM recommended; 48 GB for larger models)
The marimo operator installed on your cluster
kubectl installed and configured to access your cluster
kubectl-marimo installed (uv tool install kubectl-marimo)
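
A quick way to confirm the two command-line prerequisites is to check that their executables are on your PATH. The following is a minimal sketch, not part of the example repository. It relies on the fact that kubectl discovers plugins as executables named kubectl-<name> on PATH, and it assumes that uv tool install kubectl-marimo places an executable called kubectl-marimo there. The missing_tools helper is a hypothetical name introduced for illustration.

```python
import shutil
from typing import Callable, Iterable, Optional

# Executables this tutorial expects on PATH. The plugin name is an assumption
# based on kubectl's "kubectl-<name>" plugin naming convention.
REQUIRED_TOOLS = ("kubectl", "kubectl-marimo")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("kubectl and kubectl-marimo are both on PATH")
```

This only checks the local tools. It does not verify cluster access or that the marimo operator is installed on the cluster.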

What you'll use

You’ll use these tools and services:

marimo-operator: Manages notebook deployments on Kubernetes
TensorRT-LLM: NVIDIA’s optimized LLM inference library
NVIDIA Triton container: Pre-built NGC image with TensorRT-LLM
kubectl-marimo: CLI plugin for running notebooks on Kubernetes

Container image

The example uses a purpose-built image layered on top of nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3. It is published at ghcr.io/marimo-team/marimo-operator/tensorrt:latest and rebuilt automatically on every push to main. Adapting it for your own workload is as simple as swapping the FROM image or adding your own packages.

Get the example

The marimo-operator TensorRT example includes both a notebook file and a plain YAML manifest. Use whichever fits your workflow.

Option A: CLI plugin (interactive)

Download the notebook, then deploy it with the plugin:

curl -O https://raw.githubusercontent.com/marimo-team/marimo-operator/main/examples/tensorrt/tensorrt.py
kubectl marimo edit tensorrt.py --namespace NAMESPACE

The plugin reads the Kubernetes config embedded in the notebook’s PEP 723 header (image, GPU limits, storage size) and generates the manifest for you. Once the notebook is ready, the plugin starts a port-forward and prints the notebook URL:

Waiting for tensorrt to be ready...
Opening http://localhost:2718?access_token=<TOKEN>
Press Ctrl+C to stop port-forward and sync changes

Option B: YAML manifest (declarative)

Apply the manifest directly if you prefer managing resources with kubectl:

kubectl apply -f https://raw.githubusercontent.com/marimo-team/marimo-operator/main/examples/tensorrt/tensorrt.yaml --namespace NAMESPACE

Edit the manifest to adjust the nodeSelector, storage size, or GPU count before applying.

Select a model and run inference

The notebook opens with a model picker. The included models all fit comfortably within 48 GB of VRAM:

Model	VRAM (approx.)	Notes
TinyLlama 1.1B	~3 GB	Fastest to load; good for testing
Phi-3.5-mini	~8 GB	Strong reasoning, 128K context
Mistral 7B	~14 GB	Solid general-purpose model
Llama-3.1 8B FP8	~8 GB	FP8-quantized by NVIDIA; ~50% less VRAM than FP16
Minitron 8B	~16 GB	Mistral-NeMo 12B distilled to 8B
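
The VRAM figures in the table are close to a simple rule of thumb: weight memory is roughly the parameter count multiplied by the bytes used per parameter. The sketch below works through that arithmetic. The parameter counts are rounded values taken from the model names or model cards, and the choice of FP16 for the non-FP8 models is an assumption. The estimate covers weights only; KV cache, activations, and runtime overhead add to it.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}


def weight_memory_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * BYTES_PER_PARAM[dtype]


# (name, parameters in billions, assumed precision); rounded, illustrative values.
MODELS = [
    ("TinyLlama 1.1B", 1.1, "fp16"),
    ("Phi-3.5-mini", 3.8, "fp16"),
    ("Mistral 7B", 7.0, "fp16"),
    ("Llama-3.1 8B FP8", 8.0, "fp8"),
    ("Minitron 8B", 8.0, "fp16"),
]

for name, params, dtype in MODELS:
    print(f"{name:<18} {dtype:<5} ~{weight_memory_gb(params, dtype):.1f} GB")
```

An 8B model needs about 16 GB of weights in FP16 and about 8 GB in FP8, which is where the roughly 50% saving noted for the Llama-3.1 8B FP8 checkpoint comes from.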

Select a model, then pick a prompt from the second dropdown. The generation cell is reactive: changing the prompt re-runs inference without reloading the model.
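
The effect described above, where a new prompt triggers generation but not a model reload, can be imitated in plain Python by caching the loaded model by name. This is an analogy only, not how marimo implements reactivity and not the notebook's actual code. In marimo, a cell re-runs when a variable it reads changes, so a model-loading cell that reads only the model picker presumably stays untouched when the prompt changes. FakeModel, load_model, and run_inference are hypothetical stand-ins for a real TensorRT-LLM model.

```python
from functools import lru_cache


class FakeModel:
    """Stand-in for an expensive-to-load model; not a TensorRT-LLM object."""

    def __init__(self, name: str) -> None:
        self.name = name

    def generate(self, prompt: str) -> str:
        return f"[{self.name}] reply to: {prompt}"


@lru_cache(maxsize=1)
def load_model(name: str) -> FakeModel:
    """Load a model once per name; repeat calls with the same name hit the cache."""
    return FakeModel(name)


def run_inference(model_name: str, prompt: str) -> str:
    """Generate text, reusing the cached model when the name has not changed."""
    return load_model(model_name).generate(prompt)


run_inference("tinyllama", "Explain FP8 in one sentence.")
run_inference("tinyllama", "Write a haiku about GPUs.")  # same model, new prompt
print(load_model.cache_info().misses)  # number of actual model loads so far
```

Switching to a different model name causes a second load, which mirrors what happens when you change the model picker in the notebook.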

Clean up

Press Ctrl+C to stop the port-forward, then delete the notebook:

kubectl marimo delete tensorrt --namespace NAMESPACE

Or if you used the YAML manifest:

kubectl delete -f https://raw.githubusercontent.com/marimo-team/marimo-operator/main/examples/tensorrt/tensorrt.yaml --namespace NAMESPACE

Additional resources

TensorRT-LLM documentation: Quick start, LLM Python API reference, supported models
FP8 quantization guide: How NVIDIA’s FP8 calibration works and when to use it
NVIDIA NGC Triton container: The base container image used in this example
marimo-operator TensorRT example: Dockerfile, notebook, YAML manifest, and README
marimo notebooks on CKS: General setup guide for the marimo operator and CLI plugin
NVIDIA blog: Optimizing LLM inference with TensorRT-LLM: Overview of TensorRT-LLM’s optimization techniques
