TensorRT-LLM is NVIDIA’s open-source library for optimizing and running large language model inference on NVIDIA GPUs. It fuses operations, applies quantization, and compiles models into optimized TensorRT engines, delivering significantly higher throughput and lower latency than standard PyTorch inference. CKS makes it straightforward to run TensorRT-LLM workloads: pull the official NVIDIA Triton NGC container, add whatever tooling your workflow needs, and deploy to a GPU node. This tutorial demonstrates that pattern with a marimo notebook that provides a live model picker and prompt selector, but the same container approach works for any TensorRT-LLM workload.
In this tutorial, you will:
- Download the TensorRT-LLM example from the marimo-operator repository
- Deploy it to CKS with a single CLI command or a YAML manifest
- Run inference against open-weight models including FP8-quantized checkpoints
What you'll need
Before you start, you must have:
- A CKS cluster with an NVIDIA GPU node (≥ 24 GB VRAM recommended; 48 GB for larger models)
- The marimo operator installed on your cluster
- `kubectl` installed and configured to access your cluster
- `kubectl-marimo` installed (`uv tool install kubectl-marimo`)
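The VRAM guidance above can be sanity-checked with a weights-only estimate: parameter count times bytes per parameter. The sketch below is illustrative and not part of the example repository. The function names and the 20% headroom reserved for KV cache and activations are assumptions, and real usage also depends on batch size, context length, and engine settings.

```python
# Weights-only VRAM estimate: parameters x bytes per parameter.
# Assumption: 1 GB = 1e9 bytes, so billions of parameters map directly to GB.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0}


def weight_vram_gb(params_billions: float, dtype: str = "fp16") -> float:
    """Approximate GB of VRAM needed for the model weights alone."""
    return params_billions * BYTES_PER_PARAM[dtype]


def fits(params_billions: float, dtype: str, gpu_vram_gb: float,
         headroom: float = 0.2) -> bool:
    """True if the weights fit while keeping `headroom` (a fraction of total
    VRAM, an assumed value) free for KV cache and activations."""
    return weight_vram_gb(params_billions, dtype) <= gpu_vram_gb * (1 - headroom)


if __name__ == "__main__":
    # Parameter counts are taken from the model names; dtypes are illustrative.
    for name, params, dtype in [
        ("TinyLlama 1.1B", 1.1, "fp16"),
        ("Llama-3.1 8B", 8.0, "fp16"),
        ("Llama-3.1 8B FP8", 8.0, "fp8"),
    ]:
        gb = weight_vram_gb(params, dtype)
        print(f"{name}: ~{gb:.1f} GB weights, fits on 24 GB: {fits(params, dtype, 24)}")
```

By this estimate an 8B model needs about 16 GB in FP16 and about 8 GB in FP8, which is why a 24 GB GPU covers the smaller models while 48 GB leaves room for larger ones.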
What you'll use
You’ll use these tools and services:
- marimo-operator: Manages notebook deployments on Kubernetes
- TensorRT-LLM: NVIDIA’s optimized LLM inference library
- NVIDIA Triton container: Pre-built NGC image with TensorRT-LLM
- kubectl-marimo: CLI plugin for running notebooks on Kubernetes
Container image
The example uses a purpose-built image layered on top of `nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3`. It is published at `ghcr.io/marimo-team/marimo-operator/tensorrt:latest` and rebuilt automatically on every push to main. Adapting it for your own workload is as simple as swapping the FROM image or adding your own packages.
Get the example
The marimo-operator TensorRT example includes both a notebook file and a plain YAML manifest. Use whichever fits your workflow.
Option A: CLI plugin (interactive)
Download and deploy the notebook in one step:
Option B: YAML manifest (declarative)
Apply the manifest directly if you prefer managing resources with `kubectl`:
Adjust the `nodeSelector`, storage size, or GPU count in the manifest before applying.
Select a model and run inference
The notebook opens with a model picker. The included models all fit comfortably within 48 GB of VRAM:

| Model | VRAM (approx.) | Notes |
|---|---|---|
| TinyLlama 1.1B | ~3 GB | Fastest to load; good for testing |
| Phi-3.5-mini | ~8 GB | Strong reasoning, 128K context |
| Mistral 7B | ~14 GB | Solid general-purpose model |
| Llama-3.1 8B FP8 | ~8 GB | FP8-quantized by NVIDIA; ~50% less VRAM than FP16 |
| Minitron 8B | ~16 GB | Mistral-NeMo 12B distilled to 8B |
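
Outside the notebook UI, the same kind of inference can be run through TensorRT-LLM's LLM Python API. The following is a minimal sketch under stated assumptions: it must run inside the TensorRT-LLM container on a GPU node, the helper names `build_prompts` and `run_inference` and the prompt template are hypothetical, and the notebook's own code may be structured differently.

```python
from typing import List


def build_prompts(questions: List[str]) -> List[str]:
    """Wrap raw questions in a simple template (hypothetical, for illustration)."""
    return [f"Question: {q.strip()}\nAnswer:" for q in questions]


def run_inference(model_id: str, prompts: List[str]) -> List[str]:
    """Generate one completion per prompt with the TensorRT-LLM LLM API."""
    # Imported lazily: tensorrt_llm is only available inside the
    # TensorRT-LLM container and requires an NVIDIA GPU.
    from tensorrt_llm import LLM, SamplingParams

    # Loading a Hugging Face model ID or local checkpoint path prepares a
    # TensorRT engine on first use, which can take several minutes.
    llm = LLM(model=model_id)
    sampling = SamplingParams(temperature=0.8, top_p=0.95)

    outputs = llm.generate(prompts, sampling)
    # Each result carries one or more candidate completions; keep the first.
    return [result.outputs[0].text for result in outputs]
```

Inside the container, calling `run_inference("TinyLlama/TinyLlama-1.1B-Chat-v1.0", build_prompts(["What is FP8 quantization?"]))` would return a list with one generated answer. The model ID here is an assumed Hugging Face checkpoint chosen because it is the smallest model in the table; the notebook may reference a different checkpoint.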
Clean up
Press Ctrl-C to stop the port-forward, then delete the notebook:
Additional resources
- TensorRT-LLM documentation: Quick start, LLM Python API reference, supported models
- FP8 quantization guide: How NVIDIA’s FP8 calibration works and when to use it
- NVIDIA NGC Triton container: The base container image used in this example
- marimo-operator TensorRT example: Dockerfile, notebook, YAML manifest, and README
- marimo notebooks on CKS: General setup guide for the marimo operator and CLI plugin
- NVIDIA blog: Optimizing LLM inference with TensorRT-LLM: Overview of TensorRT-LLM’s optimization techniques