TensorRT-LLM is NVIDIA’s open-source library for optimizing and running large language model inference on NVIDIA GPUs. It fuses operations, applies quantization, and compiles models into optimized TensorRT engines, which gives higher throughput and lower latency than standard PyTorch inference.

CKS makes it straightforward to run TensorRT-LLM workloads: pull the official NVIDIA Triton NGC container, add whatever tooling your workflow needs, and deploy to a GPU node. This tutorial demonstrates that pattern with a marimo notebook example (an interactive notebook with a live model picker and prompt selector), but the same container approach works for any TensorRT-LLM workload.

In this tutorial, you will:
  1. Download the TensorRT-LLM example from the marimo-operator repository
  2. Deploy it to CKS with a single CLI command or a YAML manifest
  3. Run inference against open-weight models including FP8-quantized checkpoints

What you'll need

Before you start, you must have:
  • A CKS cluster with an NVIDIA GPU node (≥ 24 GB VRAM recommended; 48 GB for larger models)
  • The marimo operator installed on your cluster
  • kubectl installed and configured to access your cluster
  • kubectl-marimo installed (uv tool install kubectl-marimo)
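
To confirm the first prerequisite before deploying, you can check which nodes advertise GPUs. The following is a minimal sketch, assuming the cluster exposes GPUs through the standard `nvidia.com/gpu` resource name and that kubectl is already configured; the helper names are illustrative and not part of any CKS tooling. It reports GPU counts only, not VRAM per GPU.

```python
import json
import subprocess

# Resource name assumed here: the one exposed by the NVIDIA device plugin.
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_nodes(node_list: dict) -> dict[str, int]:
    """Map node name -> allocatable GPU count for nodes that expose GPUs."""
    found = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        count = int(allocatable.get(GPU_RESOURCE, "0"))
        if count > 0:
            found[node["metadata"]["name"]] = count
    return found


def fetch_node_list() -> dict:
    """Ask kubectl for all nodes as JSON (needs a working kubeconfig)."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling `gpu_nodes(fetch_node_list())` against a live cluster returns a dictionary of GPU node names and counts; an empty dictionary suggests no GPU node is currently available.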

What you'll use

You’ll use these tools and services:
Container image: The example uses a purpose-built image layered on top of nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3. It is published at ghcr.io/marimo-team/marimo-operator/tensorrt:latest and rebuilt automatically on every push to main. Adapting it for your own workload is as simple as swapping the FROM image or adding your own packages.

Get the example

The marimo-operator TensorRT example includes both a notebook file and a plain YAML manifest. Use whichever fits your workflow.

Option A: CLI plugin (interactive)

Download and deploy the notebook in one step:
curl -O https://raw.githubusercontent.com/marimo-team/marimo-operator/main/examples/tensorrt/tensorrt.py
kubectl marimo edit tensorrt.py --namespace NAMESPACE
The plugin reads the Kubernetes config embedded in the notebook’s PEP 723 header (image, GPU limits, storage size) and generates the manifest for you. Once the notebook is ready, the plugin prints output similar to:
Waiting for tensorrt to be ready...
Opening http://localhost:2718?access_token=<TOKEN>
Press Ctrl+C to stop port-forward and sync changes

Option B: YAML manifest (declarative)

Apply the manifest directly if you prefer managing resources with kubectl:
kubectl apply -f https://raw.githubusercontent.com/marimo-team/marimo-operator/main/examples/tensorrt/tensorrt.yaml --namespace NAMESPACE
To adjust the nodeSelector, storage size, or GPU count, download the manifest first, edit your local copy, and apply that file instead of the URL.

Select a model and run inference

The notebook opens with a model picker. The included models all fit comfortably within 48 GB of VRAM:
Model               VRAM (approx.)   Notes
TinyLlama 1.1B      ~3 GB            Fastest to load; good for testing
Phi-3.5-mini        ~8 GB            Strong reasoning, 128K context
Mistral 7B          ~14 GB           Solid general-purpose model
Llama-3.1 8B FP8    ~8 GB            FP8-quantized by NVIDIA; ~50% less VRAM than FP16
Minitron 8B         ~16 GB           Mistral-NeMo 12B distilled to 8B
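
The VRAM figures above are roughly what you get from a weights-only estimate: parameter count times bytes per parameter. The sketch below shows that arithmetic. It assumes nominal parameter counts (for example, 3.8B for Phi-3.5-mini), 2 bytes per parameter for FP16 and 1 byte for FP8, and it ignores KV cache, activations, and runtime overhead, so actual usage will be higher.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[dtype]


# Nominal parameter counts (in billions); these are assumptions for illustration.
MODELS = [
    ("TinyLlama 1.1B", 1.1, "fp16"),
    ("Phi-3.5-mini", 3.8, "fp16"),
    ("Mistral 7B", 7.0, "fp16"),
    ("Llama-3.1 8B FP8", 8.0, "fp8"),
    ("Minitron 8B", 8.0, "fp16"),
]

for name, billions, dtype in MODELS:
    print(f"{name:<18} ~{weight_gb(billions, dtype):.1f} GB weights ({dtype})")
```

The same 8B model needs about 16 GB of weights in FP16 and about 8 GB in FP8, which is where the "~50% less VRAM" note in the table comes from.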
Select a model, then pick a prompt from the second dropdown. The generation cell is reactive: changing the prompt re-runs inference without reloading the model.
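
The reason a prompt change does not reload the model is that loading and generation are separate steps, and only the step whose inputs changed runs again. The sketch below is a plain-Python analogy using a one-entry cache, not the notebook's actual code and not marimo's API; `load_model` and `generate` are hypothetical stand-ins that return placeholder values instead of running TensorRT-LLM.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def load_model(name: str) -> dict:
    """Stand-in for an expensive model load; cached per model name."""
    return {"name": name}


def generate(model_name: str, prompt: str) -> str:
    """Stand-in for inference; reuses the cached model when the name is unchanged."""
    model = load_model(model_name)
    return f"[{model['name']}] response to: {prompt}"


generate("tinyllama", "Explain FP8 quantization.")
generate("tinyllama", "Write a haiku about GPUs.")  # same model: no reload
print(load_model.cache_info().misses)  # number of actual loads so far

generate("mistral-7b", "Write a haiku about GPUs.")  # new model: one more load
print(load_model.cache_info().misses)
```

In the notebook, marimo's dependency tracking between cells plays the role of the cache here: the model-loading cell depends on the model picker, while the generation cell depends on both the loaded model and the prompt dropdown.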

Clean up

Press Ctrl+C to stop the port-forward, then delete the notebook:
kubectl marimo delete tensorrt --namespace NAMESPACE
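Or, if you used the YAML manifest, pass the same file or URL you applied (the command below assumes a local copy named tensorrt.yaml):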
kubectl delete -f tensorrt.yaml --namespace NAMESPACE

Last modified on April 20, 2026