# Select an instance: key considerations for LLMs

> Learn about the key considerations for selecting the right instance for your LLM workload

The primary technical constraint for running any large language model (LLM) is the available GPU memory. For a model to operate efficiently, all of its components, including its parameters, the data being processed, and intermediate calculation states, must fit into the memory of one or more GPUs. Understanding how these components consume memory is crucial for selecting the right hardware for your task, whether it's training, fine-tuning, or inference.

## Model parameters (weights)

At its core, an LLM consists of billions of numerical values, known as parameters or weights. These parameters are the learned knowledge of the model. The total memory required to store them is a direct function of the number of parameters and the numerical precision used.

## Precision and quantization

Precision refers to the data type used to store the model's parameters. Using lower precision reduces the memory footprint and can increase computation speed on supported hardware.

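* **FP16 (half precision):** The standard for model training and fine-tuning. It provides a good balance between numerical accuracy and memory usage. BF16, the 16-bit format used for the training column in the comparison table below, has the same 2-byte footprint per parameter.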
* **Quantization (INT8 and INT4):** This technique, primarily used for inference, converts the model's FP16 weights into lower-precision 8-bit or 4-bit integers. This process compresses the model, letting you run a larger model on the same hardware or the same model on less expensive hardware. The trade-off is a minor, and often imperceptible, reduction in accuracy, which is generally acceptable for production inference workloads.

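Use the following formulas to estimate the GPU memory, in gigabytes (GB), required to load the model's weights. Each multiplier is the number of bytes per parameter:

* **For FP16:** Parameter count (in billions) × 2 = memory in GB.
* **For INT8:** Parameter count (in billions) × 1 = memory in GB.
* **For INT4:** Parameter count (in billions) × 0.5 = memory in GB.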
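
The formulas above can be sketched as a small helper. This is a minimal illustration, not an official sizing tool: the function name `weight_memory_gb` and the 70B example size are assumptions chosen for the example, and the result covers the weights only.

```python
# Bytes per parameter for each precision discussed above.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Estimate the GPU memory (GB) needed just to hold the weights.

    Billions of parameters times bytes per parameter gives gigabytes.
    """
    return params_billions * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    # Hypothetical 70-billion-parameter model at each precision.
    for precision in ("fp16", "int8", "int4"):
        print(f"70B @ {precision}: {weight_memory_gb(70, precision):.0f} GB")
```

Treat the result as a lower bound: the KV cache, activations, and framework overhead all need memory on top of the weights.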

## The impact of the KV cache (for inference)

During inference, the key-value (KV) cache consumes a sizable portion of GPU memory. To generate new text, an LLM must attend to all preceding tokens. The KV cache stores the key and value tensors computed for those tokens so they don't have to be recomputed for every new token generated. While this speeds up inference, the cache's memory footprint is dynamic: it scales linearly with both the sequence length (the input prompt plus the generated text, measured in tokens) and the batch size (the number of concurrent requests the system processes). Managing the KV cache is a critical aspect of optimizing for high-throughput inference with long context windows.
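
The linear scaling can be made concrete with a rough estimate. The sketch below assumes a standard transformer that caches one key and one value vector per layer, per KV head, per token. The function name and the example configuration (80 layers, 8 KV heads, head dimension 128, 16-bit cache values) are illustrative assumptions, not the published specification of any particular model.

```python
def kv_cache_gb(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_value: int = 2,  # assumption: cache stored in FP16/BF16
) -> float:
    """Rough KV cache size in GB (1 GB = 1e9 bytes)."""
    # Factor of 2: one key tensor and one value tensor per layer.
    values_per_token = 2 * num_layers * num_kv_heads * head_dim
    total_bytes = values_per_token * seq_len * batch_size * bytes_per_value
    return total_bytes / 1e9


if __name__ == "__main__":
    # Hypothetical 70B-class configuration with grouped-query attention.
    single = kv_cache_gb(80, 8, 128, seq_len=8192, batch_size=1)
    batched = kv_cache_gb(80, 8, 128, seq_len=8192, batch_size=16)
    print(f"1 request:   {single:.2f} GB")
    print(f"16 requests: {batched:.2f} GB")
```

Doubling either `seq_len` or `batch_size` doubles the estimate, which is why long contexts combined with high concurrency can rival the weights themselves in memory use.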

## GPU memory for training

Optimizer states and gradients make training or fine-tuning a model far more memory-intensive than inference. In addition to the model weights and data batches, the GPU must store:

* **Optimizer states:** Most modern optimizers, like Adam or AdamW, maintain momentum and variance states for each parameter to stabilize and accelerate training. Because that is two extra values per parameter, these states can consume twice as much memory as the model's parameters themselves.
* **Gradients:** The calculated directions for updating each model weight. The memory required is typically equal to the size of the model parameters in FP16.

As a result, fine-tuning a model can require 3x to 4x as much GPU memory as running the same model for inference at the same precision.
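
The accounting above can be sketched as follows. The defaults mirror the description in this section (FP16 weights and gradients at 2 bytes each, optimizer states at twice the weight size); the function name and the 7B example are assumptions for illustration. Activations and data batches are deliberately left out, so the result is a floor rather than a full budget.

```python
def training_memory_gb(
    params_billions: float,
    weight_bytes: float = 2.0,     # FP16 weights
    grad_bytes: float = 2.0,       # gradients, same size as FP16 weights
    optimizer_bytes: float = 4.0,  # Adam momentum + variance, 2x the weights
) -> float:
    """Estimate GB for weights, gradients, and optimizer states only."""
    return params_billions * (weight_bytes + grad_bytes + optimizer_bytes)


if __name__ == "__main__":
    params = 7  # hypothetical 7-billion-parameter model
    inference_gb = params * 2.0  # FP16 weights only
    train_gb = training_memory_gb(params)
    print(f"Inference weights: {inference_gb:.0f} GB")
    print(f"Training estimate: {train_gb:.0f} GB ({train_gb / inference_gb:.0f}x)")
```

Mixed-precision setups that keep FP32 master weights and FP32 Adam states are commonly budgeted at about 16 bytes per parameter; passing `optimizer_bytes=12` models that case.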

## The role of NVLink and InfiniBand

Models that exceed a single GPU's memory must be split across multiple GPUs, a practice known as model parallelism, which requires high-speed interconnects to link the GPUs together. The performance of this approach depends on a two-level fabric:

* **The GPU fabric (NVLink):** A high-speed, direct link between GPUs. In standard servers like our B200 or H100 instances, NVLink connects the 8 GPUs within the server chassis. Our Grace Blackwell NVL72 instances (`gb200-4x`) extend this design by using an NVLink Switch fabric to span the memory domain across all 72 GPUs in an entire rack, so the rack behaves like a single, very large GPU.
* **The system fabric (InfiniBand):** The high-performance network that scales your workload further. For standard servers, InfiniBand connects multiple Nodes together. For Grace Blackwell NVL72 systems, its role expands to connecting multiple racks, which lets you build large AI supercomputers from these rack-scale units.
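
To see when a workload stays inside one NVLink domain and when it has to cross the system fabric, a back-of-the-envelope calculation helps. The sketch below is only an illustration: `plan_gpus` is a hypothetical helper, and the per-GPU memory figures are derived by dividing the totals in the instance comparison table by the GPU count (640 GB / 8 and roughly 13.8 TB / 72).

```python
import math


def plan_gpus(required_gb: float, gpu_memory_gb: float, nvlink_domain_size: int):
    """Return (GPUs needed, NVLink domains spanned) for a memory requirement.

    More than one domain means traffic must also cross the system fabric
    (InfiniBand) in addition to NVLink.
    """
    gpus = math.ceil(required_gb / gpu_memory_gb)
    domains = math.ceil(gpus / nvlink_domain_size)
    return gpus, domains


if __name__ == "__main__":
    required = 405 * 2  # hypothetical 405B model in FP16: about 810 GB of weights
    print("8-GPU server, 80 GB per GPU:", plan_gpus(required, 80, 8))
    print("NVL72 rack, ~192 GB per GPU:", plan_gpus(required, 192, 72))
```

These counts are lower bounds based on weights alone. Real deployments usually round up to a parallelism degree their framework supports and leave headroom for the KV cache or training state.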

## Instance comparison

The following table summarizes the available GPU instances, their total memory, and the model sizes each configuration can support, to help you match an instance to your workload.

| GPU                                                             | System Config   | Total GPU Memory | Max Inference Model Size (FP8)<sup>\*</sup> | Max Training Model Size (BF16) | Examples of Models That Fit                                                                                                                              |
| --------------------------------------------------------------- | --------------- | ---------------- | ------------------------------------------- | ------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [GB300](/platform/instances/gpu/gb300-4x)                       | NVL72 (72 GPUs) | 20.7 TB          | \~10.3 Trillion                             | \~1.3 Trillion                 | **Inference:** Next-generation, multi-trillion parameter models at high fidelity<br />**Training:** Foundation models in the 1T+ parameter range         |
| [GB300](/platform/instances/gpu/gb300-4x)                       | NVL64 (64 GPUs) | 18.4 TB          | \~9.2 Trillion                              | \~1.15 Trillion                | **Inference:** Very large frontier models at high fidelity<br />**Training:** Efficiently training 1T parameter models from scratch                      |
| [GB200](/platform/instances/gpu/gb200-4x)                       | NVL72 (72 GPUs) | 13.8 TB          | \~13.8 Trillion                             | \~860 Billion                  | **Inference:** GPT-5, Claude Opus, Gemini Ultra, other large proprietary models<br />**Training:** Nemotron-4 340B, Llama 3.1 405B (with plenty of room) |
| [GB200](/platform/instances/gpu/gb200-4x)                       | NVL64 (64 GPUs) | 12.3 TB          | \~12.3 Trillion                             | \~768 Billion                  | **Inference:** GPT-5, Claude Opus, Gemini Ultra, other large proprietary models<br />**Training:** Llama 3.1 405B                                        |
| [B200](/platform/instances/gpu/b200-8x)                         | 8-GPU system    | 1.54 TB          | \~1.54 Trillion                             | \~96 Billion                   | **Inference:** Llama 3.1 405B, Nemotron-4 340B, Gemini Pro<br />**Training:** Llama 3 70B, Mixtral 8x7B                                                  |
| [H200](/platform/instances/gpu/gd-8xh200ib-i128)                | 8-GPU system    | 1.13 TB          | \~1.13 Trillion                             | \~70 Billion                   | **Inference:** Falcon-180B, models up to \~1T<br />**Training:** Llama 3 70B, other models up to \~70B from scratch                                      |
| [H100](/platform/instances/gpu/gd-8xh100ib-i128)                | 8-GPU system    | 640 GB           | \~640 Billion                               | \~40 Billion                   | **Inference:** Llama 3.1 405B, Nemotron-4 340B<br />**Training:** Qwen2 32B, fine-tuning large models                                                    |
| [RTX Pro 6000 (Blackwell)](/platform/instances/gpu/rtxp6000-8x) | 1 GPU           | 96 GB            | \~86 Billion                                | \~6 Billion                    | **Inference:** Llama 3 70B, Mixtral 8x7B<br />**Training:** Models up to \~6B from scratch, fine-tuning larger models                                    |
| [L40S](/platform/instances/gpu/gd-8xl40s-i128)                  | 1 GPU           | 48 GB            | \~43 Billion                                | \~3 Billion                    | **Inference:** Qwen2 32B, LLaVA 34B<br />**Training:** Small custom models (\~2B), efficient fine-tuning                                                 |
| [L40](/platform/instances/gpu/gd-8xl40-i128)                    | 1 GPU           | 48 GB            | \~43 Billion                                | \~3 Billion                    | **Inference:** Qwen2 32B, LLaVA 34B<br />**Training:** Small custom models (\~2B), fine-tuning                                                           |
| [GH200](/platform/instances/gpu/gd-1xgh200)                     | 1 GPU           | 576 GB           | \~576 Billion                               | \~6 Billion                    | **Inference:** Llama 3.1 405B, Nemotron-4 340B<br />**Training:** Small custom models (\~6B)                                                             |
| [A100](/platform/instances/gpu/gd-8xa100-i128)                  | 8-GPU system    | 640 GB           | \~640 Billion                               | \~40 Billion                   | **Inference:** Llama 3.1 405B, Falcon-180B<br />**Training:** Qwen2 32B, fine-tuning large models                                                        |

<sup>\*</sup>Figures represent theoretical estimates for a single model. A typical production scenario often involves running multiple, smaller models concurrently.
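
As a rough way to combine a memory estimate with the table, the sketch below picks the smallest configuration whose total GPU memory covers a requirement. It includes only a subset of rows, with totals copied from the table; `smallest_fit` is a hypothetical helper. The 16 bytes per parameter used for the training example is an assumption that appears consistent with the ratio between the table's memory and training columns.

```python
# Total GPU memory in GB for a subset of the configurations in the table.
INSTANCES_GB = {
    "H100 (8-GPU system)": 640,
    "H200 (8-GPU system)": 1130,
    "B200 (8-GPU system)": 1540,
    "GB200 NVL72": 13800,
}


def smallest_fit(required_gb: float):
    """Return the smallest listed configuration that fits, or None."""
    candidates = [
        (memory_gb, name)
        for name, memory_gb in INSTANCES_GB.items()
        if memory_gb >= required_gb
    ]
    return min(candidates)[1] if candidates else None


if __name__ == "__main__":
    # 405B parameters at 1 byte each (8-bit inference): about 405 GB.
    print("405B inference:", smallest_fit(405 * 1))
    # 70B parameters at an assumed 16 bytes each for training: about 1,120 GB.
    print("70B training:  ", smallest_fit(70 * 16))
```

Like the table, this considers capacity only; throughput, interconnect topology, and cost should also inform the final choice.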

<Info>
  Learn more about our [GPU instances](/platform/instances/gpu-instances), [CPU instances](/platform/instances/cpu-instances), and [NVL72-powered instances](/platform/instances/nvl72).
</Info>