GPU Benchmark Comparison
Compare performance benchmarks between models and hardware
Inference benchmarks across several models measure the performance of different GPU node types, showing which GPU offers the best inference performance (the fastest inference times) for each model. The following comparisons offer insight into how each node type performs depending on the model, to aid in selection decisions.
As there is no "one size fits all" GPU, it is highly recommended that clients benchmark their own workloads in order to determine which GPU node type will be the most efficient for their specific use case.
For an overview of our GPU types, see the GPU Selection Guide.
Benchmark comparisons by model
The benchmark results in this document are reproducible when using an optimized environment. Please note that different versions of PyTorch or the NVIDIA drivers may produce non-trivial discrepancies in benchmark results. For example, using CoreWeave's own ML container images can improve the performance of some workloads by 5-10%, as compared to the default NVIDIA container.
Benchmarks are organized into one table per model, with the best-performing (fastest) node types listed at the top of each table and the slowest at the bottom.
Pythia 12B
Inference parameters used:
- Format: FP16
- Tokens: 1024 new tokens
- Input: Same input
- Number of samples: 25
| GPU | Average time (seconds) | Tokens per second |
|---|---|---|
| A100 PCIe 80GB | 37.97 | 26.97 |
| A100 PCIe 40GB | 40.31 | 25.4 |
| A40 | 57.04 | 17.95 |
In this benchmark, using EleutherAI's Pythia 12B model, the A100 PCIe 80GB produced the fastest inference speeds, while the A40 produced the slowest.
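The harness used for these measurements is not published in this document, but a minimal sketch of an equivalent FP16 generation benchmark is shown below, using Hugging Face `transformers`; the model ID, prompt, and timing loop are illustrative assumptions, not CoreWeave's benchmark code. The same pattern applies to the GPT-NeoX 20B and GPT-J 6B benchmarks that follow, with the model ID and token count adjusted.

```python
# Minimal sketch of an FP16 generation benchmark (illustrative assumptions, not CoreWeave's harness).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "EleutherAI/pythia-12b"  # assumed checkpoint; swap for gpt-neox-20b or gpt-j-6b
NEW_TOKENS = 1024                   # 512 for the GPT-J 6B benchmark
NUM_SAMPLES = 25

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16).to("cuda")
model.eval()

# The same prompt is reused for every sample, mirroring the "Same input" parameter above.
inputs = tokenizer("CoreWeave benchmark prompt", return_tensors="pt").to("cuda")

timings = []
with torch.inference_mode():
    for _ in range(NUM_SAMPLES):
        torch.cuda.synchronize()
        start = time.perf_counter()
        model.generate(
            **inputs,
            max_new_tokens=NEW_TOKENS,
            min_new_tokens=NEW_TOKENS,  # force a fixed number of generated tokens
            do_sample=False,
        )
        torch.cuda.synchronize()
        timings.append(time.perf_counter() - start)

avg_seconds = sum(timings) / len(timings)
print(f"Average time: {avg_seconds:.2f} s, tokens per second: {NEW_TOKENS / avg_seconds:.2f}")
```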
GPT-NeoX 20B
Inference parameters used:
- Format: FP16
- Tokens: 1024 new tokens
- Input: Same input
- Number of samples: 25
| GPU | Average time (seconds) | Tokens per second |
|---|---|---|
| A100 PCIe 80GB | 46.38 | 22.08 |
| A40 | 79.88 | 12.82 |
In this benchmark, using EleutherAI's GPT-NeoX 20B model, the A100 PCIe 80GB produced the fastest inference speeds, while the A40 produced the slowest.
GPT-J 6B
Inference parameters used:
- Format: FP16
- Tokens: 512 new tokens
- Input: Same input
- Number of samples: 25
| GPU | Average time (seconds) | Tokens per second |
|---|---|---|
| A100 PCIe 80GB | 17.50 | 29.26 |
| A100 PCIe 40GB | 18.04 | 28.38 |
| A40 | 21.39 | 23.94 |
In this benchmark, using EleutherAI's GPT-J 6B model, the A100 PCIe 80GB produced the fastest inference speeds, and the A40 produced the slowest.
SDXL 1.0 (Stable Diffusion XL 1.0)
Inference parameters used:
- Format: FP16
- Size: 1024x1024
- Steps: 50
- Number of samples: 10
| GPU | Inference time (seconds) |
|---|---|
| A40 | 4.86 |
| A100 PCIe 40GB | 6.85 |
| A100 PCIe 80GB | 8.0 |
Results
In this benchmark, using the Stable Diffusion XL 1.0 model, the A40 produced the fastest inference speeds, while the A100 PCIe 80GB produced the slowest.
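As a rough sketch, an equivalent FP16 SDXL timing loop could be built with the `diffusers` library; the pipeline checkpoint, prompt, and timing loop below are illustrative assumptions rather than the benchmark code used for the table above.

```python
# Minimal sketch of an SDXL 1.0 FP16 inference timing loop (illustrative assumptions).
import time
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # assumed checkpoint
    torch_dtype=torch.float16,
).to("cuda")

NUM_SAMPLES = 10
timings = []
for _ in range(NUM_SAMPLES):
    torch.cuda.synchronize()
    start = time.perf_counter()
    pipe(
        "a photo of a GPU server rack",  # placeholder prompt
        height=1024,
        width=1024,
        num_inference_steps=50,
    )
    torch.cuda.synchronize()
    timings.append(time.perf_counter() - start)

print(f"Average inference time: {sum(timings) / len(timings):.2f} s")
```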
HGX H100 benchmarks and comparisons
The following benchmarks use Habana's Optimum Habana v1.7 trainer code to measure the performance of the NVIDIA HGX H100 against other GPU SKUs. These H100 benchmarks may be reproduced by following the instructions provided in CoreWeave's fork of the Optimum Habana trainer code.
| Number of GPUs | GPU | Batch size | Samples per second |
|---|---|---|---|
| 8 | H100 80GB SXM | 54 | 962.5 |
| 8 | A100 80GB SXM | 54 | 524.2 |
| 8 | RTX A6000 | 27 | 205.4 |
In this benchmark, using the Stable Diffusion XL 1.0 model, the H100 80GB SXM produced the highest throughput (samples per second), while the RTX A6000 produced the lowest.
Bandwidth and throughput analysis
Using this same benchmark, we found it useful to run the test on 1x, 2x, 3x, 4x, 6x, and 8x accelerator configurations, as doing so reveals how each architecture scales. This kind of comparative measurement highlights the effect of the interconnect, as well as the accelerator's memory bandwidth, on overall performance.
The following table shows the samples per second achieved with each accelerator configuration, along with the scaling factor relative to a single GPU.
| GPU | 1 GPU | 2 GPUs | 3 GPUs | 4 GPUs | 6 GPUs | 8 GPUs |
|---|---|---|---|---|---|---|
| H100 80GB SXM | 142.3 | 275.0 (1.93x) | 400.6 (2.81x) | 521.8 (3.66x) | 740.3 (5.20x) | 962.2 (6.76x) |
| A100 80GB SXM | 73.4 | 143.1 (1.94x) | 211.2 (2.88x) | 276.9 (3.77x) | 399.39 (5.44x) | 524.2 (7.14x) |
| RTX A6000 (PCIe) | 32.5 | 59.3 (1.82x) | 86.8 (2.67x) | 113.9 (3.50x) | 157.05 (4.83x) | 205.4 (6.32x) |
| GPU | Interconnect | Memory bandwidth | FP16 TFLOPS | Power |
|---|---|---|---|---|
| H100 80GB SXM | SXM (900 GB/s) | HBM3 (3.35 TB/s) | 267.6 | 700W |
| A100 80GB SXM | SXM (600 GB/s) | HBM2e (2 TB/s) | 77.96 | 400W |
| RTX A6000 | PCIe (32 GB/s) | GDDR6 (768 GB/s) | 38.71 | 300W |
The scaling factors at 8 GPUs of the A100 80GB SXM (7.14x) and the H100 80GB SXM (6.76x), compared to the RTX A6000 (PCIe), demonstrate the advantage of the SXM interconnect over PCIe, which typically lands at a scaling factor of 5.5x to 6.25x.
This means that at 8 GPUs, PCIe accelerators typically lose about 25% of their maximum theoretical performance, while SXM hosts lose only about 10%.
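The scaling factors and efficiency figures follow directly from the samples-per-second table above; a small sketch of that arithmetic, using the single-GPU and 8-GPU values from the table, is shown below.

```python
# Scaling factor and parallel efficiency computed from the samples-per-second table above.
single_gpu = {"H100 80GB SXM": 142.3, "A100 80GB SXM": 73.4, "RTX A6000 (PCIe)": 32.5}
eight_gpu = {"H100 80GB SXM": 962.2, "A100 80GB SXM": 524.2, "RTX A6000 (PCIe)": 205.4}

for gpu, base in single_gpu.items():
    scaling = eight_gpu[gpu] / base  # e.g. 962.2 / 142.3 = 6.76x for the H100
    efficiency = scaling / 8         # fraction of perfect linear scaling achieved
    print(f"{gpu}: {scaling:.2f}x scaling, {100 * (1 - efficiency):.0f}% lost to scaling overhead")
```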
Conclusion
These benchmarks show there is no "one size fits all" GPU type: inference performance varies widely based on the specifics of the use case. Selecting a GPU is not simply a matter of choosing the most powerful one; the balance of cost and performance depends heavily on the models used, the techniques applied, and the customer's ultimate goal.
Need help?
CoreWeave's expert support team is always happy to discuss your use case and which models may be most appropriate for your goals and desired price points.