GPU Benchmark Comparison
Compare performance benchmarks between models and hardware
Inference benchmarks across several models measure the performance of different GPU node types, showing which GPU offers the best inference performance (the fastest inference times) for each model. The following comparisons offer insight into how each node type performs depending on the model, to aid in selection decisions.
As there is no "one size fits all" GPU, it is highly recommended that clients benchmark their own workloads in order to determine which GPU node type will be the most efficient for their specific use case.
For an overview of our GPU types, see the GPU Selection Guide.
Benchmark comparisons by model
The benchmark results in this document are reproducible when using an optimized environment. Please note that different versions of PyTorch or the NVIDIA drivers may produce non-trivial discrepancies in benchmark results. For example, using CoreWeave's own ML container images can improve the performance of some workloads by 5-10%, as compared to the default NVIDIA container.
Benchmarks are organized into one table per model, with the best-performing (fastest) node types listed at the top of each table and the slowest at the bottom.
Pythia 12B
Inference parameters used:
- Format: FP16
- Tokens: 1024 new tokens
- Input: Same input
- Number of samples: 25
| GPU | Average time (seconds) | Tokens per second |
|---|---|---|
| A100 PCIe 80GB | 37.97 | 26.97 |
| A100 PCIe 40GB | 40.31 | 25.4 |
| A40 | 57.04 | 17.95 |
In this benchmark, using EleutherAI's Pythia 12B model, the A100 PCIe 80GB produced the fastest inference speeds, while the A40 produced the slowest.
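The harness used for these measurements is not published in this document, but a minimal sketch of an equivalent FP16 generation benchmark is shown below, using Hugging Face `transformers`; the model ID, prompt, and timing loop are illustrative assumptions, not CoreWeave's benchmark code. The same pattern applies to the GPT-NeoX 20B and GPT-J 6B benchmarks that follow, with the model ID and token count adjusted.

```python
# Minimal sketch of an FP16 generation benchmark (illustrative assumptions, not CoreWeave's harness).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "EleutherAI/pythia-12b"  # assumed checkpoint; swap for gpt-neox-20b or gpt-j-6b
NEW_TOKENS = 1024                   # 512 for the GPT-J 6B benchmark
NUM_SAMPLES = 25

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16).to("cuda")
model.eval()

# The same prompt is reused for every sample, mirroring the "Same input" parameter above.
inputs = tokenizer("CoreWeave benchmark prompt", return_tensors="pt").to("cuda")

timings = []
with torch.inference_mode():
    for _ in range(NUM_SAMPLES):
        torch.cuda.synchronize()
        start = time.perf_counter()
        model.generate(
            **inputs,
            max_new_tokens=NEW_TOKENS,
            min_new_tokens=NEW_TOKENS,  # force a fixed number of generated tokens
            do_sample=False,
        )
        torch.cuda.synchronize()
        timings.append(time.perf_counter() - start)

avg_seconds = sum(timings) / len(timings)
print(f"Average time: {avg_seconds:.2f} s, tokens per second: {NEW_TOKENS / avg_seconds:.2f}")
```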
GPT-NeoX 20B
Inference parameters used:
- Format: FP16
- Tokens: 1024 new tokens
- Input: Same input
- Number of samples: 25
| GPU | Average time (seconds) | Tokens per second |
|---|---|---|
| A100 PCIe 80GB | 46.38 | 22.08 |
| A40 | 79.88 | 12.82 |
In this benchmark, using EleutherAI's GPT-NeoX 20B model, the A100 PCIe 80GB produced the fastest inference speeds, while the A40 produced the slowest.
GPT-J 6B
Inference parameters used:
- Format: FP16
- Tokens: 512 new tokens
- Input: Same input
- Number of samples: 25
| GPU | Average time (seconds) | Tokens per second |
|---|---|---|
| A100 PCIe 80GB | 17.50 | 29.26 |
| A100 PCIe 40GB | 18.04 | 28.38 |
| A40 | 21.39 | 23.94 |
In this benchmark, using EleutherAI's GPT-J 6B model, the A100 PCIe 80GB produced the fastest inference speeds, and the A40 produced the slowest.
SDXL 1.0 (Stable Diffusion XL 1.0)
Inference parameters used:
- Format: FP16
- Size: 1024x1024
- Steps: 50
- Number of samples: 10
| GPU | Inference time (seconds) |
|---|---|
| A40 | 4.86 |
| A100 PCIe 40GB | 6.85 |
| A100 PCIe 80GB | 8.0 |
Results
In this benchmark, using the Stable Diffusion XL 1.0 model, the A40 produced the fastest inference speeds, while the A100 PCIe 80GB produced the slowest.
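As a rough sketch, an equivalent FP16 SDXL timing loop could be built with the `diffusers` library; the pipeline checkpoint, prompt, and timing loop below are illustrative assumptions rather than the benchmark code used for the table above.

```python
# Minimal sketch of an SDXL 1.0 FP16 inference timing loop (illustrative assumptions).
import time
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # assumed checkpoint
    torch_dtype=torch.float16,
).to("cuda")

NUM_SAMPLES = 10
timings = []
for _ in range(NUM_SAMPLES):
    torch.cuda.synchronize()
    start = time.perf_counter()
    pipe(
        "a photo of a GPU server rack",  # placeholder prompt
        height=1024,
        width=1024,
        num_inference_steps=50,
    )
    torch.cuda.synchronize()
    timings.append(time.perf_counter() - start)

print(f"Average inference time: {sum(timings) / len(timings):.2f} s")
```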
HGX H100 benchmarks and comparisons
The following benchmarks use Habana's Optimum Habana v1.7 trainer code to measure the performance of the NVIDIA HGX H100 against other GPU SKUs. These H100 benchmarks may be reproduced by following the instructions provided in CoreWeave's fork of the Optimum Habana trainer code.
| Number of GPUs | GPU | Batch size | Samples per second |
|---|---|---|---|
| 8 | H100 80GB SXM | 54 | 962.5 |
| 8 | A100 80GB SXM | 54 | 524.2 |
| 8 | RTX A6000 | 27 | 205.4 |
In this benchmark, using the Stable Diffusion XL 1.0 model, the H100 80GB SXM produced the highest throughput (samples per second), while the RTX A6000 produced the lowest.
Bandwidth and throughput analysis
Using this same benchmark, we found it useful to run the test on 1x, 2x, 3x, 4x, 6x, and 8x accelerator configurations, as doing so reveals how each architecture scales. This kind of comparative measurement highlights the effect of the interconnect, as well as the accelerator's memory bandwidth, on overall performance.
The following table shows the samples per second achieved with each accelerator configuration, along with the scaling factor relative to a single GPU.
| GPU | 1 GPU | 2 GPUs | 3 GPUs | 4 GPUs | 6 GPUs | 8 GPUs |
|---|---|---|---|---|---|---|
| H100 80GB SXM | 142.3 | 275.0 (1.93x) | 400.6 (2.81x) | 521.8 (3.66x) | 740.3 (5.20x) | 962.2 (6.76x) |
| A100 80GB SXM | 73.4 | 143.1 (1.94x) | 211.2 (2.88x) | 276.9 (3.77x) | 399.39 (5.44x) | 524.2 (7.14x) |
| RTX A6000 (PCIe) | 32.5 | 59.3 (1.82x) | 86.8 (2.67x) | 113.9 (3.50x) | 157.05 (4.83x) | 205.4 (6.32x) |
| GPU | Interconnect | Memory bandwidth | FP16 TFLOPS | Power |
|---|---|---|---|---|
| H100 80GB SXM | SXM (900 GB/s) | HBM3 (3.35 TB/s) | 267.6 | 700W |
| A100 80GB SXM | SXM (600 GB/s) | HBM2e (2 TB/s) | 77.96 | 400W |
| RTX A6000 | PCIe (32 GB/s) | GDDR6 (768 GB/s) | 38.71 | 300W |
The scaling factors at 8 GPUs of the A100 80GB SXM (7.14x) and the H100 80GB SXM (6.76x), compared to the RTX A6000 (PCIe), demonstrate the advantage of the SXM interconnect over PCIe, which typically lands at a scaling factor of 5.5x to 6.25x.
This means that at 8 GPUs, PCIe accelerators typically lose about 25% of their maximum theoretical performance, while SXM hosts lose only about 10%.
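The scaling factors and efficiency figures follow directly from the samples-per-second table above; a small sketch of that arithmetic, using the single-GPU and 8-GPU values from the table, is shown below.

```python
# Scaling factor and parallel efficiency computed from the samples-per-second table above.
single_gpu = {"H100 80GB SXM": 142.3, "A100 80GB SXM": 73.4, "RTX A6000 (PCIe)": 32.5}
eight_gpu = {"H100 80GB SXM": 962.2, "A100 80GB SXM": 524.2, "RTX A6000 (PCIe)": 205.4}

for gpu, base in single_gpu.items():
    scaling = eight_gpu[gpu] / base  # e.g. 962.2 / 142.3 = 6.76x for the H100
    efficiency = scaling / 8         # fraction of perfect linear scaling achieved
    print(f"{gpu}: {scaling:.2f}x scaling, {100 * (1 - efficiency):.0f}% lost to scaling overhead")
```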
Conclusion
These benchmarks show there is no "one size fits all" GPU type: inference performance varies widely based on the specifics of the use case. Selecting a GPU is not simply a matter of choosing the most powerful one; the balance of cost and performance depends heavily on the models used, the techniques applied, and the customer's ultimate goal.
Need help?
CoreWeave's expert support team is always happy to discuss your use case and which models may be most appropriate for your goals and desired price points.