GPU Benchmark Comparison

Compare performance benchmarks between models and hardware

Inference benchmarks using various models are used to measure the performance of different GPU node types, in order to determine which GPU offers the best inference performance (the fastest inference times) for each model. The following benchmark comparisons offer some insight into the inference performance of each node type, depending on the model, to aid in selection decisions.

Note

As there is no "one size fits all" GPU, it is highly recommended that clients benchmark their own workloads in order to determine which GPU node type will be the most efficient for their specific use case.

For an overview of our GPU types, see the GPU Selection Guide.

Benchmark comparisons by model

Important

The benchmark results in this document are reproducible when using an optimized environment. Please note that different versions of PyTorch or NVIDIA drivers may produce non-trivial discrepancies in benchmark results. Using CoreWeave's own ML container images, for example, can improve the performance of some workloads by 5-10% compared to the default NVIDIA container.

Benchmarks are organized in the following tables per model, with the best-performing (fastest) node types listed at the top of each table and the slowest listed at the bottom.

EleutherAI Pythia 12B

Inference parameters used:

  • Format: FP16

  • Tokens: 1024 new tokens

  • Input: Same input

  • Number of samples: 25

GPU            | Average time (seconds) | Tokens per second
A100 PCIe 80GB | 37.97                  | 26.97
A100 PCIe 40GB | 40.31                  | 25.4
A40            | 57.04                  | 17.95

In this benchmark, using EleutherAI's Pythia 12B model, the A100 PCIe 80GB produced the fastest inference speeds, while the A40 produced the slowest.
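These timings come from CoreWeave's internal benchmark harness. As a rough, minimal sketch of how an FP16 text-generation benchmark with the parameters above could be approximated using Hugging Face Transformers (the prompt and generation settings shown are illustrative assumptions, not the exact harness used here):

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-12b"  # swap for the model under test
NUM_SAMPLES = 25                 # number of timed runs
MAX_NEW_TOKENS = 1024            # new tokens generated per run

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).to("cuda")

# The same input is reused for every run, as in the benchmarks above.
inputs = tokenizer("Placeholder benchmark prompt", return_tensors="pt").to("cuda")

times = []
for _ in range(NUM_SAMPLES):
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(
        **inputs,
        max_new_tokens=MAX_NEW_TOKENS,
        min_new_tokens=MAX_NEW_TOKENS,  # force exactly 1024 new tokens
        do_sample=False,
    )
    torch.cuda.synchronize()
    times.append(time.perf_counter() - start)

average = sum(times) / len(times)
print(f"Average time (seconds): {average:.2f}")
print(f"Tokens per second: {MAX_NEW_TOKENS / average:.2f}")
```

The same approach applies to the GPT-NeoX 20B and GPT-J 6B benchmarks below, substituting the model name and token count.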

EleutherAI GPT-NeoX 20B

Inference parameters used:

  • Format: FP16

  • Tokens: 1024 new tokens

  • Input: Same input

  • Number of samples: 25

GPU            | Average time (seconds) | Tokens per second
A100 PCIe 80GB | 46.38                  | 22.08
A40            | 79.88                  | 12.82

In this benchmark, using EleutherAI's GPT-NeoX 20B model, the A100 PCIe 80GB produced the fastest inference speeds, while the A40 produced the slowest.

EleutherAI GPT-J 6B

Inference parameters used:

  • Format: FP16

  • Tokens: 512 new tokens

  • Input: Same input

  • Number of samples: 25

GPU            | Average time (seconds) | Tokens per second
A100 PCIe 80GB | 17.50                  | 29.26
A100 PCIe 40GB | 18.04                  | 28.38
A40            | 21.39                  | 23.94

In this benchmark, using EleutherAI's GPT-J 6B model, the A100 PCIe 80GB produced the fastest inference speeds, and the A40 produced the slowest.

Stable Diffusion XL 1.0

Inference parameters used:

  • Format: FP16

  • Size: 1024x1024

  • Steps: 50

  • Sample average: 10

GPU            | Inference time (seconds)
A40            | 4.86
A100 PCIe 40GB | 6.85
A100 PCIe 80GB | 8.0

Results

In this benchmark, using the Stable Diffusion XL 1.0 model, the A40 produced the fastest inference speeds, and the A100 PCIe 80GB produced the slowest.
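As a rough, minimal sketch of how an image-generation timing like this could be reproduced with the diffusers library (the checkpoint is the public SDXL 1.0 base weights and the prompt is an illustrative placeholder, not the exact harness used here):

```python
import time

import torch
from diffusers import StableDiffusionXLPipeline

# Load the public SDXL 1.0 base checkpoint in FP16.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

NUM_SAMPLES = 10  # sample average: 10

times = []
for _ in range(NUM_SAMPLES):
    torch.cuda.synchronize()
    start = time.perf_counter()
    pipe(
        "placeholder benchmark prompt",
        height=1024,
        width=1024,
        num_inference_steps=50,
    )
    torch.cuda.synchronize()
    times.append(time.perf_counter() - start)

print(f"Inference time (seconds): {sum(times) / len(times):.2f}")
```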

HGX H100 benchmarks and comparisons

Tip

Check out CoreWeave's fork of the Optimum Habana trainer code. These H100 benchmarks may be reproduced by following the provided instructions in that repository.

The following benchmarks use Habana's Optimum Habana v1.7 trainer code to measure the performance of the NVIDIA HGX H100 against other GPU SKUs.

Number of GPUs | GPU           | Batch size | Samples per second
8              | H100 80GB SXM | 54         | 962.5
8              | A100 80GB SXM | 54         | 524.2
8              | RTX A6000     | 27         | 205.4

In this benchmark, using the Stable Diffusion XL 1.0 model, the H100 80GB SXM achieved the highest samples per second, while the RTX A6000 achieved the lowest.
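The trainer reports throughput directly, but as a quick sketch of how a samples-per-second figure like those above is derived from any training run (the step count and elapsed time here are illustrative assumptions, chosen to match the table):

```python
def samples_per_second(per_gpu_batch_size: int, num_gpus: int,
                       steps: int, elapsed_seconds: float) -> float:
    """Throughput = total samples processed / wall-clock time."""
    return per_gpu_batch_size * num_gpus * steps / elapsed_seconds


# Sanity check against the table above: 962.5 samples/s at batch size 54
# on 8 GPUs implies roughly 54 * 8 / 962.5 ≈ 0.45 seconds per global step.
print(samples_per_second(per_gpu_batch_size=54, num_gpus=8,
                         steps=100, elapsed_seconds=100 * 0.449))
```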

Bandwidth and throughput analysis

Using this same benchmark, we found it useful to perform the same test on 1x, 2x, 3x, 4x, 6x, and 8x accelerator configurations, as doing so reveals how well each architecture scales. This type of comparative measurement can highlight the effect of the interconnect - as well as the accelerator's memory bandwidth - on overall performance.

The following table displays the samples per second achieved with each accelerator configuration; the scaling factor relative to a single GPU is shown in parentheses.

GPU              | 1 GPU | 2 GPUs        | 3 GPUs        | 4 GPUs        | 6 GPUs         | 8 GPUs
H100 80GB SXM    | 142.3 | 275.0 (1.93x) | 400.6 (2.81x) | 521.8 (3.66x) | 740.3 (5.20x)  | 962.2 (6.76x)
A100 80GB SXM    | 73.4  | 143.1 (1.94x) | 211.2 (2.88x) | 276.9 (3.77x) | 399.39 (5.44x) | 524.2 (7.14x)
RTX A6000 (PCIe) | 32.5  | 59.3 (1.82x)  | 86.8 (2.67x)  | 113.9 (3.50x) | 157.05 (4.83x) | 205.4 (6.32x)

GPU           | Interconnect   | Memory bandwidth | TFLOPS (FP16) | Power
H100 80GB SXM | SXM (900 GB/s) | HBM3 (3.35 TB/s) | 267.6         | 700W
A100 80GB SXM | SXM (600 GB/s) | HBM2e (2 TB/s)   | 77.96         | 400W
RTX A6000     | PCIe (32 GB/s) | GDDR6 (768 GB/s) | 38.71         | 300W

The maximum scaling factor of the A100 80GB SXM (7.14x at 8 GPUs) and the H100 80GB SXM (6.76x at 8 GPUs) demonstrates the advantage of the SXM interconnect over PCIe: PCIe-attached accelerators such as the RTX A6000 typically land at a scaling factor of 5.5x-6.25x.

This means that at 8 GPUs, PCIe accelerators lose about 25% of their maximum theoretical (linear-scaling) performance, while SXM hosts lose only about 10%.
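As a quick worked example, the scaling efficiency behind these approximate percentages can be computed directly from the samples-per-second table above:

```python
# Samples per second from the table above, at 1 GPU and 8 GPUs.
single_gpu = {"H100 80GB SXM": 142.3, "A100 80GB SXM": 73.4, "RTX A6000 (PCIe)": 32.5}
eight_gpu = {"H100 80GB SXM": 962.2, "A100 80GB SXM": 524.2, "RTX A6000 (PCIe)": 205.4}

for gpu, sps in eight_gpu.items():
    factor = sps / single_gpu[gpu]  # scaling factor vs. a single GPU
    efficiency = factor / 8         # fraction of perfect linear scaling
    print(f"{gpu}: {factor:.2f}x scaling, "
          f"{(1 - efficiency) * 100:.0f}% below linear at 8 GPUs")
```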

Conclusion

These benchmarks show there is no "one size fits all" GPU type - inference performance varies widely based on the specifics of the use case. Selecting a GPU is not simply a matter of choosing the most powerful option - weighing cost against performance depends heavily on the models used, the techniques applied, and the customer's ultimate goal.

Need help?

CoreWeave's expert support team is always happy to discuss your use case, and which models may be most appropriate for your goals and desired price points.
