GPU Benchmark Comparison

Compare performance benchmarks between models and hardware

Inference benchmarks with various models are used to measure the performance of different GPU node types and to compare which GPU offers the best inference performance (the fastest inference times) for each model. The following benchmark comparisons offer insight into each node type's inference performance, depending on the model, to aid in selection decisions.

Note

As there is no "one size fits all" GPU, it is highly recommended that clients benchmark their own workloads in order to determine which GPU node type will be the most efficient for their specific use case.

For an overview of our GPU types, see the GPU Selection Guide.

Benchmark comparisons by model

Important

The benchmark results in this document are reproducible when using an optimized environment. Note that different versions of PyTorch or NVIDIA drivers may produce non-trivial discrepancies in benchmark results. Using CoreWeave's own ML container images, for example, can improve the performance of some workloads by 5-10% compared to the default NVIDIA container.

Benchmarks are organized into the following tables, one per model, with the best-performing (fastest) node types listed at the top of each table and the slowest listed at the bottom.

Pythia 12B

Inference parameters used:

  • Format: FP16
  • Tokens: 1024 new tokens
  • Input: Same input
  • Number of samples: 25
GPU            | Average time (seconds) | Tokens per second
A100 PCIe 80GB | 37.97                  | 26.97
A100 PCIe 40GB | 40.31                  | 25.4
A40            | 57.04                  | 17.95

In this benchmark, using EleutherAI's Pythia 12B model, the A100 PCIe 80GB produced the fastest inference speeds, while the A40 produced the slowest.
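The LLM benchmarks in this section (Pythia 12B, GPT-NeoX 20B, and GPT-J 6B) all follow the same pattern: generate a fixed number of new tokens from an identical input in FP16 and average the wall-clock time over the samples. A minimal sketch of such a timing loop, using Hugging Face Transformers, is shown below; the model ID, prompt, and harness details are illustrative assumptions, not CoreWeave's exact benchmark code.

```python
# Minimal timing-loop sketch for the FP16 generation benchmarks above.
# The model ID, prompt, and generation settings are illustrative; this is
# not CoreWeave's exact benchmark harness.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "EleutherAI/pythia-12b"  # e.g. gpt-neox-20b or gpt-j-6b for the other tables
NEW_TOKENS = 1024                   # 512 for the GPT-J 6B table
NUM_SAMPLES = 25

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16).to("cuda")

# The same input is reused for every sample.
inputs = tokenizer("The same input prompt is reused for every sample.", return_tensors="pt").to("cuda")

times = []
for _ in range(NUM_SAMPLES):
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=NEW_TOKENS, min_new_tokens=NEW_TOKENS, do_sample=False)
    torch.cuda.synchronize()
    times.append(time.perf_counter() - start)

avg = sum(times) / len(times)
print(f"average time: {avg:.2f} s, tokens per second: {NEW_TOKENS / avg:.2f}")
```

Tokens per second here is simply the number of new tokens divided by the average generation time, which is how the table values relate (for example, 1024 / 37.97 ≈ 26.97).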

GPT-NeoX 20B

Inference parameters used:

  • Format: FP16
  • Tokens: 1024 new tokens
  • Input: Same input
  • Number of samples: 25
GPU            | Average time (seconds) | Tokens per second
A100 PCIe 80GB | 46.38                  | 22.08
A40            | 79.88                  | 12.82

In this benchmark, using EleutherAI's GPT-NeoX 20B model, the A100 PCIe 80GB produced the fastest inference speeds, while the A40 produced the slowest.

GPT-J 6B

Inference parameters used:

  • Format: FP16
  • Tokens: 512 new tokens
  • Input: Same input
  • Number of samples: 25
GPU            | Average time (seconds) | Tokens per second
A100 PCIe 80GB | 17.50                  | 29.26
A100 PCIe 40GB | 18.04                  | 28.38
A40            | 21.39                  | 23.94

In this benchmark, using EleutherAI's GPT-J 6B model, the A100 PCIe 80GB produced the fastest inference speeds, and the A40 produced the slowest.

SDXL 1.0 (Stable Diffusion XL 1.0)

Inference parameters used:

  • Format: FP16
  • Size: 1024x1024
  • Steps: 50
  • Number of samples: 10
GPU            | Inference time (seconds)
A40            | 4.86
A100 PCIe 40GB | 6.85
A100 PCIe 80GB | 8.0

Results

In this benchmark, using the Stable Diffusion XL 1.0 model, the A40 produced the fastest inference speeds, and the A100 PCIe 80GB produced the slowest.
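A comparable sketch for the SDXL timing, using the Hugging Face Diffusers pipeline, is shown below; the prompt and pipeline setup are illustrative assumptions rather than CoreWeave's exact harness.

```python
# Minimal SDXL 1.0 timing sketch: FP16, 1024x1024, 50 steps, averaged over 10 runs.
# The prompt and pipeline setup are illustrative, not CoreWeave's exact harness.
import time
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

NUM_SAMPLES = 10
times = []
for _ in range(NUM_SAMPLES):
    torch.cuda.synchronize()
    start = time.perf_counter()
    pipe(prompt="a photograph of a GPU server rack", height=1024, width=1024, num_inference_steps=50)
    torch.cuda.synchronize()
    times.append(time.perf_counter() - start)

print(f"average inference time: {sum(times) / len(times):.2f} s")
```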

HGX H100 benchmarks and comparisons

Tip

Check out CoreWeave's fork of the Optimum Habana trainer code. These H100 benchmarks may be reproduced by following the provided instructions in that repository.

The following benchmarks use the Optimum Habana v1.7 trainer code to measure the performance of the NVIDIA HGX H100 against other GPU SKUs.

Number of GPUs | Model         | Batch size | Samples per second
8              | H100 80GB SXM | 54         | 962.5
8              | A100 80GB SXM | 54         | 524.2
8              | RTX A6000     | 27         | 205.4

In this benchmark, using the Stable Diffusion XL 1.0 model, the H100 80GB SXM achieved the highest throughput (samples per second), while the RTX A6000 achieved the lowest.

Bandwidth and throughput analysis

Using this same benchmark, we found it useful to perform the same test on 1x, 2x, 3x, 4x, 6x, and 8x accelerator configurations, as it reveals interesting insights into the impact of each architecture. This type of comparative measurement can highlight the effect of the interconnect - as well as the accelerator's memory bandwidth - on overall performance.

The following table displays the samples per second achieved using different accelerator configurations, including the amount by which the samples per second increased.

Model            | 1 GPU | 2 GPUs        | 3 GPUs        | 4 GPUs        | 6 GPUs         | 8 GPUs
H100 80GB SXM    | 142.3 | 275.0 (1.93x) | 400.6 (2.81x) | 521.8 (3.66x) | 740.3 (5.20x)  | 962.2 (6.76x)
A100 80GB SXM    | 73.4  | 143.1 (1.94x) | 211.2 (2.88x) | 276.9 (3.77x) | 399.39 (5.44x) | 524.2 (7.14x)
RTX A6000 (PCIe) | 32.5  | 59.3 (1.82x)  | 86.8 (2.67x)  | 113.9 (3.50x) | 157.05 (4.83x) | 205.4 (6.32x)

For reference, the key hardware specifications of each accelerator are:

Model         | Interconnect   | Memory bandwidth | TFLOPS (FP16) | Power
H100 80GB SXM | SXM (900 GB/s) | HBM3 (3.35 TB/s) | 267.6         | 700W
A100 80GB SXM | SXM (600 GB/s) | HBM2e (2 TB/s)   | 77.96         | 400W
RTX A6000     | PCIe (32 GB/s) | GDDR6 (768 GB/s) | 38.71         | 300W

The scaling factors achieved at 8 GPUs by the A100 80GB SXM (7.14x) and H100 80GB SXM (6.76x) demonstrate the advantage of the SXM interconnect over PCIe, which typically lands at a scaling factor of 5.5x-6.25x, as seen with the RTX A6000.

This means that at 8 GPUs, PCIe accelerators lose about 25% of their maximum theoretical performance, while SXM hosts only lose about 10%.
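As a quick illustration of how the scaling and efficiency figures above follow from the measured samples per second, here is a small sketch; the numbers are taken from the RTX A6000 (PCIe) row of the table above.

```python
# Sketch of how the scaling factors and efficiency figures above are derived
# from measured samples per second. Values below are the RTX A6000 (PCIe)
# results from the table: 32.5 samples/s on 1 GPU, 205.4 samples/s on 8 GPUs.

def scaling_factor(throughput_n: float, throughput_1: float) -> float:
    """Speed-up of an n-GPU run relative to a single GPU."""
    return throughput_n / throughput_1

def scaling_loss(throughput_n: float, throughput_1: float, num_gpus: int) -> float:
    """Fraction of ideal (linear) scaling lost at num_gpus."""
    return 1 - scaling_factor(throughput_n, throughput_1) / num_gpus

print(f"{scaling_factor(205.4, 32.5):.2f}x scaling at 8 GPUs")     # 6.32x
print(f"{scaling_loss(205.4, 32.5, 8):.0%} below linear scaling")  # 21%
```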

Conclusion

These benchmarks show there is no "one size fits all" GPU type - inference performance will vary widely based on the specifics of the use case. Selecting a GPU is not simply a matter of selecting which is the most powerful - weighing cost and performance is highly dependent upon the models used, techniques applied, and the ultimate goal of the customer.

Need help?

CoreWeave's expert support team is always happy to discuss your use case and which models may be most appropriate for your goals and desired price points.
