NVIDIA GH200 Llama 3.1 Inference Performance Testing on VALDI

The NVIDIA H100 is a standalone GPU based on the Hopper architecture, while the GH200 (available in one of VALDI's US-based data centers) is an integrated CPU-GPU module based on the Grace Hopper architecture. The GH200 pairs an Arm-based Grace CPU with a Hopper GPU and connects them over NVLink-C2C for high-bandwidth, low-latency CPU-GPU communication. It also offers more memory than the H100's 80GB of HBM3: 96GB of HBM3 on the GPU plus the Grace CPU's LPDDR5X memory. This tight integration makes the GH200 well suited to workloads, such as LLM inference, that combine parallel computation on the GPU with pre- and post-processing on the CPU. In our Llama 3.1 testing, GH200s delivered average inference throughput similar to H100s, but thanks to the GH200's higher CPU-GPU I/O bandwidth, peak throughput was up to 68% higher in initial testing than on Hopper-only nodes.

Benchmarking Results

We ran initial benchmarks of Llama 3.1 inference on NVIDIA GH200 GPUs using a custom script that measures throughput in tokens per second (TPS). We expect to refine the testing approach over time, but here are our initial findings:

NVIDIA 1x GH200 480GB

| Model             | Batch Size | Max Input Tokens | Avg TPS | Stdev TPS |
|-------------------|------------|------------------|---------|-----------|
| Meta-Llama-3.1-8B | 64         | 1000             | 3744.68 | 231.17    |
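
The exact benchmarking script is not published here, but a minimal sketch of a TPS measurement follows. It assumes an OpenAI-compatible completions endpoint (for example, vLLM serving Meta-Llama-3.1-8B locally); the endpoint URL, model name, prompts, and run counts are illustrative assumptions, not the actual test configuration.

```python
import time
import statistics
import requests

# Assumptions (not from the original test setup): an OpenAI-compatible
# /v1/completions endpoint, e.g. vLLM serving Meta-Llama-3.1-8B locally.
ENDPOINT = "http://localhost:8000/v1/completions"
MODEL = "meta-llama/Meta-Llama-3.1-8B"
BATCH_SIZE = 64   # prompts sent per timed batch
RUNS = 10         # number of timed batches

prompts = ["Summarize the benefits of GPU inference."] * BATCH_SIZE

tps_samples = []
for _ in range(RUNS):
    start = time.perf_counter()
    resp = requests.post(
        ENDPOINT,
        json={"model": MODEL, "prompt": prompts, "max_tokens": 128},
        timeout=300,
    )
    resp.raise_for_status()
    elapsed = time.perf_counter() - start

    # Use the server-reported completion token count for this batch.
    usage = resp.json().get("usage", {})
    completion_tokens = usage.get("completion_tokens", 0)
    tps_samples.append(completion_tokens / elapsed)

print(f"Avg TPS:   {statistics.mean(tps_samples):.2f}")
print(f"Stdev TPS: {statistics.stdev(tps_samples):.2f}")
```

Averaging over several timed batches, as sketched above, is what produces both the average and the standard deviation reported in the table.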

Try It Yourself

Want to experiment with Llama 3.1 optimization? Check out our GPU marketplace to rent high-performance NVIDIA GPUs. Sign up now and start optimizing in minutes!

Rent a GPU Now


Keywords: Llama 3.1, 70B model, 405B model, NVIDIA GPU, performance optimization, inference optimization, NLP, large language models