
NVIDIA’s Blackwell Ultra Dominates All Seven MLPerf Training Benchmarks; GB300 NVL72 Sets 10-Minute Llama 405B Record

NVIDIA’s GB300 NVL72 Blackwell Ultra racks sweep MLPerf training, raise the bar for AI speed and scale

NVIDIA has claimed a clean sweep across the latest MLPerf Training benchmarks, showcasing its Blackwell Ultra–based GB300 NVL72 as the rack‑scale platform to beat for large‑scale AI training. According to the company, it was the only participant to submit results for every test this round, widening the performance gap over competing systems and its own previous-generation hardware.

A few headline results underscore how fast the latest platform is:
– Llama 3.1 405B: 10 minutes
– Llama 2 70B LoRA: 0.4 minutes
– Llama 3.1 8B: 5.2 minutes
– FLUX.1: 12.5 minutes
– DLRM-dcnv2: 0.71 minutes
– R-GAT: 1.1 minutes
– RetinaNet: 1.4 minutes

Beyond raw times, the comparisons tell the story. Using the same number of GPUs in a rack-scale configuration, Blackwell Ultra outpaced Hopper-based systems by wide margins. In Llama 3.1 405B pretraining, GB300 delivered more than 4x the performance of H100 and nearly 2x that of the Blackwell GB200. In Llama 2 70B LoRA fine‑tuning, just eight GB300 GPUs achieved 5x the throughput of the same number of H100 GPUs.
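
Since MLPerf compares time-to-train at matched GPU counts, the speedup math is simple; the sketch below makes it explicit. The 2.0-minute H100 time is a purely illustrative placeholder, not a submitted result.

```python
def per_gpu_speedup(time_old_min: float, time_new_min: float,
                    gpus_old: int, gpus_new: int) -> float:
    """Per-GPU training-throughput ratio between two time-to-train runs.

    For a fixed benchmark workload, throughput ~ 1 / time, so the
    per-GPU speedup is (t_old / t_new) * (gpus_old / gpus_new).
    """
    return (time_old_min / time_new_min) * (gpus_old / gpus_new)

# At matched GPU counts the ratio reduces to t_old / t_new.
print(per_gpu_speedup(2.0, 0.4, gpus_old=8, gpus_new=8))  # -> 5.0
```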

NVIDIA attributes the gains to a combination of silicon, software, memory, and networking:
– CUDA software stack and a mature developer ecosystem optimized for training at scale
– Quantum‑X800 InfiniBand with 800 Gb/s networking for fast, low‑latency cluster communication
– 279 GB of HBM3e per GPU and up to 40 TB of pooled GPU and CPU memory in an NVL72 rack for massive model and batch sizes
– End‑to‑end FP4 precision for LLM training, which roughly doubles throughput versus FP8 and reaches around 3x with Blackwell Ultra’s advancements (sketched below)
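
For intuition on the FP4 point, here is a minimal NumPy sketch of block-scaled 4-bit float (e2m1) quantization, the element layout commonly used by FP4 training formats. It illustrates the concept only and is not NVIDIA's implementation; the function name, block size of 16, and max-based scaling are assumptions chosen for clarity.

```python
import numpy as np

# Positive values representable in e2m1 (2 exponent bits, 1 mantissa bit),
# the 4-bit float layout generally used by FP4 training formats.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_blockwise(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Simulate block-scaled FP4 quantization of a 1-D tensor (illustrative).

    Each block of `block` values shares one scale chosen so the block's
    max magnitude maps onto the largest e2m1 value (6.0). Returns the
    dequantized tensor so the rounding error can be inspected.
    """
    x = x.reshape(-1, block)
    scales = np.abs(x).max(axis=1, keepdims=True) / E2M1_GRID[-1]
    scales = np.where(scales == 0, 1.0, scales)   # avoid divide-by-zero
    scaled = x / scales                           # magnitudes now within [0, 6]
    # Round each magnitude to the nearest representable e2m1 value.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    quant = np.sign(scaled) * E2M1_GRID[idx]
    return (quant * scales).ravel()

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
w_fp4 = quantize_fp4_blockwise(w)
print("mean abs rounding error:", np.abs(w - w_fp4).mean())
```

The payoff is that each value travels as a 4-bit payload plus a small shared scale per block, halving memory traffic versus FP8 while the per-block scaling preserves usable dynamic range.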

Scale is a central theme. Compared with NVIDIA’s June submission, the latest run trained the Llama 3.1 405B‑parameter model in just 10 minutes using 5,120 Blackwell GPUs, roughly twice the GPU count of the earlier round. The emphasis on FP4 at every layer, combined with the bandwidth and memory depth of the NVL72, kept scaling efficient as the cluster grew rather than letting communication overhead erode the gains.
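
As a rough cross-round comparison, the arithmetic below separates how much of the gain came from more GPUs versus faster GPUs. The June-round figures here are assumptions used for illustration, not quoted results; only the 10-minute, 5,120-GPU numbers come from this round.

```python
# Prior-round figures are stand-ins (assumed: ~27 min on 2,560 GPUs in June).
prev_minutes, prev_gpus = 27.0, 2560
new_minutes, new_gpus = 10.0, 5120

speedup = prev_minutes / new_minutes    # end-to-end time-to-train gain
gpu_ratio = new_gpus / prev_gpus        # how much more hardware was used
per_gpu_gain = speedup / gpu_ratio      # >1.0 means each GPU got faster
print(f"{speedup:.1f}x faster on {gpu_ratio:.1f}x the GPUs "
      f"-> {per_gpu_gain:.2f}x per-GPU improvement")
```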

For enterprises racing to train and fine‑tune frontier models, the takeaway is clear: the GB300 NVL72 platform currently offers leading training performance across diverse MLPerf workloads—from LLMs and LoRA fine‑tuning to recommendation systems and vision models—making it a compelling choice for organizations pushing the limits of AI at rack scale.