NVIDIA Blackwell Ultra GB300 and AMD Instinct MI355X dominate MLPerf v5.1 AI inference benchmarks
The latest MLPerf v5.1 results are in, and two heavyweights just stole the show. NVIDIA’s Blackwell Ultra GB300 and AMD’s Instinct MI355X arrived with big expectations, and the new numbers largely bear them out: both accelerators delivered standout AI inference performance across today’s most demanding workloads. Intel’s Arc Pro B60 also made an appearance, underlining its value-driven positioning even though it doesn’t chase pure datacenter speed.
DeepSeek‑R1 highlights
– Offline: GB300 posts a crushing win, delivering a 45% uplift over GB200 in a 72‑GPU setup and a 44% gain in an 8‑GPU comparison. That lands almost exactly where NVIDIA had guided with Blackwell Ultra.
– Server: The GB300 extends its lead with a 25% improvement in the 72‑GPU run and a 21% advantage in the 8‑GPU run versus GB200.
Large-language-model performance
– Llama 3.1 405B (Offline): AMD’s Instinct MI355X makes a strong showing with a 27% performance increase over the GB200 submission. There was no 8‑GPU MI355X entry here, but the multi-accelerator gain is notable.
– Llama 2 70B (Offline): MI355X delivers 648,248 tokens/s with 64 accelerators, 350,820 tokens/s with 32, and 65,770 tokens/s with 8. In the 8‑accelerator match-up, that’s a massive 2.09x over the GB200 configuration. For context, Intel’s Arc Pro B60 scores 3,009 tokens/s here; while it’s nowhere near the datacenter-class monsters, it targets a far more accessible price and power envelope.
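As a quick sanity check on those MI355X scaling numbers, per-accelerator throughput and scaling efficiency fall out of simple arithmetic on the quoted totals (a back-of-the-envelope sketch, not MLPerf tooling):

```python
# Back-of-the-envelope scaling check on the MI355X Llama 2 70B (Offline)
# totals quoted above -- plain arithmetic, not MLPerf tooling.
results = {8: 65_770, 32: 350_820, 64: 648_248}  # accelerators -> tokens/s

base_n, base_tps = 8, results[8]
for n, tps in sorted(results.items()):
    per_acc = tps / n
    # Efficiency vs. the 8-accelerator run: 1.00 would be perfectly linear.
    eff = (tps / base_tps) / (n / base_n)
    print(f"{n:>2} accelerators: {per_acc:,.0f} tokens/s each, "
          f"{eff:.2f}x linear-scaling efficiency")
# Output: 8 -> 8,221 each (1.00x); 32 -> 10,963 (1.33x); 64 -> 10,129 (1.23x)
```

Interestingly, per-accelerator throughput comes out higher in the 32- and 64-accelerator runs than in the 8-accelerator one; that usually reflects different batching and parallelism settings across submissions rather than inherently superlinear scaling.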
Per-accelerator records claimed by Blackwell Ultra
NVIDIA shared a broad set of per-accelerator highs across popular AI tasks:
– DeepSeek‑R1: 5,842 tokens/s (Offline); 2,907 tokens/s (Server)
– Llama 3.1 405B: 224 tokens/s (Offline); 170 tokens/s (Server); 138 tokens/s (Interactive)
– Llama 2 70B 99.9%: 12,934 tokens/s (Offline); 12,701 tokens/s (Server); 7,856 tokens/s (Interactive)
– Llama 2 70B 99%: 13,015 tokens/s (Offline); 12,701 tokens/s (Server); 7,856 tokens/s (Interactive)
– Llama 3.1 8B: 18,370 tokens/s (Offline); 16,099 tokens/s (Server); 15,284 tokens/s (Interactive)
– Stable Diffusion XL: 4.07 samples/s (Offline); 3.59 queries/s (Server)
– Mixtral 8x7B: 16,099 tokens/s (Offline); 16,131 tokens/s (Server)
– DLRMv2 99%: 87,228 samples/s (Offline); 80,515 queries/s (Server)
– DLRMv2 99.9%: 48,666 samples/s (Offline); 46,259 queries/s (Server)
– Whisper: 5,667 tokens/s (Offline)
– R‑GAT: 81,404 samples/s (Offline)
– RetinaNet: 1,875 samples/s (Offline); 1,801 queries/s (Server)
Reasoning gains over Hopper
Blackwell Ultra also posts eye‑opening wins in reasoning-heavy workloads. On DeepSeek‑R1, it delivers a 4.7x advantage offline and a 5.2x advantage in server mode compared with Hopper:
– Hopper: 1,253 tokens/s per GPU (Offline); 556 tokens/s per GPU (Server)
– Blackwell Ultra: 5,842 tokens/s per GPU (Offline); 2,907 tokens/s per GPU (Server)
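Those multipliers follow directly from the per-GPU figures above; a two-line check (plain arithmetic, nothing vendor-specific):

```python
# Verify the quoted Hopper -> Blackwell Ultra speedups from the
# per-GPU tokens/s figures listed above.
hopper = {"offline": 1_253, "server": 556}
gb300  = {"offline": 5_842, "server": 2_907}

for mode in ("offline", "server"):
    print(f"{mode}: {gb300[mode] / hopper[mode]:.1f}x")
# offline: 4.7x, server: 5.2x -- matching the claimed advantages
```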
Why it matters
– For hyperscalers and enterprises scaling generative AI, the GB300’s per‑GPU efficiency and leadership in reasoning workloads can translate directly into lower latency, higher throughput, and better cost-per-token for large deployments (see the quick sketch after this list).
– AMD’s MI355X is emerging as a serious contender in multi‑accelerator LLM inference, especially on Llama workloads where it posted sizable gains versus prior-generation rivals.
– Intel’s Arc Pro B60 underscores a different value proposition, offering respectable results for its class without the datacenter footprint.
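To make the cost-per-token point concrete, here is a minimal sketch. The hourly rate is a placeholder assumption for illustration, not a quoted price; the throughput is the GB300 per-GPU DeepSeek‑R1 server figure from above.

```python
# Minimal cost-per-token sketch. The hourly rate below is a hypothetical
# placeholder, NOT a quoted price -- substitute your own cloud or
# amortized-hardware cost. Throughput is the GB300 per-GPU DeepSeek-R1
# (Server) figure quoted above.
HOURLY_COST_PER_GPU = 10.00   # USD/hour, assumed for illustration only
TOKENS_PER_SECOND = 2_907     # GB300 per GPU, DeepSeek-R1 (Server)

tokens_per_hour = TOKENS_PER_SECOND * 3_600
cost_per_million = HOURLY_COST_PER_GPU / tokens_per_hour * 1_000_000
print(f"${cost_per_million:.2f} per million tokens")  # ~$0.96 at $10/hour
```

The specific dollar figure matters less than the structure of the calculation: per-GPU throughput gains feed straight into the denominator, which is why generational uplifts of this size move deployment economics.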
What to watch next
Vendors typically refine kernels, memory pipelines, and software stacks between MLPerf rounds. Expect even higher scores as NVIDIA, AMD, and Intel squeeze more from their current platforms in the next submission cycle. For teams planning AI infrastructure, these trends point to rapid generational gains, particularly in reasoning and high-token-rate LLM inference.