[Image: Man showcasing an NVIDIA GPU on stage with server racks in the background.]

NVIDIA Crushes MLPerf Inference v6.0 With Blackwell Ultra, Delivering a Massive Lead Over Rivals

NVIDIA is once again making a loud statement in the AI hardware race, becoming one of the first companies to publish an extensive MLPerf Inference v6.0 submission and claiming stronger performance than the rest of the field combined. In its latest update, NVIDIA points to Blackwell Ultra as the key ingredient behind what it calls the highest AI factory throughput and the lowest cost per token, reinforcing the company’s message that performance gains come not just from raw hardware, but from the entire platform working as one.

MLPerf Inference is widely regarded as one of the most demanding industry-standard AI benchmarking suites because it tests real-world inference workloads under a common set of rules. Version 6.0 raises the bar further by adding support for newer reasoning and Mixture-of-Experts (MoE) models such as DeepSeek-R1, GPT-OSS-120B, and Mixtral 8x7B. It also expands coverage of dense large language models, generative recommender systems, and vision-language models, reflecting the types of workloads enterprises are actually deploying today.

NVIDIA’s posted results highlight significant improvements on the GB300 NVL72 platform between MLPerf Inference v5.1 and v6.0. For DeepSeek-R1 in the server scenario, throughput rose from 2,907 tokens/sec per GPU to 8,064 tokens/sec per GPU, a 2.77x increase. In the offline scenario, DeepSeek-R1 improved from 5,842 to 9,821 tokens/sec per GPU, a 1.68x gain. For Llama 3.1 405B, performance also climbed: server mode improved from 170 to 259 tokens/sec per GPU (1.52x), and offline mode moved from 224 to 271 tokens/sec per GPU (1.21x).
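
As a quick check on the multipliers above, here is a minimal Python sketch that recomputes each speedup directly from the published per-GPU throughput figures. The numbers are taken from NVIDIA's reported v5.1 and v6.0 results; the script itself is purely illustrative.

```python
# Recompute GB300 NVL72 speedups from NVIDIA's published per-GPU
# throughput figures (tokens/sec per GPU), v5.1 -> v6.0.
results = {
    "DeepSeek-R1 (server)":     (2907, 8064),
    "DeepSeek-R1 (offline)":    (5842, 9821),
    "Llama 3.1 405B (server)":  (170, 259),
    "Llama 3.1 405B (offline)": (224, 271),
}

for workload, (v51, v60) in results.items():
    print(f"{workload}: {v51} -> {v60} tok/s per GPU ({v60 / v51:.2f}x)")
```

Running this reproduces the 2.77x, 1.68x, 1.52x, and 1.21x figures quoted above.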

One of the more notable takeaways is that NVIDIA attributes a meaningful portion of these gains to software and platform optimizations, not hardware changes. The company says that since its earlier DeepSeek-R1 submission, it has achieved up to a 2.7x increase in token throughput through optimization work alone. At the same time, NVIDIA claims that compared with the prior GB200 NVL72 generation, Blackwell Ultra can deliver up to a 2.77x speedup under the v6.0 benchmark rules, suggesting the improvements hold up even under stricter, more current testing.

NVIDIA also emphasizes that achieving high inference throughput at scale requires what it describes as “extreme co-design” across chips, system architecture, data center design, and software. In other words, it’s positioning its advantage as an end-to-end systems win rather than a single-component victory. The company argues that this approach is why its MLPerf Inference v6.0 results show strong performance across a broad range of AI workloads, from massive LLM inference to vision-language models and generative recommendation pipelines.

Beyond bragging rights, NVIDIA is clearly aiming at the metrics that matter to organizations building and operating AI infrastructure: cost per token, total cost of ownership, and efficiency at deployment scale. The company’s narrative is that higher tokens-per-second throughput and better tokens-per-dollar economics translate into more usable AI output for the same investment, especially in large “AI factory” style deployments.
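
To make the tokens-per-dollar framing concrete, the sketch below works through the arithmetic under a purely hypothetical GPU-hour price (the dollar figure is an illustrative placeholder, not an NVIDIA number). At a fixed hourly cost, any multiplier on tokens per second carries over one-for-one to tokens per dollar.

```python
# Illustrative tokens-per-dollar arithmetic. The GPU-hour price is a
# hypothetical placeholder; only the throughput figures come from the
# benchmark results quoted earlier.
SECONDS_PER_HOUR = 3600

def tokens_per_dollar(tokens_per_sec: float, usd_per_gpu_hour: float) -> float:
    """Tokens produced per dollar of GPU time at a given hourly price."""
    return tokens_per_sec * SECONDS_PER_HOUR / usd_per_gpu_hour

price = 10.0  # hypothetical $/GPU-hour, chosen only for illustration
v51 = tokens_per_dollar(2907, price)  # DeepSeek-R1 server, v5.1
v60 = tokens_per_dollar(8064, price)  # DeepSeek-R1 server, v6.0
print(f"v5.1: {v51:,.0f} tokens/$   v6.0: {v60:,.0f} tokens/$   ({v60 / v51:.2f}x)")
```

Under these assumptions, the same hardware budget buys roughly 2.77x more DeepSeek-R1 output, which is the token-economics point NVIDIA is driving at.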

MLPerf remains an especially tough proving ground, and NVIDIA’s willingness to submit broad benchmark coverage continues to set it apart in terms of public, standardized performance reporting. With MLPerf Inference v6.0 expanding into newer model types and practical enterprise workloads, NVIDIA is using these results to reinforce its claim that Blackwell Ultra delivers leading inference performance where it counts most: modern reasoning, large language models, and production-scale AI services.