[Image: A data center with multiple rows of black server racks, illuminated by streams of light representing data flow.]

NVIDIA’s Blackwell Breakthrough: How “Extreme Co-Design” Slashed AI Token Costs by 10x

NVIDIA’s Blackwell platform is quickly becoming the new benchmark for AI inference efficiency, and the company is now highlighting a major leap forward in the metric many developers and businesses care about most: token economics. In simple terms, that means getting more usable AI output for less money and in less time, and Blackwell is showing a dramatic improvement over the previous Hopper generation.

According to NVIDIA, the Blackwell platform can reduce cost per token by as much as 10x compared with Hopper. That’s a huge deal for inference providers, where every fraction of a cent matters at scale and where latency can make or break real-world applications. It’s a key reason several inference-focused companies are already adopting Blackwell to serve advanced open-source AI models that have reached frontier-level capability.
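To see why a 10x cost-per-token reduction matters, it helps to write the arithmetic down. The sketch below is illustrative only: the hourly price and throughput numbers are hypothetical placeholders, not NVIDIA’s figures, and simply show that at a fixed cluster price, 10x the token throughput means one-tenth the cost per token.

```python
# Back-of-the-envelope cost-per-token math. All numbers below are
# hypothetical placeholders for illustration, not NVIDIA's figures.

def cost_per_million_tokens(cluster_usd_per_hour: float,
                            tokens_per_second: float) -> float:
    """Dollars per 1M generated tokens for a cluster at a given throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return cluster_usd_per_hour / tokens_per_hour * 1_000_000

# Same hourly price, 10x the throughput -> 10x lower cost per token.
baseline = cost_per_million_tokens(cluster_usd_per_hour=100.0,
                                   tokens_per_second=10_000)
improved = cost_per_million_tokens(cluster_usd_per_hour=100.0,
                                   tokens_per_second=100_000)
print(f"baseline: ${baseline:.2f}/M tokens, improved: ${improved:.2f}/M tokens")
```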

NVIDIA says these gains aren’t coming from hardware alone. The company points to three factors driving the improvement: frontier-grade open-source models, tight hardware-software co-design in Blackwell, and highly optimized inference stacks built by the providers themselves. Together, those elements are delivering noticeably lower token costs for businesses across a wide range of industries.

Blackwell’s benefits are also being reported across different use cases, from general-purpose inference hosting to specialized deployments. NVIDIA highlights organizations such as Baseten and Sully.ai, as well as DeepInfra and Latitude, as examples of teams seeing lower latency, lower inference costs, and more consistent responses. For multi-agent workflows and specialized AI agent deployments, Sentient Labs is cited as achieving roughly 25% to 50% better cost efficiency compared with Hopper, an important improvement for agentic systems that can multiply token usage quickly.

A major technical reason behind these results is Blackwell’s focus on “extreme co-design,” an approach NVIDIA says aligns well with modern Mixture-of-Experts (MoE) architectures. The GB200 NVL72 system is a centerpiece of that strategy, linking 72 Blackwell GPUs into a single domain with roughly 30TB of fast shared memory. This setup is designed to push expert parallelism further: a model’s experts are spread across GPUs, and batches of tokens are continually split up and routed among them while keeping data movement efficient as workloads scale. NVIDIA argues that this architecture helps explain why Blackwell can deliver its strongest tokenomics yet, especially as inference grows more complex and more distributed.
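As a rough mental model of what that routing involves, the NumPy toy below sends each token to its top-k experts and groups tokens by destination expert. It is a sketch of the dispatch pattern only, with made-up dimensions; in a real expert-parallel system like the NVL72, each expert lives on a different GPU and step 2 becomes an all-to-all exchange over NVLink, handled by tuned kernels rather than Python loops.

```python
# Minimal sketch of Mixture-of-Experts top-k routing. Illustrative only:
# real systems shard the experts across GPUs; here "devices" are just
# list indices, and all shapes are made up.
import numpy as np

n_tokens, d_model, n_experts, top_k = 8, 16, 4, 2
rng = np.random.default_rng(0)

tokens = rng.standard_normal((n_tokens, d_model))
router_w = rng.standard_normal((d_model, n_experts))  # router weights
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

# 1. The router scores each token against every expert and keeps the top-k.
logits = tokens @ router_w
topk = np.argsort(logits, axis=1)[:, -top_k:]          # (n_tokens, top_k)
weights = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)

# 2. Dispatch: group token indices by destination expert. With expert
#    parallelism this step is an all-to-all exchange between GPUs.
dispatch = {e: np.where((topk == e).any(axis=1))[0] for e in range(n_experts)}

# 3. Each expert processes only its own shard of tokens; results are
#    combined, weighted by the router's probabilities.
output = np.zeros_like(tokens)
for e, idx in dispatch.items():
    expert_out = tokens[idx] @ experts[e]
    output[idx] += weights[idx, e:e+1] * expert_out

print({e: idx.tolist() for e, idx in dispatch.items()})
```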

Looking ahead, NVIDIA is already framing the next step in efficiency gains with its Vera Rubin roadmap. The company suggests future improvements will come from architectural advances and specialized mechanisms such as the Rubin CPX processor, aimed at the compute-heavy prefill phase of inference, alongside broader infrastructure optimizations. As AI models, workflows, and deployment demands evolve rapidly, NVIDIA’s message is clear: performance isn’t only about building faster chips; it’s also about making every token cheaper, faster, and more reliable to generate in real-world inference.
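The prefill/decode split behind that roadmap is a general property of transformer inference: prefill processes the entire prompt in one compute-heavy pass to build the KV cache, while decode generates one token at a time and repeatedly reads that growing cache, which tends to be memory-bound. The toy below is only meant to make the phase structure concrete; the “attention” math is a simplified stand-in, and nothing here reflects CPX’s actual design.

```python
# Toy sketch of why prefill and decode stress hardware differently.
# The "model" is a stand-in; the two-phase structure is the point.
import numpy as np

d = 16
rng = np.random.default_rng(1)
w_kv = rng.standard_normal((d, d))

def prefill(prompt_embs: np.ndarray) -> np.ndarray:
    """Compute-bound: one big matmul over the whole prompt builds the KV cache."""
    return prompt_embs @ w_kv                       # (prompt_len, d)

def decode_step(kv_cache: np.ndarray, tok: np.ndarray):
    """Memory-bound: each new token attends over the ever-growing cache."""
    scores = kv_cache @ tok                         # read the whole cache
    out = kv_cache.T @ (np.exp(scores) / np.exp(scores).sum())
    return np.vstack([kv_cache, tok @ w_kv]), out   # append the token's KV entry

cache = prefill(rng.standard_normal((512, d)))      # prompt: 512 tokens at once
tok = rng.standard_normal(d)
for _ in range(4):                                  # generation: 1 token per step
    cache, tok = decode_step(cache, tok)
print(cache.shape)                                  # (516, 16)
```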