NVIDIA is positioning Blackwell Ultra as its next-generation AI computing platform for hyperscalers, and early benchmark results suggest it’s built specifically for today’s biggest inference challenges: low latency, long context windows, and agentic AI workloads that repeatedly call models to plan, execute, and iterate.
In recently shared performance testing using SemiAnalysis’ InferenceMAX suite, NVIDIA highlighted how the GB300 NVL72 rack-scale system delivers major gains in both efficiency and throughput compared to the prior Hopper generation. The headline metric NVIDIA is pushing is “tokens per watt,” a practical way to measure how much real inference work a data center can get done for its power budget. With hyperscalers racing to scale capacity, that number increasingly matters as much as raw speed.
According to NVIDIA’s results, GB300 NVL72 can achieve up to 50 times more throughput per megawatt than Hopper in an optimized, real-world deployment configuration. That’s a huge jump in data center economics, where power and cooling limits are often the primary bottlenecks.
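To make those metrics concrete, here’s a minimal sketch of how tokens-per-watt and throughput-per-megawatt figures are typically derived from measured throughput and power draw. The rack numbers below are illustrative placeholders, not values from NVIDIA’s InferenceMAX runs.

```python
# Illustrative sketch: deriving tokens-per-watt and throughput-per-megawatt
# from measured inference throughput and power draw. All numbers here are
# placeholder assumptions, not figures from NVIDIA's published testing.

def tokens_per_watt(throughput_tok_s: float, power_w: float) -> float:
    """Inference work delivered per watt of power draw."""
    return throughput_tok_s / power_w

def throughput_per_megawatt(throughput_tok_s: float, power_w: float) -> float:
    """The same ratio scaled up to a data-center power budget (1 MW)."""
    return throughput_tok_s * (1_000_000 / power_w)

# Hypothetical rack: 500k tokens/s sustained at ~120 kW of power draw.
rack_tok_s = 500_000
rack_power_w = 120_000

print(f"{tokens_per_watt(rack_tok_s, rack_power_w):.2f} tokens/s per watt")
print(f"{throughput_per_megawatt(rack_tok_s, rack_power_w):,.0f} tokens/s per MW")
```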
A big reason behind these gains is NVIDIA’s expanded NVLink approach in Blackwell Ultra. Instead of smaller NVLink groupings, the GB300 NVL72 is designed around a 72-GPU NVLink fabric, effectively turning an entire rack into a single, unified high-bandwidth compute pool. NVIDIA cites 130 TB/s of aggregate bandwidth across that NVLink fabric. By comparison, Hopper-based designs typically topped out at eight GPUs per NVLink domain, which can restrict scaling once workloads outgrow those smaller multi-GPU islands.
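As a quick sanity check on that figure, 130 TB/s spread across 72 GPUs works out to roughly 1.8 TB/s per GPU, consistent with the per-GPU NVLink bandwidth NVIDIA has published for this generation:

```python
# Back-of-the-envelope check on the NVLink fabric figure cited above.
aggregate_tb_s = 130   # NVIDIA's cited aggregate NVLink bandwidth for the rack
num_gpus = 72          # GPUs in one GB300 NVL72 NVLink domain

per_gpu_tb_s = aggregate_tb_s / num_gpus
print(f"~{per_gpu_tb_s:.1f} TB/s of NVLink bandwidth per GPU")  # ~1.8 TB/s
```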
NVIDIA also credits hardware-software co-design at rack scale and its newer low-precision approach, including the NVFP4 number format, as key factors that boost throughput while keeping efficiency high. The takeaway is clear: Blackwell Ultra isn’t just “more GPUs”; it’s a platform tuned to move data faster between accelerators while pushing more tokens through the same power envelope.
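For readers unfamiliar with NVFP4: NVIDIA publicly describes it as a 4-bit floating-point format that pairs a coarse E2M1 value grid with fine-grained per-block scale factors, so tensors keep usable dynamic range despite the tiny bit width. The sketch below illustrates that general block-scaling idea; the value grid and 16-element block size follow NVIDIA’s public descriptions, but the code is a conceptual illustration, not NVIDIA’s implementation (the real format also stores its scale factors in a compact low-precision encoding rather than full precision).

```python
import numpy as np

# Conceptual illustration of block-scaled 4-bit floating-point quantization,
# in the spirit of NVFP4 as NVIDIA has publicly described it. NOT NVIDIA's
# implementation; it just shows why per-block scaling preserves dynamic range.

# Magnitudes representable by an E2M1 (2-bit exponent, 1-bit mantissa) code.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_fp4(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Quantize-dequantize x using one scale per `block` contiguous values."""
    out = np.empty_like(x, dtype=np.float64)
    for start in range(0, len(x), block):
        chunk = x[start:start + block]
        # Scale so the largest magnitude in the block maps to the grid max (6).
        max_abs = float(np.abs(chunk).max())
        scale = max_abs / E2M1_GRID[-1] if max_abs > 0 else 1.0
        scaled = chunk / scale
        # Snap each value to the nearest representable FP4 magnitude.
        idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID).argmin(axis=1)
        out[start:start + block] = np.sign(scaled) * E2M1_GRID[idx] * scale
    return out

weights = np.random.randn(64)
q = quantize_block_fp4(weights)
print("mean abs quantization error:", np.abs(weights - q).mean())
```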
Cost-per-token is another area where NVIDIA claims a dramatic advantage, particularly for inference-heavy customers such as frontier AI labs and hyperscalers serving massive user demand. In the GB300 NVL72 testing shared by NVIDIA, the company reports up to a 35 times reduction in cost per million tokens compared to Hopper. If that holds up broadly, it translates into substantially lower operating costs for large-scale AI services and more headroom to run complex multi-step agent workflows without expenses ballooning.
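For context on how that metric is usually computed: cost per million tokens falls out of the hourly cost of the serving hardware divided by its sustained token throughput. A minimal sketch, with purely illustrative dollar and throughput figures:

```python
# Illustrative cost-per-million-tokens calculation. The dollar and throughput
# figures below are placeholder assumptions, not numbers from NVIDIA's tests.

def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_s: float) -> float:
    tokens_per_hour = tokens_per_s * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000

# Hypothetical: a rack billed at $300/hour sustaining 500k tokens/s.
print(f"${cost_per_million_tokens(300.0, 500_000):.4f} per million tokens")
```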
Because agentic AI often requires long context to maintain state, such as keeping track of large codebases, tool outputs, or extended conversations, NVIDIA also compared Blackwell Ultra against the earlier Blackwell generation for long-context workloads. In that GB200 vs. GB300 NVL72 comparison, NVIDIA points to up to 1.5 times lower cost per token and up to 2 times faster attention processing. Attention performance is crucial in long-context inference because its compute cost grows with the square of the sequence length, so latency and compute requirements rise steeply as context windows expand.
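That quadratic growth is easy to see with the standard FLOP estimate for self-attention, where the QKᵀ and attention-weighted V matmuls each cost about 2·n²·d operations per head. The head and layer dimensions below are assumptions for illustration, not the specs of any particular model:

```python
# Rough sketch of how self-attention compute scales with context length.
# Uses the standard estimate: Q @ K^T and softmax(QK^T) @ V each cost
# ~2 * n^2 * d FLOPs per head. Model dimensions here are assumptions.

def attention_flops(seq_len: int, head_dim: int = 128, num_heads: int = 96) -> float:
    per_head = 2 * 2 * seq_len**2 * head_dim  # both n^2-sized matmuls
    return float(per_head * num_heads)

for n in (8_192, 131_072, 1_048_576):
    print(f"{n:>9} tokens -> {attention_flops(n):.3e} attention FLOPs per layer")
```

Quadrupling attention FLOPs for every doubling of context is exactly why per-token latency and cost climb so quickly at million-token windows, and why NVIDIA is calling out attention throughput as a headline number.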
These results arrive while Blackwell Ultra systems are still rolling into hyperscaler environments, making this some of the earliest benchmark data we’ve seen for GB300 NVL72. Even so, the numbers reinforce the direction the industry is moving: rack-scale interconnect, higher memory bandwidth, better efficiency metrics like tokens per watt, and specialized optimization for long-context inference.
With NVIDIA already teasing what comes after Blackwell, the broader message is that the infrastructure race is increasingly defined by who can deliver the best real-world inference economics at scale—especially for agentic AI, where low-latency responses and long-context reasoning are no longer optional features, but baseline requirements.