NVIDIA’s next-generation Vera Rubin platform has moved into full production, and the company is offering its clearest look yet at what’s inside the rack-level system and why it’s being positioned as a major leap for AI infrastructure.
At the center of the design is the NVL72 rack, built around the Vera Rubin Superchip and a refreshed rack architecture that touches nearly every layer of the stack: compute, memory, interconnects, and cooling. NVIDIA’s infrastructure leadership has described Vera Rubin as one of the most complex AI systems in the world, emphasizing that the challenge isn’t just building faster chips, but delivering a complete, scalable rack that can be deployed and maintained in real data centers.
A big part of the performance jump comes from memory. Vera Rubin pairs the Rubin GPU with HBM4 and gives the Vera CPU dedicated SOCAMM memory modules, pushing memory bandwidth on that side of the Superchip to around 1.2 TB/s. For modern AI workloads, especially large-scale training and high-throughput inference, memory bandwidth is often the limiting factor, so this is a crucial upgrade for keeping the compute fed and reducing bottlenecks.
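To see why bandwidth, rather than raw compute, often gates inference throughput, a quick back-of-envelope sketch helps. Every figure below is an illustrative assumption for demonstration, not an NVIDIA specification:

```python
# Back-of-envelope sketch: why memory bandwidth caps autoregressive decode.
# All numbers are illustrative assumptions, not NVIDIA specifications.

def max_decode_tokens_per_sec(model_bytes: float, mem_bandwidth_bytes: float) -> float:
    """Upper bound on decode rate for a memory-bound model.

    Generating each token requires streaming (roughly) every weight
    from memory once, so throughput cannot exceed bandwidth / model size.
    """
    return mem_bandwidth_bytes / model_bytes

# Hypothetical 70B-parameter model stored in FP8 (1 byte per weight).
model_bytes = 70e9

# Two assumed per-GPU memory bandwidth figures for comparison.
for label, bw in [("current-gen HBM (assumed 3.35 TB/s)", 3.35e12),
                  ("next-gen HBM4 (assumed 13 TB/s)", 13e12)]:
    print(f"{label}: <= {max_decode_tokens_per_sec(model_bytes, bw):.0f} tokens/s per stream")
```

Under these assumed numbers, the per-stream ceiling nearly quadruples purely from the bandwidth change, which is why keeping the compute fed matters so much at this scale.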
Cooling is another major focus for Vera Rubin. NVIDIA is moving toward a modular liquid-cooling approach for the Superchip, using dedicated cold plates for key components such as the Rubin GPU and Vera CPU. The company frames this as more than a performance decision: the updated cooling strategy is presented as a practical path for hyperscalers to deploy more capable liquid cooling, while also reducing water use compared with traditional implementations.
On the networking and scaling front, NVLink remains a defining feature of NVIDIA’s rack-scale AI systems. Vera Rubin introduces sixth-generation NVLink, carried across the rack by the NVLink Spine backplane, with NVIDIA targeting a massive 260 TB/s of aggregate bandwidth per rack. Beyond raw speed, the company is highlighting modularity improvements that enable zero-downtime maintenance, along with rack-level reliability, availability, and serviceability (RAS) features designed for high-uptime environments.
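Some quick arithmetic puts the 260 TB/s figure in perspective. The sketch below assumes the article’s 72-GPU rack and an evenly shared spine, and ignores protocol overhead, duplex accounting, and topology effects; the 10 GB tensor is a made-up example:

```python
# Rough arithmetic on the quoted rack-level NVLink figure.
# 260 TB/s aggregate and the 72-GPU rack size come from the article;
# the even per-GPU split and the example tensor are assumptions.

aggregate_bw = 260e12   # bytes/s, total NVLink Spine bandwidth per rack
gpus_per_rack = 72

per_gpu_bw = aggregate_bw / gpus_per_rack
print(f"Per-GPU share: {per_gpu_bw / 1e12:.1f} TB/s")   # ~3.6 TB/s

# Time to move a hypothetical 10 GB activation tensor between GPUs
# at that per-GPU rate.
tensor_bytes = 10e9
print(f"10 GB transfer: {tensor_bytes / per_gpu_bw * 1e3:.1f} ms")
```

At roughly 3.6 TB/s per GPU under these assumptions, moving even multi-gigabyte tensors between accelerators takes low single-digit milliseconds, which is what makes a 72-GPU rack behave more like one large accelerator than a cluster of small ones.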
Pricing is expected to rise with the new generation, but NVIDIA’s message is that the overall economics improve at scale. The company claims Vera Rubin can deliver up to a 10x reduction in inference token cost and cut the number of GPUs needed to train Mixture-of-Experts (MoE) models by 4x compared with Blackwell GB200. In other words, the pitch is that upfront cost may rise, but the cost to produce tokens and to train advanced models should drop significantly, a trade-off that matters most to buyers running AI at massive volume.
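A hedged sketch of what that pitch means in practice: the hourly cost and throughput below are placeholder numbers invented for illustration, with only the 10x ratio taken from NVIDIA’s claim:

```python
# Illustrative token-economics sketch for the claims above.
# The dollar and throughput figures are placeholders; only the 10x
# token-cost ratio comes from NVIDIA's stated claim.

def cost_per_million_tokens(hourly_rack_cost: float, tokens_per_sec: float) -> float:
    """Amortized serving cost per one million output tokens."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_rack_cost / tokens_per_hour * 1e6

# Hypothetical Blackwell-class baseline rack.
base = cost_per_million_tokens(hourly_rack_cost=300.0, tokens_per_sec=100_000)

# Applying the claimed 10x reduction in token cost.
claimed = base / 10

print(f"Baseline: ${base:.3f} per 1M tokens")
print(f"Claimed:  ${claimed:.3f} per 1M tokens (10x lower)")
```

The point of the exercise: even if the new rack costs substantially more per hour, a 10x drop in cost per token implies throughput gains large enough to dominate the purchase price for high-volume operators.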
With customer commitments expected soon, Vera Rubin is shaping up as NVIDIA’s next major rack-scale AI platform, built to push bandwidth, efficiency, and serviceability forward at the same time.