Intel pairs Gaudi 3 with NVIDIA Blackwell in a hybrid rack-scale AI system promising faster inference
Intel is taking a pragmatic turn in the AI arms race by blending its Gaudi 3 accelerators with NVIDIA’s Blackwell B200 GPUs in a single rack-scale platform. Shown at the OCP Global Summit, the design splits inference workloads to play to each chip’s strengths: Blackwell handles the heavy prefill stage, while Gaudi 3 tackles the decode phase where memory bandwidth and efficient Ethernet scale-out matter most.
Why this division makes sense
– Blackwell B200 thrives on massive matrix-multiply bursts across full context windows, making it ideal for the compute-bound prefill portion of large language model inference.
– Gaudi 3 is positioned as a cost-efficient decode engine: decode is dominated by memory bandwidth rather than raw compute, so Gaudi 3's bandwidth and Ethernet-first scale-out strategy can lower the total cost per token on that path (a back-of-envelope sketch of the split follows this list).
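A quick arithmetic-intensity calculation makes the division concrete. The model size and prompt length below are illustrative assumptions, not figures from Intel's system:

```python
# Rough arithmetic-intensity sketch for LLM inference (illustrative numbers).
# Assumes a hypothetical 70B-parameter dense model with 16-bit weights.

params = 70e9            # model parameters (assumed)
bytes_per_param = 2      # FP16/BF16 weights
prompt_tokens = 4096     # assumed prompt length

# Prefill: all prompt tokens pass through the weights in one batched sweep,
# so each weight read is amortized across thousands of tokens -> compute-bound.
prefill_flops = 2 * params * prompt_tokens
prefill_bytes = params * bytes_per_param        # weights read roughly once
print(f"prefill FLOPs per byte: {prefill_flops / prefill_bytes:.0f}")  # ~4096

# Decode: each new token needs the same ~2 * params FLOPs but forces a full
# re-read of the weights for a single token -> memory-bandwidth-bound.
decode_flops = 2 * params
decode_bytes = params * bytes_per_param
print(f"decode FLOPs per byte:  {decode_flops / decode_bytes:.0f}")    # ~1
```

An accelerator with abundant matrix throughput wins the first regime; one with cheap, plentiful memory bandwidth wins the second, which is exactly the split Intel is proposing.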
Rack-scale architecture at a glance
– Compute tray: dual Xeon CPUs, four Gaudi 3 accelerators, four 400 GbE NICs, and one BlueField-3 DPU
– Fabric and switching: NVIDIA ConnectX-7 400 GbE NICs on each compute tray, aggregated through Broadcom Tomahawk 5 switches rated at 51.2 Tb/s for all-to-all rack connectivity
– Density: sixteen compute trays per rack, designed for high-throughput inference at data center scale (the bandwidth arithmetic below shows how these numbers fit together)
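Plugging the listed figures together shows the switch headroom the design leaves itself; the utilization number is simple arithmetic on the figures above, not a vendor specification:

```python
# Aggregate per-rack NIC bandwidth vs. Tomahawk 5 capacity, computed from
# the figures quoted above.
trays_per_rack = 16
nics_per_tray = 4
nic_speed_gbps = 400     # ConnectX-7 at 400 GbE

rack_nic_tbps = trays_per_rack * nics_per_tray * nic_speed_gbps / 1000
print(f"aggregate NIC bandwidth: {rack_nic_tbps:.1f} Tb/s")        # 25.6 Tb/s

tomahawk5_tbps = 51.2    # Broadcom Tomahawk 5 switching capacity
print(f"switch utilization: {rack_nic_tbps / tomahawk5_tbps:.0%}") # 50%
```

Roughly half the switch capacity is consumed by tray-facing ports, leaving the rest for inter-rack uplinks and the all-to-all traffic pattern the design targets.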
Early performance claims
Intel says the hybrid setup delivers up to 1.7x faster prefill performance than a B200-only baseline on small, dense models. These figures have not been independently validated and should be treated as preliminary, but they point in a compelling direction: disaggregating inference across heterogeneous silicon to raise throughput and lower cost.
What each side gains
– Intel can monetize Gaudi by bundling it into a complementary decode role within NVIDIA-heavy environments, rather than competing head-on for every stage of inference.
– NVIDIA’s networking stack gets a showcase, with ConnectX-7 NICs and Ethernet-centric designs proving their mettle in a mixed-silicon deployment.
Caveats to watch
– Software maturity remains a hurdle for Gaudi, and a relatively young stack could slow adoption.
– With indications that the Gaudi architecture may be nearing a transition, long-term mainstream uptake of this exact configuration is uncertain.
– Hybrid deployments add operational complexity, demanding robust orchestration and model-partitioning tools to consistently split prefill and decode across different accelerators (a minimal routing sketch follows this list).
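To make the orchestration caveat concrete, here is a minimal, hypothetical routing sketch. The class, the pool objects, and their prefill/decode methods are invented for illustration and do not correspond to any shipping scheduler:

```python
# Hypothetical sketch of prefill/decode disaggregation (not a real API).
# A production system would also stream the KV cache from the prefill pool
# to the decode pool, handle worker failures, and batch requests.
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    kv_cache: object = None        # produced by the prefill stage

class HybridScheduler:
    def __init__(self, prefill_pool, decode_pool):
        self.prefill_pool = prefill_pool   # e.g. Blackwell B200 workers
        self.decode_pool = decode_pool     # e.g. Gaudi 3 workers

    def run(self, req: Request) -> str:
        # Stage 1: compute-bound pass over the full prompt.
        req.kv_cache = self.prefill_pool.prefill(req.prompt)
        # Stage 2: bandwidth-bound token-by-token generation, after the
        # KV cache has crossed the Ethernet fabric to the decode pool.
        return self.decode_pool.decode(req.kv_cache, req.max_new_tokens)
```

Getting this handoff right, including moving large KV caches over 400 GbE fast enough not to stall decode, is where most of the operational complexity lives.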
Who this is for
– Hyperscalers and large AI service providers pushing LLM inference at scale, where shaving latency and cost in decode can meaningfully improve TCO.
– Teams standardizing on Ethernet and seeking alternatives to fully homogeneous GPU racks without sacrificing prefill throughput.
Bottom line
A heterogeneous rack built around Blackwell for prefill and Gaudi 3 for decode is a smart, workload-aware design that leans into each platform’s strengths. If the reported gains hold up under independent testing and the software ecosystem matures, this approach could become a viable template for high-efficiency, rack-scale AI inference in data centers.