NVIDIA has provided an in-depth look at its Blackwell AI platform, highlighting how a novel high-bandwidth interface joins two GPU dies into a single, seamless unit. The Blackwell platform spans a range of components, including:
– Blackwell GPU
– Grace CPU
– NVLINK Switch Chip
– BlueField-3
– ConnectX-7
– ConnectX-8
– Spectrum-4
– Quantum-3
The platform is powered by over 400 optimized CUDA-X libraries designed for high performance across a wide range of application domains. These libraries are continually updated to support new algorithms, keeping the platform ready for upcoming AI models.
### Key Features of the Blackwell AI Platform
#### AI Superchip
– **Transistors**: 208 billion
– **Technology**: TSMC 4NP process; two reticle-limited dies with a combined area of more than 1600 mm²
– **Capabilities**: Features include a transformer engine supporting FP4/FP6 data formats, a secure AI engine with full-performance encryption, 5th Gen NVLINK scaling up to 576 GPUs, a decompression engine with 800 GB/s of bandwidth, and a RAS engine for in-system self-test.
#### Blackwell GPU
The Blackwell GPU stands out with unprecedented compute, memory bandwidth, and interconnect capabilities. It employs NV-HBI (NVIDIA High-Bandwidth Interface) to merge two reticle-limited GPUs into one cohesive unit, yielding:
– **20 PetaFLOPS FP4 AI**
– **8 TB/s memory bandwidth**
– **1.8 TB/s bidirectional NVLINK bandwidth**
#### Multi-Die Architecture
NVIDIA has been laying the groundwork for multi-die packaging over successive generations, and Blackwell is the first GPU to ship a 2-die implementation. The two dies are linked via NV-HBI, which offers 10 TB/s of bidirectional bandwidth while keeping energy per bit low.
#### 5th Generation Tensor Cores
These cores introduce narrower micro-tensor floating-point formats, FP4 and FP6, alongside FP8, trading precision for throughput and efficiency:
– **FP4** delivers 4x the throughput of Hopper’s FP8
– **FP6** delivers 2x the throughput of Hopper’s FP8
NVIDIA’s Quasar Quantization system tunes these low-precision formats, FP4 in particular, to preserve model accuracy despite the reduced bit width.
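NVIDIA has not published Quasar’s internals, but the basic mechanics of FP4-style quantization can be sketched in plain Python. The snippet below is an illustrative simplification, assuming a symmetric per-tensor scale and the E2M1 value grid commonly cited for 4-bit floating point; the function names are invented for this example.

```python
# Magnitudes representable by an E2M1 layout (1 sign, 2 exponent, 1 mantissa bit).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(values):
    """Scale values into the FP4 range, then snap each to the nearest code."""
    absmax = max(abs(v) for v in values) or 1.0
    scale = absmax / FP4_GRID[-1]  # map the largest magnitude onto +/-6.0
    quantized = []
    for v in values:
        mag = min(FP4_GRID, key=lambda g: abs(g - abs(v) / scale))
        quantized.append(mag if v >= 0 else -mag)
    return quantized, scale

def dequantize_fp4(quantized, scale):
    """Rescale the 4-bit codes back to approximate original values."""
    return [q * scale for q in quantized]

codes, scale = quantize_fp4([0.1, -0.4, 0.75, 1.2])
restored = dequantize_fp4(codes, scale)  # restored is approximately [0.1, -0.4, 0.8, 1.2]
```

The real system operates per-block with hardware-chosen scale factors, but the precision-vs-range trade-off it manages is the one visible here: 0.75 cannot be represented exactly and lands on the nearest code.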
### System Integration
#### NVLINK 5th Generation
Fifth-generation NVLINK ties the platform together, giving each GPU 1.8 TB/s of bidirectional bandwidth across 18 links.
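The aggregate figure implies a clean per-link rate, assuming the 1.8 TB/s is spread evenly across the 18 links:

```python
total_bandwidth_tb = 1.8  # TB/s of bidirectional bandwidth per GPU
num_links = 18

# Convert to GB/s and divide across the links.
per_link_gb = total_bandwidth_tb * 1000 / num_links
print(per_link_gb)  # 100.0 GB/s of bidirectional bandwidth per link
```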
#### NVLINK Switch Chip (4th Gen)
Housed in the NVLINK Switch Tray, this component connects up to 72 GPUs within a single rack and provides 14.4 TB/s of aggregate bandwidth.
#### Grace Blackwell Superchip
Combining one Grace CPU with two Blackwell GPUs, this superchip delivers 40 PetaFLOPS of FP4 and 20 PetaFLOPS of FP8 compute.
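These totals are consistent with the per-GPU figure quoted earlier (20 PetaFLOPS of FP4 per Blackwell GPU); that FP8 runs at half the FP4 rate is an assumption here, but one that matches the numbers given:

```python
gpus_per_superchip = 2
fp4_per_gpu_pflops = 20  # per-GPU FP4 figure from the Blackwell GPU section
fp8_per_gpu_pflops = fp4_per_gpu_pflops / 2  # assumed: FP8 at half the FP4 rate

fp4_total = gpus_per_superchip * fp4_per_gpu_pflops  # 40 PetaFLOPS FP4
fp8_total = gpus_per_superchip * fp8_per_gpu_pflops  # 20 PetaFLOPS FP8
```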
#### Spectrum-X
This AI-optimized Ethernet fabric pairs the Spectrum-4 switch chip, with its 100 billion transistors, with the BlueField-3 DPU, forming an end-to-end networking platform for cloud AI workloads.
### Future Developments
NVIDIA plans to follow Blackwell with Blackwell Ultra in 2025, then the Rubin and Rubin Ultra architectures featuring HBM4 in the 2026–2027 timeframe. CPUs, networking, and interconnect solutions across the ecosystem are slated for significant upgrades over the same period, aimed at sustaining NVIDIA's lead in AI computing.
In summary, NVIDIA’s Blackwell AI platform represents a significant leap forward in AI performance, demonstrating a 30x real-time inference improvement over its predecessor, Hopper, and a 25x increase in energy efficiency. The journey in hardware innovation continues with promising future developments on the horizon.