OpenAI Got The Whole AI Squad To Accelerate Large-Scale AI Training - AMD, NVIDIA, Intel, Microsoft & Broadcom All-In On MRC

OpenAI Unites an AI Dream Team to Supercharge Large-Scale Model Training

OpenAI is taking a big step toward faster, more reliable large-scale AI training by teaming up with some of the most influential names in computing and networking: AMD, Broadcom, Intel, Microsoft, and NVIDIA. Together, they’ve developed a new networking protocol designed specifically for the brutal demands of training massive AI models in huge GPU clusters.

The initiative centers on a protocol called MRC (Multipath Reliable Connection). The goal is straightforward: improve GPU networking performance and make large training clusters more resilient when the network gets congested or parts of it fail. OpenAI has now released MRC through the Open Compute Project, a move meant to help the broader AI and infrastructure industry adopt and build on it.

Why does this matter? In large AI training runs, moving data between GPUs is a constant, high-stakes operation. If even a single data transfer arrives late, it can slow down synchronized training jobs and leave expensive GPUs waiting with nothing to do. These delays often come from network congestion, link issues, or device failures—and the bigger the cluster, the more frequently these problems show up.
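To make the straggler effect concrete, here is a small, purely illustrative Python sketch (not part of MRC or any OpenAI tooling, and the timing numbers are assumed) that models a synchronous training step: every GPU must finish its data exchange before the step can complete, so a single congested link sets the pace for the entire cluster.

```python
import random

# Illustrative model of a synchronous training step: the step cannot
# finish until every GPU's transfer has finished, so the slowest
# transfer ("straggler") gates the whole cluster.

NUM_GPUS = 1024
NOMINAL_MS = 10.0       # typical per-step transfer time (assumed value)
CONGESTED_MS = 250.0    # one transfer delayed by congestion (assumed value)

def step_time(transfer_times_ms):
    # All GPUs wait at the synchronization barrier for the last transfer.
    return max(transfer_times_ms)

# Healthy step: every transfer lands near the nominal time.
healthy = [NOMINAL_MS + random.uniform(-1, 1) for _ in range(NUM_GPUS)]

# One congested link: a single slow transfer stalls all 1024 GPUs.
congested = list(healthy)
congested[0] = CONGESTED_MS

print(f"healthy step:   {step_time(healthy):6.1f} ms")
print(f"congested step: {step_time(congested):6.1f} ms")

# Wasted capacity: every other GPU idles while waiting for the straggler.
idle_gpu_ms = sum(step_time(congested) - t for t in congested)
print(f"GPU time spent waiting this step: {idle_gpu_ms / 1000:.1f} GPU-seconds")
```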

MRC is designed to tackle those issues head-on. Built into the latest 800 Gb/s network interfaces, it allows a single data transfer to be spread across hundreds of independent, uninterrupted paths. That means the system can route around failures in microseconds instead of stalling, while also enabling simpler network control compared to traditional approaches.
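The multipath idea itself can be sketched in a few lines. The snippet below is a simplified Python illustration, not the actual MRC wire protocol: one transfer is split into chunks that are sprayed across many paths, and when a path is marked failed its chunks are immediately redistributed over the surviving paths instead of the whole transfer stalling.

```python
# Simplified illustration of multipath spraying with fast failover.
# This is NOT the MRC specification, just the general idea: split one
# transfer across many independent paths and reroute around failures.

def spray(chunks, paths):
    """Assign chunks round-robin across all currently healthy paths."""
    healthy = [p for p in paths if p["up"]]
    assignment = {p["id"]: [] for p in healthy}
    for i, chunk in enumerate(chunks):
        path = healthy[i % len(healthy)]
        assignment[path["id"]].append(chunk)
    return assignment

def reroute(assignment, failed_id, paths):
    """Resend only the failed path's chunks over the remaining paths."""
    stranded = assignment.pop(failed_id)
    for p in paths:
        if p["id"] == failed_id:
            p["up"] = False
    for path_id, chunks in spray(stranded, paths).items():
        assignment[path_id].extend(chunks)
    return assignment

# One logical transfer, sprayed across 8 paths (real NICs could use far more).
paths = [{"id": i, "up": True} for i in range(8)]
chunks = [f"chunk-{i}" for i in range(64)]
assignment = spray(chunks, paths)

# Path 3 fails mid-transfer: only its chunks are redistributed,
# the other paths keep streaming untouched.
assignment = reroute(assignment, failed_id=3, paths=paths)
print({path_id: len(c) for path_id, c in assignment.items()})
```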

One of the key ideas behind MRC is changing how an 800 Gb/s network interface is used. Rather than treating it as one massive pipe, the interface can be split into multiple smaller links. For example, a single 800 Gb/s interface can connect to eight different switches and operate like eight parallel 100 Gb/s network “planes.” This seemingly simple shift dramatically changes how large clusters can be built.

With this approach, a switch that would normally connect 64 ports at 800 Gb/s can instead connect 512 ports at 100 Gb/s. In practical terms, that can enable a fully connected network of roughly 131,000 GPUs using only two tiers of switches—where a conventional 800 Gb/s design might require three or even four tiers. Fewer tiers can mean lower complexity, fewer failure points, and better overall efficiency at scale.
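The scaling arithmetic behind those numbers is easy to reproduce. The short Python sketch below uses my own topology assumption of a standard non-blocking two-tier Clos/fat-tree layout (the article does not spell out the exact topology): splitting an 800 Gb/s port into eight 100 Gb/s planes multiplies the effective switch radix, and radix in turn sets how large a two-tier network can grow.

```python
# Back-of-the-envelope scaling math for the plane-splitting idea.
# Assumes a non-blocking two-tier Clos/fat-tree: each leaf switch uses
# half its ports for GPUs and half for uplinks to spine switches.

NIC_SPEED_GBPS    = 800
PLANE_SPEED_GBPS  = 100
SWITCH_PORTS_800G = 64            # ports on one switch at full 800 Gb/s

planes_per_nic = NIC_SPEED_GBPS // PLANE_SPEED_GBPS     # 8 planes
radix_100g     = SWITCH_PORTS_800G * planes_per_nic     # 512 ports

def two_tier_capacity(radix):
    # Leaf: radix/2 ports face GPUs, radix/2 face spines.
    # With radix-port spines, up to `radix` leaves can be fully connected.
    return (radix // 2) * radix

print(f"planes per 800G NIC:       {planes_per_nic}")
print(f"effective switch radix:    {radix_100g} x 100 Gb/s ports")
print(f"two-tier capacity at 800G: {two_tier_capacity(SWITCH_PORTS_800G):,} endpoints")
print(f"two-tier capacity at 100G: {two_tier_capacity(radix_100g):,} endpoints")
# -> 2,048 endpoints with 64-port switches vs 131,072 with a 512-port radix,
#    matching the ~131,000-GPU figure without adding a third switch tier.
```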

Technically, MRC extends RoCE (RDMA over Converged Ethernet), which lets network hardware move data directly between GPU and CPU memory without involving the host in every transfer, a capability that is critical for high-performance AI workloads.

OpenAI says it has already deployed MRC across its supercomputing environments running NVIDIA GB200 “Blackwell” GPUs used for training frontier models. These deployments include systems hosted at Oracle Cloud Infrastructure in Abilene, Texas, as well as Microsoft’s Fairwater supercomputers. The protocol has also been used to train multiple OpenAI models across NVIDIA and Broadcom hardware, demonstrating its ability to work across major ecosystem partners.

Looking ahead, MRC is expected to play a foundational role in OpenAI’s Stargate supercomputer effort being built at Oracle Cloud Infrastructure in Abilene. That project targets an enormous 10 GW of AI compute by 2029, and it has reportedly already brought more than 3 GW online in the past three months. By making MRC available to the wider industry, OpenAI is positioning the protocol as a shared building block for the next generation of AI supercomputers, helping the entire ecosystem reduce training slowdowns, improve reliability, and scale clusters more efficiently.