[Image: A close-up of a glowing semiconductor chip with intricate circuitry patterns suspended above a blue-lit circuit board.]

Huawei’s Ascend 950PR Courts China’s AI Market by Echoing CUDA via CANN, Challenging NVIDIA’s Stronghold

Huawei is making a fresh push to win over China’s biggest AI data centers with its latest accelerator, the Ascend 950PR. While it may not match NVIDIA’s top-end compute performance in every scenario, the new chip is drawing serious attention for a different reason: a much more CUDA-like software experience that could make it far easier for developers to move AI training and inference workloads onto Huawei hardware.

For years, Chinese chipmakers have tried to loosen NVIDIA’s grip on the AI market by improving architectures and adding features. But many large customers have continued to prefer NVIDIA, and the gap hasn’t been only about raw speed. CUDA has been a major deciding factor: it’s the software foundation developers and AI teams already know, and it underpins mature tools, workflows, and optimized code.

Huawei’s answer is an upgraded software stack called CANN Next. The big shift is that CANN Next now supports a SIMT-style programming model that looks and feels closer to CUDA, including concepts such as thread blocks, warps, and kernel launches. The goal isn’t simply to provide a basic compatibility layer. Instead, Huawei is pushing toward near drop-in equivalents to familiar CUDA patterns, effectively treating CUDA-like programming as a standard developer experience while still optimizing execution specifically for Ascend chips.
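Huawei hasn’t published CANN Next’s exact surface syntax, so as a reference point, here is the plain-CUDA pattern the new model is said to mirror: a per-thread kernel body, a global index derived from block and thread IDs, and a grid/block launch. The kernel and sizes below are illustrative only, not Huawei’s API.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Textbook SIMT kernel: each thread computes one element. Threads are
// grouped into blocks (and, on NVIDIA hardware, execute in warps of 32);
// blocks are laid out over a grid at launch time.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int block = 256;                        // threads per block
    int grid  = (n + block - 1) / block;    // enough blocks to cover n
    saxpy<<<grid, block>>>(n, 2.0f, x, y);  // kernel launch
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);            // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

If CANN Next supplies near drop-in analogues for blockIdx, threadIdx, and the launch syntax, porting code in this shape becomes largely mechanical rather than a rewrite.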

In practical terms, this approach aims to reduce friction for teams that have years of CUDA-centric development behind them. Developers can work in a CUDA-like model, while performance tuning under the hood is tailored to Ascend at scale—optimizing details like thread counts and block sizing to better match Huawei’s hardware design. That combination is a major reason the Ascend 950PR is being viewed as more compelling than earlier attempts.
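Huawei hasn’t detailed those tuning hooks publicly, but the CUDA side of the same decision gives a feel for the knob in question: instead of hard-coding a block size, the runtime can pick one that maximizes occupancy for whatever device the code lands on. This is a minimal CUDA sketch of that idea, not a description of CANN Next internals.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(int n, float a, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] *= a;
}

int main() {
    const int n = 1 << 22;
    float *y;
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) y[i] = 1.0f;

    // Let the runtime choose a block size that maximizes occupancy on
    // the current device, rather than baking in a magic number.
    int minGrid = 0, block = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGrid, &block, scale);
    int grid = (n + block - 1) / block;
    printf("chosen block size: %d\n", block);

    scale<<<grid, block>>>(n, 2.0f, y);
    cudaDeviceSynchronize();
    cudaFree(y);
    return 0;
}
```

A vendor stack doing this transparently, per architecture, is the kind of under-the-hood retuning described above.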

On the hardware side, the Ascend 950PR is positioned squarely for modern AI workloads that rely heavily on low-precision math. It reportedly supports low-precision formats down to FP4, delivering around 1 PFLOPS of FP8 compute and up to 2 PFLOPS at FP4. It’s also built for high-throughput scaling, with reported interconnect bandwidth of 2 TB/s.
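Taken at face value, those two compute figures are internally consistent with the common pattern of peak throughput doubling each time precision is halved on the same datapath (an assumption about the design, not a confirmed detail):

\[
\text{FP4 peak} \approx 2 \times \text{FP8 peak} = 2 \times 1\ \text{PFLOPS} = 2\ \text{PFLOPS}.
\]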

Memory is another key part of Huawei’s pitch. The 950PR is said to feature the company’s first self-built HBM solution, called HiBL 1.0, offering 128 GB of capacity and 1.6 TB/s of bandwidth. Beyond performance, controlling HBM supply can help reduce production bottlenecks, an important factor for any accelerator that hopes to see broad data center adoption.
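Combining the memory numbers with the FP8 figure above gives a quick roofline estimate of the chip’s balance:

\[
\frac{1 \times 10^{15}\ \text{FLOP/s}}{1.6 \times 10^{12}\ \text{B/s}} \approx 625\ \text{FLOP/byte},
\qquad
\frac{128\ \text{GB}}{1.6\ \text{TB/s}} = 80\ \text{ms}.
\]

In other words, an FP8 kernel needs on the order of 625 FLOPs of work per byte of HBM traffic before it stops being memory-bound, and a single full pass over the 128 GB of capacity takes roughly 80 ms.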

Interest appears to be rising quickly. Reports indicate that major hyperscalers such as ByteDance and Alibaba are preparing to place orders, and Huawei is aiming to produce as many as 750,000 units this year. If that supply target holds, it would address one of the most common constraints facing domestic AI chips: availability at meaningful scale.

The timing also matters. China’s hyperscalers have been under growing pressure to find dependable alternatives for AI compute, as sourcing leading-edge imported accelerators can bring added complexity and uncertainty. Some firms have leaned on offshore compute rentals, while others have increased investment in domestic options. With Ascend 950PR and the more CUDA-like direction of CANN Next, Huawei is aiming to become a more realistic default choice for AI training and inference inside China—assuming it can meet demand and customers are ready to deploy at scale.