China’s AI industry is stepping up its game by finding software solutions that maximize the performance of NVIDIA’s “cut-down” AI accelerators. DeepSeek, a Chinese AI company, has unveiled a project that leverages NVIDIA’s Hopper H800 accelerators to deliver a reported eight-fold increase in TFLOPS over the chip’s typical rating.
By harnessing the power of software, DeepSeek has developed an approach that maximizes the potential of NVIDIA’s Hopper H800 GPUs. Its latest release, FlashMLA, is a striking example of this work, delivering significant performance gains by optimizing memory usage and resource allocation for AI inference tasks.
During DeepSeek’s “Open Source Week,” the company revealed FlashMLA, a decoding kernel crafted specifically for NVIDIA’s Hopper GPUs. The project is publicly available on GitHub, marking a significant milestone for open AI tooling.
FlashMLA reportedly pushes the Hopper H800 to 580 TFLOPS for BF16 matrix multiplication, roughly eight times the typical industry rating. Even more impressive is the effective memory bandwidth, which through efficient utilization reaches up to 3,000 GB/s, nearly double the H800’s theoretical limit. These enhancements are achieved through programming expertise rather than hardware modifications.
The magic behind FlashMLA lies in its use of “low-rank key-value compression,” which breaks data down into smaller, more manageable chunks, speeding up processing and cutting memory consumption by 40%–60%. The kernel also employs a block-based paging system that dynamically adjusts memory allocation to task intensity, letting it handle variable-length sequences efficiently and boosting performance significantly.
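To make those two ideas concrete, here is a minimal NumPy sketch. All dimensions and projection matrices are invented for illustration and are not DeepSeek’s actual MLA configuration: a down-projection caches one small latent vector per token instead of full keys and values, and a block table maps token positions into fixed-size cache blocks.

```python
import numpy as np

# --- Low-rank key-value compression (illustrative sketch) ---
# Sizes below are made up for illustration, not DeepSeek's real config.
d_model = 1024   # hidden size per token
d_latent = 896   # compressed latent size per token
seq_len = 512    # tokens already in the KV cache

rng = np.random.default_rng(0)
hidden = rng.standard_normal((seq_len, d_model)).astype(np.float32)

W_down = rng.standard_normal((d_model, d_latent)).astype(np.float32) * 0.02
W_up_k = rng.standard_normal((d_latent, d_model)).astype(np.float32) * 0.02
W_up_v = rng.standard_normal((d_latent, d_model)).astype(np.float32) * 0.02

# Only the small latent is cached; K and V are re-expanded when attention runs.
latent_cache = hidden @ W_down        # (seq_len, d_latent)
keys = latent_cache @ W_up_k          # materialized on the fly
values = latent_cache @ W_up_v

# A conventional cache stores full K and V (2 * d_model floats per token);
# the compressed cache stores one latent (d_latent floats per token).
full_bytes = seq_len * 2 * d_model * 4
compressed_bytes = seq_len * d_latent * 4
savings = 1 - compressed_bytes / full_bytes
print(f"KV cache memory saved: {savings:.0%}")

# --- Block-based paging (illustrative sketch) ---
# The cache is carved into fixed-size blocks; a block table maps a
# sequence's logical positions to physical blocks, so a variable-length
# sequence wastes at most one partially filled block.
BLOCK_SIZE = 64
num_blocks_needed = -(-seq_len // BLOCK_SIZE)     # ceiling division
block_table = list(range(num_blocks_needed))      # logical -> physical
print(f"{seq_len} tokens -> {num_blocks_needed} blocks of {BLOCK_SIZE}")
```

With these toy sizes the compressed cache saves 56% of memory, in the same ballpark as the 40%–60% reduction cited above; the actual saving depends entirely on how small the latent rank is relative to the full key/value width.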
DeepSeek’s accomplishment with FlashMLA demonstrates how much AI computing performance software ingenuity alone can unlock. While the kernel is currently tailored to Hopper GPUs, it is natural to wonder what similar optimizations could bring to NVIDIA’s H100 accelerators. The future looks promising for the ongoing evolution of AI technology, driven by innovations like this one.