NVIDIA Boosts Llama 3.1 By 1.9x With Decoding Algorithm "Medusa"

Revolutionary Medusa Algorithm Catapults Llama 3.1 Performance by 1.9x: NVIDIA’s Latest Triumph

NVIDIA is pushing the boundaries of AI performance with its HGX H200 AI accelerators, now enhanced by “Medusa,” a speculative decoding algorithm implemented in the company’s TensorRT-LLM library. The leap forward is particularly evident with the Llama 3.1 models, showcasing NVIDIA’s deep commitment to advancing the software ecosystem to achieve stellar performance gains.

As large language models (LLMs) scale in complexity, delivering low latency and high throughput for real-time AI applications becomes increasingly crucial. Multi-GPU compute is now essential, and it relies on ultra-fast GPU-to-GPU communication and sophisticated software to keep every GPU fully utilized. Techniques like tensor parallelism and speculative decoding play a pivotal role here: distributing each layer’s calculations across GPUs and predicting multiple tokens per step significantly reduces token generation latency, ensuring a more interactive user experience.
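To make tensor parallelism concrete, here is a minimal NumPy sketch (purely illustrative, not NVIDIA’s implementation) of a column-parallel linear layer: the weight matrix is split across two hypothetical GPUs, each computes a partial output from the same activations, and the pieces are combined into the full result.

```python
import numpy as np

# Toy "activations": a batch of 4 tokens with hidden size 8.
x = np.random.randn(4, 8)

# Full weight matrix of a linear layer: hidden 8 -> output 16.
W = np.random.randn(8, 16)

# Column-parallel split: each of 2 hypothetical GPUs holds half the columns.
W_shards = np.split(W, 2, axis=1)            # two (8, 8) shards

# Each GPU multiplies the same activations by its own shard in parallel...
partials = [x @ shard for shard in W_shards]

# ...and the partial outputs are gathered into the full result.
y_parallel = np.concatenate(partials, axis=1)

# Identical (up to floating point) to the single-GPU computation.
assert np.allclose(y_parallel, x @ W)
```

In a real deployment, the gather step is a collective operation between GPUs, which is exactly why the fast GPU-to-GPU links described next are so important.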

For exceptionally low latency when serving Llama 3.1, cloud services can harness the full power of an NVIDIA HGX H200 server, equipped with eight H200 Tensor Core GPUs and four NVLink Switch chips, allowing each GPU to communicate with any other at a blazing 900 GB/s. This high bandwidth keeps multi-GPU communication from becoming a bottleneck in interactive scenarios.

NVIDIA’s HGX H200 systems leverage TensorRT-LLM, an open-source library built on TensorRT to optimize inference for the latest LLMs. TensorRT-LLM applies techniques like tensor parallelism and speculative decoding to push the boundaries of inference performance. Notably, the Medusa speculative decoding algorithm delivers remarkable low-latency performance, with Llama 3.1 70B achieving 268 tokens/second/user and Llama 3.1 405B reaching 108 tokens/second/user on the HGX H200.

Medusa speeds up token generation by as much as 1.9x on the HGX H200. Traditional transformer-based LLMs generate tokens sequentially, producing one token per generation step and potentially leaving Tensor Core capacity underutilized. Speculative decoding overcomes this by using a fast “draft model” to propose multiple subsequent tokens, which the main model then verifies in a single parallel forward pass. Each step can therefore emit several tokens at once, making far better use of the available GPU resources and accelerating token generation.
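The draft-then-verify loop is easier to see in code. Below is a schematic Python sketch of a single speculative decoding round, not TensorRT-LLM code: `target_next` and `draft_next` are hypothetical stand-ins for the two models, and greedy decoding is assumed to keep the example short.

```python
def speculative_step(target_next, draft_next, prompt, k=4):
    """One draft-then-verify round of speculative decoding.

    target_next(seq) -> next token under the slow, authoritative target model
    draft_next(seq)  -> next token under the fast draft model
    """
    # 1. The draft model cheaply proposes k candidate tokens.
    draft_seq = list(prompt)
    proposals = []
    for _ in range(k):
        token = draft_next(draft_seq)
        proposals.append(token)
        draft_seq.append(token)

    # 2. The target model checks all k proposals; on a GPU this is a single
    #    batched forward pass, emulated here with a simple loop.
    accepted = list(prompt)
    for token in proposals:
        if target_next(accepted) == token:
            accepted.append(token)        # proposal verified, keep it
        else:
            break                         # first mismatch ends acceptance

    # 3. The target model always contributes one guaranteed-correct token,
    #    so every round yields between 1 and k + 1 new tokens.
    accepted.append(target_next(accepted))
    return accepted


# Toy usage: both "models" just count upward, so every proposal is accepted.
def count_up(seq):
    return (seq[-1] + 1) % 10

print(speculative_step(count_up, count_up, [0]))   # [0, 1, 2, 3, 4, 5]
```

Production implementations also use probabilistic acceptance rules so that the output distribution matches the target model exactly, even when sampling rather than decoding greedily.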

Medusa, specifically, uses the original model as its own draft model, bypassing the complexity of maintaining a separate one. Additional decoding “heads” attached to the model predict token distributions for positions beyond the next token, streamlining the generation process. The result is a substantial boost in performance, with Llama 3.1 70B and Llama 3.1 405B generating tokens 1.5x and 1.9x faster, respectively.
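Conceptually, each head is a small network that reuses the base model’s final hidden state to guess one additional future token. The PyTorch sketch below illustrates the idea under simplifying assumptions; the class and its names are hypothetical, do not correspond to TensorRT-LLM or NeMo APIs, and the real Medusa heads use a somewhat different residual structure.

```python
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    """Simplified sketch of Medusa-style extra decoding heads."""

    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 3):
        super().__init__()
        # Head i predicts the token at offset i + 2; the base model's own
        # LM head still predicts the immediate next token (offset +1).
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, hidden_size),
                nn.SiLU(),
                nn.Linear(hidden_size, vocab_size),
            )
            for _ in range(num_heads)
        ])

    def forward(self, last_hidden: torch.Tensor) -> list[torch.Tensor]:
        # last_hidden: (batch, hidden_size), the base model's hidden state
        # at the final position. Every head reads the same state, so all
        # extra predictions come from one base-model forward pass.
        return [head(last_hidden) for head in self.heads]


# Toy usage: hidden size 16, vocabulary of 100 tokens.
heads = MedusaHeads(hidden_size=16, vocab_size=100)
last_hidden = torch.randn(2, 16)
candidate_logits = heads(last_hidden)            # 3 tensors of shape (2, 100)
candidates = [logits.argmax(-1) for logits in candidate_logits]
```

All candidate tokens are then verified by the base model, just as in ordinary speculative decoding, so only correct predictions survive.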

These Medusa heads are trained using NVIDIA TensorRT Model Optimizer, integrated with the NVIDIA NeMo framework. Because the base model still verifies every speculated token, the accelerated decoding yields the same accuracy as the base model while speeding up the process.
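As a rough illustration of what training those heads involves, here is a hypothetical sketch in the spirit of the original Medusa recipe, in which the base model stays frozen and only the heads learn; the function and names are illustrative, and the actual TensorRT Model Optimizer/NeMo pipeline differs in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def medusa_head_loss(heads, base_hidden, target_ids):
    """Summed cross-entropy loss over all Medusa heads.

    heads:       list of modules; head i maps hidden states to vocab logits
                 and learns to predict the token at offset i + 2.
    base_hidden: (batch, seq, hidden) states from the FROZEN base model.
    target_ids:  (batch, seq) ground-truth token ids.
    """
    total = torch.zeros(())
    for i, head in enumerate(heads):
        offset = i + 2                            # base LM head covers +1
        logits = head(base_hidden[:, :-offset])   # (batch, seq - offset, vocab)
        labels = target_ids[:, offset:]           # tokens each head must hit
        total = total + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), labels.reshape(-1)
        )
    return total


# Toy usage: 3 linear heads, hidden size 16, vocab 100, sequence length 12.
heads = [nn.Linear(16, 100) for _ in range(3)]
base_hidden = torch.randn(2, 12, 16)    # would come from the frozen base model
target_ids = torch.randint(0, 100, (2, 12))
loss = medusa_head_loss(heads, base_hidden, target_ids)
loss.backward()                         # gradients flow into the heads only
```

Because the base model is untouched, well-trained heads change only how fast tokens are produced, not which tokens the verified output contains.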

NVIDIA’s commitment to innovation spans every layer of the technology stack, from chips and systems to software libraries and algorithms. The NVIDIA HGX H200 with NVLink Switch and TensorRT-LLM provides exceptional real-time inference performance. As NVIDIA continues to refine and enhance its platform, users can look forward to even more impressive advancements in low-latency inference, enhancing the overall user experience and reducing inference costs.