AMD is moving quickly to make Google’s new Gemma 4 models easy to run across a wide range of PCs and servers. With day-one support now in place, Gemma 4 can be deployed on AMD hardware spanning everything from consumer Radeon graphics cards and Ryzen AI laptops to powerful Instinct accelerators used in cloud and enterprise data centers.
Gemma 4 is Google’s latest open-weights AI model family, offered in multiple sizes so developers can choose the right balance of speed, memory use, and output quality. The lineup spans compact E2B and E4B models up to a 31B dense model, plus a 26B-A4B Mixture-of-Experts (MoE) variant. The big takeaway is flexibility: Gemma 4 can scale from local experimentation on a desktop PC to high-throughput serving in a data center environment.
AMD says its support covers the full set of Gemma 4 models across its AI-enabled portfolio. That includes AMD Instinct GPUs for data centers, AMD Radeon GPUs for AI workstations and creator PCs, and AMD Ryzen AI processors designed for modern AI PCs. Just as importantly, this support lands inside popular tools many developers already use, including LM Studio and a range of widely adopted open-source projects such as vLLM, SGLang, llama.cpp, Ollama, and Lemonade.
One of the most practical ways to serve Gemma 4 on AMD GPUs is through vLLM, an inference framework built for efficient, high-throughput serving of many concurrent requests. AMD notes that Gemma 4 works across the range of GPUs supported by vLLM, spanning multiple generations of Instinct and Radeon products. For now, vLLM deployments should use the TRITON_ATTN attention backend; AMD also indicates that additional attention backends with further optimizations, particularly for MI300- and MI350-series hardware, are planned for the near future.
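As a rough sketch of what that looks like in practice, the snippet below uses vLLM's offline Python API with the backend override described above. The checkpoint id google/gemma-4-31b-it is a placeholder, not a confirmed repository name, and the backend string assumes the name AMD cites maps directly onto vLLM's environment variable.

```python
import os

# Per the article, Gemma 4 on AMD GPUs currently uses the Triton attention
# backend; vLLM reads this override from VLLM_ATTENTION_BACKEND, which must
# be set before vLLM initializes.
os.environ["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN"

from vllm import LLM, SamplingParams

# Hypothetical checkpoint id; substitute the actual Gemma 4 repo name.
llm = LLM(model="google/gemma-4-31b-it")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

The same engine can sit behind vLLM's OpenAI-compatible server for multi-user serving; the offline API shown here is simply the shortest path to a first token.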
For high-performance deployment on select data center GPUs, Gemma 4 is also supported through SGLang on AMD Instinct MI300X, MI325X, and MI355X accelerators. SGLang supports the whole Gemma 4 family, covering the dense models (E2B, E4B, 31B) and the MoE variant (26B-A4B). AMD highlights that these Gemma 4 models require the Triton attention backend for bidirectional image-token attention, a key capability for multimodal workflows. Another notable point for production planning: even the largest Gemma 4 model fits on a single MI300X GPU, whose 192 GB of HBM capacity allows tensor parallelism of 1, while higher-throughput workloads can scale out by raising the tensor-parallel degree.
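For a comparable single-GPU setup, SGLang's offline Engine API can be pointed at the model with tensor parallelism of 1 and the Triton attention backend. This is a minimal sketch: the model id is again hypothetical, and the exact keyword arguments may shift between SGLang releases.

```python
import sglang as sgl

# Hypothetical checkpoint id; substitute the actual Gemma 4 repo name.
llm = sgl.Engine(
    model_path="google/gemma-4-31b-it",
    tp_size=1,                   # a single MI300X (192 GB HBM) fits the model
    attention_backend="triton",  # needed for bidirectional image-token attention
)

prompts = ["Describe the difference between dense and MoE models."]
outputs = llm.generate(prompts, {"max_new_tokens": 128})
print(outputs[0]["text"])

llm.shutdown()
```

Raising tp_size (with matching GPU count) is the lever for the higher-throughput configurations AMD mentions.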
On the local PC side, Gemma 4 is positioned as straightforward to run in LM Studio, which builds on the open-source llama.cpp runtime. That means users can spin up Gemma 4 on supported hardware such as Ryzen AI and Ryzen AI Max processors, along with Radeon and Radeon PRO graphics cards. Pairing LM Studio with the latest AMD Software: Adrenalin Edition drivers is presented as the fastest path to getting up and running, making Gemma 4 accessible for local chat, development, testing, and offline workflows.
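For readers who prefer a scripted workflow over the LM Studio GUI, the same llama.cpp ecosystem is reachable from Python through the llama-cpp-python bindings. This is a minimal sketch, assuming a locally downloaded Gemma 4 GGUF file (the filename here is illustrative) and a build of the bindings with GPU offload enabled.

```python
from llama_cpp import Llama

# Hypothetical GGUF filename; use whichever Gemma 4 quantization you download.
llm = Llama(
    model_path="gemma-4-e4b-it-Q4_K_M.gguf",
    n_gpu_layers=-1,   # offload all layers to the Radeon / Ryzen AI GPU
    n_ctx=8192,        # context window; adjust to available VRAM
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what an open-weights model is."}]
)
print(out["choices"][0]["message"]["content"])
```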
AMD is also spotlighting Lemonade Server for those who want a local LLM server with OpenAI-compatible APIs. The idea here is convenience: developers can slot a Gemma 4 model behind a familiar API interface while still taking advantage of AMD acceleration. Lemonade accelerates inference on Radeon and Radeon PRO GPUs via ROCm, and it can also use the XDNA 2 NPU found in newer Ryzen AI processors.
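Because Lemonade speaks the OpenAI API, any standard OpenAI client can talk to it. The sketch below assumes Lemonade's default local endpoint and uses an illustrative model id; check your install's documentation and model list for the actual values.

```python
from openai import OpenAI

# Lemonade exposes an OpenAI-compatible endpoint; the port and path here
# follow its documented defaults and may differ in your install. The API key
# is a placeholder, since local servers typically don't validate it.
client = OpenAI(base_url="http://localhost:8000/api/v1", api_key="lemonade")

# Illustrative model id; enumerate what's loaded with client.models.list().
resp = client.chat.completions.create(
    model="Gemma-4-E4B-it",
    messages=[{"role": "user", "content": "What hardware are you running on?"}],
)
print(resp.choices[0].message.content)
```

The payoff of this design is that existing OpenAI-based tooling can be retargeted at local Gemma 4 by changing only the base URL.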
On the NPU front, AMD says developers will be able to deploy Gemma 4 models through Lemonade Server with support for the latest XDNA 2 NPU. NPU enablement for the Gemma 4 E2B and E4B models is expected to arrive in the next Ryzen AI software update. AMD also notes that this update will be integrated into Lemonade and exposed to developers directly through ONNX Runtime APIs, which should make it easier to integrate Gemma 4 into Windows and cross-platform AI applications that already rely on ONNX tooling.
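Once that update lands, a typical ONNX Runtime path would look roughly like the following, using the generate loop from the onnxruntime-genai package. Treat this as a sketch only: the local model path is hypothetical, and the exact API surface varies across onnxruntime-genai versions.

```python
import onnxruntime_genai as og

# Hypothetical local path to an NPU-ready Gemma 4 ONNX package.
model = og.Model("./gemma-4-e2b-onnx")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Hello from the NPU!"))

# Token-by-token decode loop, the usual onnxruntime-genai pattern.
while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))
```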
For developers and enthusiasts searching for an efficient way to run Google Gemma 4 on AMD GPUs or Ryzen AI laptops, the message is clear: broad compatibility is already here, deployment options cover both data center serving and local on-device use, and additional performance tuning for newer accelerator generations is on the way.