NVIDIA Brings Instant DiffusionGemma Support to RTX and DGX, Hitting 150 Tokens per Second on DGX Spark

NVIDIA Brings Day-One RTX and DGX Support to Google DeepMind’s DiffusionGemma AI Model

Google DeepMind has introduced DiffusionGemma, a new open-weight AI model designed to make text generation faster, more efficient, and easier to run locally. Alongside the launch, NVIDIA has confirmed full support for the model across its GeForce RTX GPUs, RTX PRO workstations, and DGX systems, giving developers and AI enthusiasts a ready-made path to test and deploy the model from day one.

DiffusionGemma stands out because it generates text with a diffusion-based approach. Instead of producing one token at a time, as autoregressive models do, it can denoise up to 256 tokens in parallel during each step. Because each step refines many tokens at once, a response can need far fewer sequential steps, which matters most in local inference scenarios where speed and latency are the priority.
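To make the parallel-decoding idea concrete, here is a rough back-of-the-envelope sketch that counts sequential forward passes for both approaches. The 256-token block size comes from the description above; the number of denoising steps per block is not stated anywhere in the announcement, so `steps_per_block=8` is a purely hypothetical placeholder, as are the function names.

```python
import math


def autoregressive_passes(n_tokens: int) -> int:
    """An autoregressive model needs one forward pass per generated token."""
    return n_tokens


def diffusion_passes(n_tokens: int, block_size: int = 256, steps_per_block: int = 8) -> int:
    """Sequential passes when each pass denoises a whole block in parallel.

    block_size=256 is the figure quoted in the article.
    steps_per_block=8 is an illustrative assumption, not a published value.
    """
    blocks = math.ceil(n_tokens / block_size)
    return blocks * steps_per_block


if __name__ == "__main__":
    n = 1024
    print("autoregressive passes:", autoregressive_passes(n))
    print("diffusion passes (assumed 8 steps/block):", diffusion_passes(n))
```

Pass counts are not wall-clock time: each diffusion pass processes a full block and costs more compute than a single-token pass, so the real speedup depends on the hardware and on how many denoising steps the model actually uses.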

The model is built on Google’s Gemma 4 architecture and uses a 26-billion-parameter mixture-of-experts design. Only around 3.8 billion of those parameters are active per step, so each step costs far less compute than it would in a dense model of the same total size. DiffusionGemma supports text and image modalities, offers a context length of up to 256,000 tokens, and is available in BF16 and NVFP4 precision formats.
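The parameter counts and precision formats translate into rough memory numbers. The sketch below is an estimate under stated assumptions, not an official sizing: it counts weights only (no KV cache, activations, or runtime overhead), treats BF16 as 16 bits per weight and NVFP4 as a flat 4 bits per weight (ignoring the small extra cost of NVFP4 scaling factors), and uses decimal gigabytes. The helper name `weight_gb` is made up for illustration.

```python
TOTAL_PARAMS = 26e9    # total mixture-of-experts parameters (from the article)
ACTIVE_PARAMS = 3.8e9  # approximate parameters active per step (from the article)


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB for a given precision."""
    return params * bits_per_weight / 8 / 1e9


# Share of the model's parameters that participate in any single step.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

if __name__ == "__main__":
    print(f"active fraction: {active_fraction:.1%}")
    print(f"BF16 weights:  ~{weight_gb(TOTAL_PARAMS, 16):.0f} GB")
    # Assumption: 4 bits per weight, scaling-factor overhead not included.
    print(f"NVFP4 weights: ~{weight_gb(TOTAL_PARAMS, 4):.0f} GB")
```

All 26 billion parameters still have to be resident in memory even though only a fraction is used per step, which is why the total count drives the memory estimate and the active count drives the compute estimate.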

One of the biggest advantages of DiffusionGemma is that it is open-weight under the Apache 2.0 license. That means developers, researchers, and businesses can run it locally without relying on cloud services or paying per-token usage fees. NVIDIA is also supporting the model with optimized checkpoints and compatibility across popular AI development tools, making it easier to integrate into existing workflows.

According to NVIDIA, DiffusionGemma can run up to four times faster than a comparable autoregressive model. On a single NVIDIA H100 Tensor Core GPU, the model can exceed 1,000 tokens per second. NVIDIA rates DGX Spark at more than 150 tokens per second, while DGX Station can deliver up to 800 tokens per second for low-latency text generation and agentic AI workflows.
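For a sense of what those throughput figures mean in practice, the sketch below converts them into approximate generation time for a response of a given length. It treats the quoted numbers as fixed rates, although NVIDIA states them as "more than" or "up to", and it ignores prompt processing time; the dictionary and function names are illustrative only.

```python
# Headline throughput figures quoted by NVIDIA, in tokens per second.
# Treated here as nominal constants for illustration.
QUOTED_RATES = {
    "DGX Spark": 150,
    "DGX Station": 800,
    "H100": 1000,
}


def seconds_to_generate(n_tokens: int, tokens_per_second: float) -> float:
    """Time to emit n_tokens at a constant generation rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return n_tokens / tokens_per_second


if __name__ == "__main__":
    response_length = 2000  # hypothetical response size in tokens
    for system, rate in QUOTED_RATES.items():
        t = seconds_to_generate(response_length, rate)
        print(f"{system}: ~{t:.1f} s for {response_length} tokens")
```

The gap matters most for agentic loops, where a model may generate many responses in sequence and per-response delay adds up.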

NVIDIA’s support covers a broad range of hardware, from high-end data center systems to local desktop setups. The company is using its Tensor Core architecture and CUDA software stack to accelerate DiffusionGemma without requiring users to perform extra tuning. This makes the model more accessible for developers who want fast local AI performance without spending time on complex optimization work.

DGX Spark is positioned as a personal AI supercomputer for local development, prototyping, research, and autonomous agent workflows. It is powered by the NVIDIA GB10 Grace Blackwell Superchip and includes 128GB of unified memory, offering enough headroom for local AI experimentation and fine-tuning.

For professional users, NVIDIA RTX PRO 6000 workstations provide the performance needed for low-latency generation, AI development, and agentic loops inside demanding production workflows. These systems are aimed at developers, researchers, creators, and AI professionals who need powerful local inference without depending on remote servers.

DGX Station is designed for users building and scaling advanced AI workloads at their desks. With 748GB of coherent memory and support for very large models, NVIDIA is presenting it as a high-speed local AI platform for frontier AI development, inference, and agent-based applications.

GeForce RTX GPUs are also part of the rollout, giving PC users and AI hobbyists a way to run DiffusionGemma on consumer hardware. NVIDIA says support through llama.cpp is coming soon, which should make local experimentation even more practical for desktop users.

DiffusionGemma’s combination of open access, fast parallel token generation, long context support, and NVIDIA hardware acceleration could make it an important model for local AI development. By removing cloud dependency and reducing latency, it gives users more control over privacy, cost, and performance.

For anyone with an RTX 5090 or a DGX Spark system, DiffusionGemma can already be tested using NVIDIA’s ready-to-use AI software stack. With day-one support across RTX, RTX PRO, and DGX platforms, NVIDIA is positioning its hardware ecosystem as a strong foundation for the next wave of open-weight AI models.
