DiffusionGemma: 1100 Tokens/sec: Google’s Fastest Open Model Yet Locally

DiffusionGemma: 1100 Tokens/sec: Google’s Fastest Open Model Yet Locally

More

Descriptions:

Fahd Mirza installs and stress-tests Google DeepMind’s DiffusionGemma — a 26-billion-parameter mixture-of-experts model that abandons autoregressive token generation in favor of discrete diffusion. Rather than predicting one token at a time, the model starts with a canvas of 256 random noisy tokens and refines all of them simultaneously across multiple denoising passes, enabling bidirectional attention where every token sees every other token. The result: speeds exceeding 1,100 tokens per second on a single GPU, with only 3.8 billion parameters active during inference.

The setup is demonstrated on Ubuntu with an NVIDIA H100 (80GB VRAM) using both VLLM (serving an OpenAI-compatible local endpoint with a 256K context window) and Hugging Face Transformers. At full precision the model consumes approximately 50GB of VRAM; quantized versions fit within around 18GB. Mirza works through a common CUDA library path error users will encounter and shows both serving approaches side by side.

Capability tests include generating a complex animated SVG depicting real-time tectonic plate movement, building a responsive four-tab UI with CSS and JavaScript from a single prompt, and a multimodal vision task asking the model to judge whether a car can pass beneath a barrier in a photograph. DiffusionGemma is Apache 2.0 licensed and supports text, image, and video inputs, positioning it as a notable open alternative for latency-sensitive inference workloads.


📺 Source: Fahd Mirza · Published June 10, 2026
🏷️ Format: Tutorial Demo

1 Item

Channels