DiffusionGemma: 1100 Tokens/sec: Google’s Fastest Open Model Yet Locally

Tutorials2 months ago

DiffusionGemma: 1100 Tokens/sec: Google’s Fastest Open Model Yet Locally

Descriptions:

Fahd Mirza installs and stress-tests Google DeepMind’s DiffusionGemma — a 26-billion-parameter mixture-of-experts model that abandons autoregressive token generation in favor of discrete diffusion. Rather than predicting one token at a time, the model starts with a canvas of 256 random noisy tokens and refines all of them simultaneously across multiple denoising passes, enabling bidirectional attention where every token sees every other token. The result: speeds exceeding 1,100 tokens per second on a single GPU, with only 3.8 billion parameters active during inference.

The setup is demonstrated on Ubuntu with an NVIDIA H100 (80GB VRAM) using both VLLM (serving an OpenAI-compatible local endpoint with a 256K context window) and Hugging Face Transformers. At full precision the model consumes approximately 50GB of VRAM; quantized versions fit within around 18GB. Mirza works through a common CUDA library path error users will encounter and shows both serving approaches side by side.

Capability tests include generating a complex animated SVG depicting real-time tectonic plate movement, building a responsive four-tab UI with CSS and JavaScript from a single prompt, and a multimodal vision task asking the model to judge whether a car can pass beneath a barrier in a photograph. DiffusionGemma is Apache 2.0 licensed and supports text, image, and video inputs, positioning it as a notable open alternative for latency-sensitive inference workloads.

📺 Source: Fahd Mirza · Published June 10, 2026
🏷️ Format: Tutorial Demo

1 Item

Channels

No Image Available

Fahd Mirza

Tags

DeepMind Gemma 4 Google Transformers VLLM

Prev

How To Use Claude Fable 5 – Tips And Tricks Most People Miss

Next

Google’s Agents CLI: The CLI + Skills Combination to Ship AI Agents EASILY

18 Related Posts

Related Posts

08:04

Tutorials

Herdr: Run Multiple AI Coding Agents in Parallel from Your Terminal

1 hour ago

15:54

Tutorials

Buzz Huddle Test: 4 Humans, 2 AI Agents

1 hour ago

15:54

Tutorials

AI Video 101: How to Master AI Videos (Beginner to Advanced)

1 day ago

08:12

Tutorials

How to Run Kimi K3 Locally (3 Ways)

1 day ago

55:16

Tutorials

Claude Code + Codex Can FINALLY Work Together (Buzz AI)

1 day ago

22:53

Tutorials

The Viral $1 Website Effect That Looks Like $10K (Tutorial)

1 day ago