Open Models at Google DeepMind — Cassidy Hardin, Google DeepMind


Description:

Google DeepMind researcher Cassidy Hardin presents a detailed technical breakdown of Gemma 4, the latest generation of Google’s open-source model family launched in late April 2026. The lineup spans four sizes: two on-device ‘effective’ models (2B and 4B) optimized for phones, iPads, and laptops, alongside a 26B mixture-of-experts model and a flagship 31B dense model. All Gemma 4 models ship under an Apache 2.0 license, a deliberate move to broaden commercial accessibility.

The 31B dense model ranks third on the global LM Arena leaderboard—outperforming models more than 20 times its size—with a 256k context window and native support for reasoning, function calling, and structured JSON outputs. The 26B MoE activates only 8 of its 128 experts per forward pass, requiring just 3.8 billion active parameters during inference. Architectural changes include a 5:1 interleaved local-to-global attention layer ratio, sliding window attention (1,024 tokens for larger models), and grouped query attention to reduce memory pressure.
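The two mechanisms above can be sketched in a few lines: top-k expert routing (8 of 128 experts active per token) and a 5:1 interleave of local sliding-window and global attention layers. This is an illustrative sketch under assumed names and sizes, not Gemma 4's actual configuration or API.

```python
import numpy as np

def pick_experts(router_logits: np.ndarray, k: int = 8) -> np.ndarray:
    """Top-k MoE routing: of the 128 experts, only the k highest-scoring
    are activated for a token, which keeps active parameters small."""
    return np.argsort(router_logits)[-k:][::-1]

def layer_schedule(num_layers: int, local_to_global: int = 5) -> list:
    """A 5:1 interleave: five local (sliding-window) attention layers
    for every one global attention layer."""
    pattern = ["local"] * local_to_global + ["global"]
    return [pattern[i % len(pattern)] for i in range(num_layers)]

def sliding_window_visible(q: int, k: int, window: int = 1024) -> bool:
    """Causal sliding-window attention: query position q may attend to
    key position k only if k falls within the last `window` positions."""
    return 0 <= q - k < window

schedule = layer_schedule(12)
print(schedule[:6])  # five 'local' layers, then one 'global'
```

With a 1,024-token window, a query at position 2,000 can attend to position 1,500 but not to position 500, bounding per-layer attention memory regardless of the 256k context length.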

Hardin goes deep on Per Layer Embeddings (PLE), the key innovation enabling the effective models' on-device efficiency. PLE adds a dedicated 256-dimensional embedding table per layer, stored in flash memory rather than VRAM—dramatically reducing the memory footprint that constrains mobile inference. The 2B effective model carries 35 layers and the 4B carries 42, with token representations refined at each stage. This architecture allows the effective models to significantly outperform prior Gemma generations at the same scale.
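The PLE idea as described can be sketched as follows: each layer owns a small embedding table that stays in flash, and only the rows for the current tokens are fetched per layer, so the tables never need to sit in accelerator memory. The in-memory list standing in for flash-resident tables, and all names here, are illustrative assumptions, not Gemma's actual implementation.

```python
import numpy as np

VOCAB, PLE_DIM, NUM_LAYERS = 1000, 256, 35  # the 2B model carries 35 layers

rng = np.random.default_rng(0)
# Stand-in for flash-resident per-layer tables; a real on-device runtime
# would memory-map these from storage instead of holding them in VRAM.
flash_tables = [rng.standard_normal((VOCAB, PLE_DIM), dtype=np.float32)
                for _ in range(NUM_LAYERS)]

def per_layer_embedding(layer: int, token_ids: np.ndarray) -> np.ndarray:
    """Fetch only the rows this layer needs for the current tokens —
    a small working set per step, regardless of total table size."""
    return flash_tables[layer][token_ids]

tokens = np.array([3, 17, 42])
hidden = np.zeros((len(tokens), PLE_DIM), dtype=np.float32)
for layer in range(NUM_LAYERS):
    # Token representations are refined at each stage: inject the
    # layer's dedicated embedding before the (omitted) transformer block.
    hidden = hidden + per_layer_embedding(layer, tokens)

print(hidden.shape)  # (3, 256)
```

The per-step fetch is `len(tokens) × 256` floats per layer, which is why offloading the full tables to flash trades negligible bandwidth for a large VRAM saving.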


📺 Source: AI Engineer · Published April 27, 2026
🏷️ Format: Keynote Launch
