The AI Frontier: from Gemini 3 Deep Think distilling to Flash — Jeff Dean

Description:

Jeff Dean, Chief AI Scientist at Google DeepMind, joins Latent Space for an expansive technical conversation covering the Gemini model family’s design philosophy, the role of distillation in Google’s model strategy, and a remarkably detailed first-principles analysis of AI accelerator hardware economics — grounded in Dean’s decades of experience co-designing TPUs alongside the teams that train on them.

On the model side, Dean explains how distillation, a technique he co-developed in 2014 originally to compress image-classification ensembles, has become central to delivering Gemini Flash and other efficient variants: you cannot build a highly capable small model without first having a frontier model to distill from, which makes the expensive frontier investment a prerequisite for, rather than an alternative to, efficient deployment. He also discusses how sparse architectures and mixture-of-experts approaches are being revisited as hardware has evolved, and how Google balances its obligation to billions of existing users against the need to push the capability frontier.
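
The mechanics are worth a sketch: the small "student" model is trained to match the large "teacher" model's temperature-softened output distribution in addition to the hard labels. Below is a minimal NumPy sketch of that classic soft-target loss from the Hinton, Vinyals, and Dean paper; the temperature, the mixing weight `alpha`, and the function names are illustrative choices, not details taken from the episode.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; higher temperatures produce softer targets
    # that expose the teacher's knowledge about wrong-class similarity.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=4.0, alpha=0.5):
    # Soft term: cross-entropy between teacher and student distributions,
    # both softened by the same temperature.
    soft_teacher = softmax(teacher_logits, temperature)
    log_soft_student = np.log(softmax(student_logits, temperature) + 1e-12)
    soft_loss = -(soft_teacher * log_soft_student).sum(axis=-1).mean()

    # Hard term: ordinary cross-entropy against the ground-truth labels.
    log_student = np.log(softmax(student_logits) + 1e-12)
    hard_loss = -log_student[np.arange(len(hard_labels)), hard_labels].mean()

    # Soft-target gradients scale as 1/T^2, so the soft term is multiplied
    # by T^2 to keep the two components on a comparable footing.
    return alpha * temperature**2 * soft_loss + (1 - alpha) * hard_loss

# Tiny demo on random logits, just to show the loss runs end to end.
rng = np.random.default_rng(0)
teacher = 3.0 * rng.normal(size=(4, 10))  # a confident (peaky) teacher
student = rng.normal(size=(4, 10))        # an untrained student
labels = rng.integers(0, 10, size=4)
print(f"distillation loss: {distillation_loss(student, teacher, labels):.3f}")
```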

The hardware discussion is among the most technically rich available from any Google executive. Dean gives an energy-based explanation of why batching is economically necessary on TPUs: moving a model parameter from on-chip SRAM into a multiply unit costs roughly 1,000 picojoules, while the multiply itself costs about 1 picojoule, so batch-size-one inference spends almost its entire energy budget on data movement rather than computation. He connects this to the economics of building a custom ASIC per model once training runs reach billion-dollar scale, to Google's 3D mesh TPU topology, and to the SRAM vs. HBM tradeoff when serving smaller models spread across many chips. Essential viewing for anyone working at the intersection of model development and infrastructure.
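
Those two figures make the batching argument easy to check by hand, as in the back-of-the-envelope sketch below. The 1,000 pJ and 1 pJ costs are the ones quoted in the conversation; the assumption that each parameter is fetched from SRAM exactly once per batch and reused for every example in it is a deliberate simplification, and the batch sizes are arbitrary.

```python
# Amortizing per-parameter data-movement energy over a serving batch.
MOVE_PJ = 1000.0  # ~energy to move one parameter from SRAM to the multiply unit
MAC_PJ = 1.0      # ~energy for one multiply using that parameter

for batch in (1, 8, 64, 512):
    # The parameter is moved once, then reused for every example in the batch.
    total_pj = MOVE_PJ + MAC_PJ * batch
    per_example = total_pj / batch
    moved_frac = MOVE_PJ / total_pj
    print(f"batch={batch:4d}  energy/example ≈ {per_example:7.1f} pJ  "
          f"({moved_frac:.0%} spent on data movement)")
```

At batch size one, essentially all of the roughly 1,001 pJ per example goes to moving the parameter; by batch 512 the cost is amortized to about 3 pJ per example, which is the economic case for large serving batches.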

📺 Source: Latent Space · Published February 12, 2026
🏷️ Format: Interview
