Building Generative Image & Video models at Scale – Sander Dieleman (Veo and Nano Banana)
Description:

Sander Dieleman, research scientist at Google DeepMind and a member of the generative media team behind Veo and Imagen (Nano Banana), presents a behind-the-scenes technical overview of what goes into training large-scale generative image and video models. Delivered at the AI Engineer conference, the talk spans eight structured sections and offers a rare insider perspective on production diffusion system design from one of the field’s leading labs.

The presentation moves through data curation — which Dieleman argues is critically underrated relative to model architecture work, and remains difficult to publish on given competitive sensitivity — followed by latent representations (explaining why pixel-space training gave way to compressed latent spaces as scale increased), the core mechanics of diffusion models grounded in Fourier frequency analysis, neural network architecture choices, and training-at-scale considerations. He then covers sampling strategies unique to diffusion models, distillation techniques for reducing inference steps without shrinking model size, and the control signals used to make models reliably follow user intent.

The frequency-domain explanation of why diffusion models work so well for visual data — connecting power-law image spectra, Gaussian noise characteristics, and the coarse-to-fine generation process — is a standout section rarely treated at this depth in public talks. For ML engineers, researchers, and technically minded practitioners building or studying generative media systems, this is high-signal reference material from someone actively training state-of-the-art models at Google DeepMind.
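The spectral argument summarized above can be illustrated numerically: natural images have power-law spectra (most power in low frequencies), while Gaussian noise has a roughly flat spectrum, so added noise swamps fine detail first and the reverse process generates coarse-to-fine. Below is a minimal NumPy sketch of that intuition, not code from the talk; the 1/f-shaped test image and frequency bands are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256

# Radial frequency grid for an n x n FFT.
fx = np.fft.fftfreq(n)[:, None]
fy = np.fft.fftfreq(n)[None, :]
freq = np.sqrt(fx**2 + fy**2)
freq[0, 0] = 1.0  # avoid dividing by zero at the DC component

# Synthesize a "natural-like" image by imposing a 1/f amplitude spectrum
# on white noise (natural images empirically show power-law spectra).
white = rng.standard_normal((n, n))
image = np.fft.ifft2(np.fft.fft2(white) / freq).real

def band_power(x, lo, hi):
    """Mean spectral power of x over radial frequencies in [lo, hi)."""
    spec = np.abs(np.fft.fft2(x)) ** 2
    mask = (freq >= lo) & (freq < hi)
    return spec[mask].mean()

noise = rng.standard_normal((n, n))

# Compare low- vs high-frequency power for each signal.
img_ratio = band_power(image, 0.01, 0.05) / band_power(image, 0.3, 0.5)
noise_ratio = band_power(noise, 0.01, 0.05) / band_power(noise, 0.3, 0.5)

print(f"image low/high power ratio: {img_ratio:.1f}")    # large: low freqs dominate
print(f"noise low/high power ratio: {noise_ratio:.2f}")  # near 1: flat spectrum
```

Because the noise spectrum is flat while image power falls off steeply with frequency, the signal-to-noise ratio at high frequencies collapses early in the forward process, which is one way to see why diffusion sampling recovers global structure before fine texture.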


📺 Source: AI Engineer · Published April 21, 2026
🏷️ Format: Deep Dive
