Inside xAI: Building Grok Imagine in 3 Months, Videogen vs World Models, and Video Agents— Ethan He

Descriptions:

Ethan He, a former xAI engineer who built Grok Imagine from scratch, joins the Latent Space podcast for a detailed technical conversation on video generation, world models, and the architectural decisions behind one of the fastest model launches in recent memory. He describes joining xAI in mid-2025, when the video team had no infrastructure, no data, and no model, and shipping Grok Imagine 0.9 within three months with a small group of engineers.

Grok Imagine 0.9 is notable for being what He claims is the first audio-video joint generation transformer deployed at scale. The alignment challenge was significant: audio has a discrete component (speech, approximable as tokens) and a continuous component (music, which cannot be modeled as discrete tokens), and most language models at the time handled neither well in the context of synchronized video. The conversation covers how these modalities were brought together, including the use of synthetic data pipelines and cross-modal alignment techniques.

The interview goes deep on diffusion model distillation: consistency models, distribution matching distillation using GAN discriminators for one-step generation, and the tradeoffs involved in reducing inference steps without losing quality. He also offers a broader observation: improvements in video generation models are now driven more by language model quality than by video-specific architecture advances. His background runs from Cosmos at Nvidia (a world-scale video foundation model for robotics) to Grok Imagine pre-training and post-training, including reference-video conditioning and long-horizon real-time video generation. A technically substantive interview for engineers working on multimodal AI systems.


📺 Source: Latent Space · Published June 01, 2026
🏷️ Format: Interview
