Moonlake: Multimodal, Interactive, and Efficient World Models — with Fan-yun Sun and Chris Manning

Description:

Moonlake is a startup building interactive world models for training embodied AI agents, and this Latent Space episode features the company’s two leads: Fan-yun Sun, the CEO and PhD researcher who previously worked with Nvidia Research on synthetic data pipelines, and Chris Manning, the legendary Stanford NLP professor who serves as co-founder. Together they explain why world models represent a fundamentally different — and in their view, better — path toward general intelligence than pure video generation.

Sun traces the genesis of Moonlake to her Nvidia work demonstrating that synthetically generated interactive environments could match real-world data for multimodal pretraining, while also observing that demand for such environments (from robotics firms, autonomous systems labs, and academic groups) was growing faster than supply. The key architectural insight is a separation of concerns: a multimodal reasoning model handles causality, logical persistence, and the consequences of agent actions, while a second model called Rey — a diffusion model trained on top of the reasoning model’s abstract world representations — handles photorealistic rendering without violating spatial consistency.
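The separation of concerns described above can be illustrated with a minimal sketch. All class and method names here are hypothetical stand-ins, not Moonlake's actual API: a reasoning model maintains an abstract world state and applies the causal consequences of actions, while a separate renderer (standing in for Rey, which in the real system is a diffusion model) produces output conditioned only on that state, so it cannot violate the spatial facts the reasoning model has committed to.

```python
# Hypothetical sketch of the two-model split; names are illustrative only.
from dataclasses import dataclass, field


@dataclass
class WorldState:
    """Abstract world representation maintained by the reasoning model."""
    objects: dict = field(default_factory=dict)  # object id -> properties
    step: int = 0


class ReasoningModel:
    """Handles causality and logical persistence of agent actions."""

    def transition(self, state: WorldState, action: str) -> WorldState:
        new_objects = {k: dict(v) for k, v in state.objects.items()}
        if action.startswith("move "):
            _, obj, pos = action.split()
            if obj in new_objects:  # objects persist across steps
                new_objects[obj]["pos"] = pos
        return WorldState(objects=new_objects, step=state.step + 1)


class Renderer:
    """Stand-in for Rey: renders conditioned only on the abstract state."""

    def render(self, state: WorldState) -> str:
        # A real system would run a diffusion model here; we emit a text
        # description so spatial consistency is easy to inspect.
        scene = ", ".join(
            f"{k}@{v['pos']}" for k, v in sorted(state.objects.items())
        )
        return f"frame {state.step}: {scene}"


state = WorldState(objects={"cube": {"pos": "A"}})
reasoner, renderer = ReasoningModel(), Renderer()
state = reasoner.transition(state, "move cube B")
print(renderer.render(state))
```

Because the renderer reads only the reasoning model's state, any frame it emits is automatically consistent with the action history; the design choice is that photorealism and causality are optimized by different models with a clean interface between them.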

Manning frames the work in terms of his broader conviction that language alone is insufficient for general intelligence, and that truly interactive multimodal environments are necessary. The discussion also covers evaluation challenges (good world-model benchmarks are harder to construct than question-answering benchmarks), the commercial pull from robotics and embodied AI customers, and the team's belief that neural rendering of this kind will eventually replace today's graphics techniques, from DLSS-style neural upscaling to traditional rasterization pipelines.


📺 Source: Latent Space · Published April 02, 2026
🏷️ Format: Podcast