How DeepMind’s New AI Predicts What It Cannot See

Description:

Two Minute Papers host Dr. Károly Zsolnai-Fehér breaks down D4RT (pronounced “dart”), a new model from Google DeepMind that performs full 4D scene reconstruction — three spatial dimensions plus time — from ordinary video input. Where earlier approaches required separate specialized models for depth estimation, motion tracking, and camera pose, D4RT handles all three inside a single transformer architecture, eliminating the “test-time optimization” step that made prior pipelines slow and brittle.

The architectural insight is parallelization: an encoder builds a global scene representation once, then independent decoder queries reconstruct individual points at specific timestamps without needing to communicate with each other. This design allows D4RT to scale to millions of parallel queries and achieve speeds up to 300 times faster than Gaussian splat-based methods. The model can also track points through occlusion — predicting the position of objects it cannot currently see based on their trajectory before and after they disappear from frame.
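The encode-once, query-independently pattern described above can be illustrated with a minimal sketch. This is not D4RT's actual implementation — all function names, dimensions, and the toy attention head are hypothetical — it only shows why independent queries parallelize trivially: each (pixel, timestamp) query cross-attends to the same frozen scene tokens and never touches another query.

```python
import numpy as np

# Toy sketch of query-based 4D decoding (all names and shapes are
# illustrative, not D4RT's real architecture). The encoder runs once;
# each query is then decoded independently, so millions of queries
# can be batched with no inter-query communication.

rng = np.random.default_rng(0)

def encode_scene(video_frames, n_tokens=128, dim=64):
    """Stand-in encoder: map a video to a set of global scene tokens."""
    return rng.standard_normal((n_tokens, dim))

def decode_query(scene_tokens, query_vec):
    """One independent query: cross-attend to the scene tokens and
    emit a toy 3D point for that (pixel, timestamp) query."""
    scores = scene_tokens @ query_vec          # attention logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over tokens
    context = weights @ scene_tokens           # attended summary
    return context[:3]                         # stand-in "xyz" head

frames = None                                  # placeholder video input
tokens = encode_scene(frames)                  # encoder runs once

# Each row is one (pixel, timestamp) query embedding; rows are independent.
queries = rng.standard_normal((1000, tokens.shape[1]))
points = np.stack([decode_query(tokens, q) for q in queries])
print(points.shape)  # (1000, 3)
```

Because `decode_query` reads only the shared `tokens` and its own query vector, the loop could be replaced by one batched matrix multiply or sharded across devices, which is the property that lets this style of model avoid per-scene test-time optimization.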

The video provides an unusually honest tradeoff analysis. D4RT outputs point clouds rather than meshes or splats, meaning the geometry is not directly editable in tools like Blender, cannot be used for physics collisions without additional processing, and does not produce photorealistic renders. Gaussian splats and polygon meshes remain superior for visual fidelity and creative editing. D4RT’s strengths are speed, dynamic scene handling, and geometric accuracy — making it well-suited for robotics, augmented reality, sports analytics, and any pipeline that needs fast structural understanding of moving scenes.


📺 Source: Two Minute Papers · Published March 07, 2026
🏷️ Format: Deep Dive
