Description:
Two Minute Papers host Dr. Károly Zsolnai-Fehér breaks down Sonic, a new robot controller from NVIDIA’s humanoid robotics lab led by Jim Fan and Professor Yuke Zhu. The system enables a robot to accept commands in virtually any modality (live video of a human demonstrating a movement, spoken instructions, music, or plain text) and translate them into fluid, stable physical motion without requiring human-annotated action labels during training.
What makes Sonic technically notable is its scale efficiency. The final model contains just 42 million parameters — small enough to run on a smartphone — yet it was trained on 100 million frames of raw human motion using a pipeline that learns motion transitions without manual labeling. A key architectural detail is the root trajectory spring model, which dampens abrupt user commands through an exponential decay term so the robot settles smoothly at target positions without oscillating or injuring itself. The system encodes multimodal inputs through a motion generator and human encoder into universal tokens, which a decoder then maps to motor commands — a design that allows seamless switching between input types mid-task.
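The video does not give the exact formulation of the root trajectory spring model, but the behavior it describes (exponential decay toward the commanded target, no oscillation) matches a critically damped spring. The step function and stiffness value below are illustrative assumptions for showing that idea, not Sonic's actual controller:

```python
import numpy as np

def spring_damped_step(pos, vel, target, dt, stiffness=25.0):
    """Advance a critically damped spring toward `target` by one time step.

    Critical damping (damping = 2 * sqrt(stiffness)) gives the exponential
    settling behavior described in the video: the position converges to the
    target without overshooting or oscillating. The stiffness value is an
    illustrative choice, not a parameter taken from Sonic.
    """
    damping = 2.0 * np.sqrt(stiffness)
    accel = stiffness * (target - pos) - damping * vel   # spring + damper force
    vel = vel + accel * dt                               # semi-implicit Euler step
    pos = pos + vel * dt
    return pos, vel

# An abrupt 1 m step command is smoothed into a gradual, non-oscillating approach.
pos, vel = np.zeros(2), np.zeros(2)      # planar root position and velocity
target = np.array([1.0, 0.0])            # sudden user command
for _ in range(200):                     # 200 steps at 100 Hz = 2 seconds
    pos, vel = spring_damped_step(pos, vel, target, dt=0.01)
print(pos)                               # ~[1.0, 0.0], reached without overshoot
```

Critical damping is the standard choice when overshoot is unacceptable: the position approaches the target as quickly as possible without ever swinging past it, which is why the robot can take abrupt commands without lurching.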
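The encoder-to-universal-tokens-to-decoder pipeline can be sketched in a few lines of PyTorch. Everything below (layer types, feature dimensions, the 29-motor output head) is a hypothetical stand-in meant to show the data flow the video describes, not the released Sonic architecture:

```python
import torch
import torch.nn as nn

class SonicStyleController(nn.Module):
    """Sketch of the encode-to-universal-tokens / decode-to-motors data flow.

    Layer types, feature dimensions, and the 29-motor output are illustrative
    placeholders, not the released Sonic architecture.
    """

    def __init__(self, token_dim=256, num_motors=29):
        super().__init__()
        # One encoder per input modality, each projecting into the shared token space.
        self.encoders = nn.ModuleDict({
            "video": nn.Linear(512, token_dim),   # e.g. pooled frame features
            "text":  nn.Linear(768, token_dim),   # e.g. language-model embeddings
            "audio": nn.Linear(128, token_dim),   # e.g. spectrogram features for speech or music
        })
        # A decoder maps the sequence of universal tokens to per-joint motor targets.
        self.decoder = nn.GRU(token_dim, token_dim, batch_first=True)
        self.motor_head = nn.Linear(token_dim, num_motors)

    def forward(self, modality, features):
        tokens = self.encoders[modality](features)   # (batch, seq, token_dim)
        hidden, _ = self.decoder(tokens)
        return self.motor_head(hidden)               # (batch, seq, num_motors)

# Usage: route a text command through the text encoder; video or audio features
# would reuse the same decoder, which is what makes mid-task switching cheap.
model = SonicStyleController()
text_features = torch.randn(1, 8, 768)   # stand-in for embedded text tokens
motor_targets = model("text", text_features)
print(motor_targets.shape)               # torch.Size([1, 8, 29])
```

Because every modality lands in the same token space, changing the input source only swaps the active encoder while the decoder keeps running, which is the property that allows switching between video, speech, music, and text commands mid-task.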
Training required 128 GPUs over three days, but the resulting models are being released openly and are lightweight enough for consumer hardware. The video is particularly useful for anyone tracking the convergence of large-scale motion data, small deployable models, and multimodal control as a viable path toward general-purpose robot behavior.
📺 Source: Two Minute Papers · Published April 25, 2026
🏷️ Format: Deep Dive
