Descriptions:
Sam Witteveen covers the launch of NVIDIA Cosmos 3, announced at the GTC Taipei conference, describing it as a significant step forward in world foundation models for physical AI. Unlike its predecessors, which separated prediction, transfer, and generation into distinct models, Cosmos 3 unifies five modalities — text, images, video, audio, and robotic actions — into a single architecture capable of both understanding and generating across all five.
The architecture uses what NVIDIA calls a mixture of transformers with a dual-tower design: an autoregressive ‘reasoner’ tower handles input understanding, while a diffusion-based ‘generator’ tower handles output synthesis. The two towers share multimodal attention, allowing the model to go from text or image input to video or action output in a single pass. Cosmos 3 Super uses 32B parameters per tower; Cosmos 3 Nano uses 8B per tower. NVIDIA also references an unreleased edge variant intended for real-time, on-device inference. The model builds heavily on existing components: it uses Qwen3 VL (8B and 32B) and reuses VAEs from Cosmos 1.2.2.
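To put the per-tower figures in context, here is a rough sizing sketch. It assumes the two towers share no weights, counts weights only (no VAEs, activations, or caches), and uses generic bytes-per-parameter values. The helper names are hypothetical, and the results are estimates rather than numbers from NVIDIA's report.

```python
# Back-of-the-envelope sizing. All outputs are estimates, not NVIDIA's figures.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1}

# Per-tower sizes as stated in the summary, in billions of parameters.
PER_TOWER_B = {"Cosmos 3 Super": 32, "Cosmos 3 Nano": 8}
NUM_TOWERS = 2  # reasoner tower + generator tower


def total_params_b(per_tower_b, towers=NUM_TOWERS):
    """Total parameters in billions, ASSUMING the towers share no weights."""
    return per_tower_b * towers


def weight_gb(params_b, dtype="bf16"):
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    # (params_b * 1e9 params) * (bytes per param) / 1e9 bytes per GB
    return params_b * BYTES_PER_PARAM[dtype]


if __name__ == "__main__":
    for name, per_tower in PER_TOWER_B.items():
        total = total_params_b(per_tower)
        print(f"{name}: ~{total}B params, ~{weight_gb(total)} GB of weights in bf16")
```

Under these assumptions, the dual-tower layout roughly doubles the footprint implied by the headline per-tower number, which is worth keeping in mind when choosing between the Nano and Super variants for a given machine.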
Witteveen runs inference on Cosmos 3 Nano using a DGX Spark, demonstrating video generation for robotic arm training data — synthetic sequences showing a cabinet of fruit being manipulated. He highlights NVIDIA’s unusually transparent technical report, which breaks down both pre-training and supervised fine-tuning data sources in detail, and argues this makes Cosmos 3 a practical starting point for teams wanting to fine-tune on their own physical AI datasets.
📺 Source: Sam Witteveen · Published June 01, 2026
🏷️ Format: Deep Dive







