Descriptions:
NVIDIA has released Nemotron 3 Ultra, a 550-billion-parameter mixture-of-experts model built specifically for agentic workloads. In this video, Sam Witteveen breaks down the architecture, training methodology, and benchmark results in detail. The model has 55 billion active parameters, a one-million-token context window, and multi-token prediction support, and it is designed to compete with frontier proprietary models from Anthropic, OpenAI, and Google while remaining open-weights and deployable on-premises.
Witteveen digs into the novel training technique central to the model: multi-tier on-policy distillation. Rather than training a single general model directly, NVIDIA trained separate teacher models specialized for code, tool use, and instruction following, then distilled all of them into the final Nemotron 3 Ultra. This is the same approach companies like LinkedIn have used to build custom open-weights deployments for hundreds of millions of users. NVIDIA is also releasing the reinforcement learning training environments used in post-training, which Witteveen argues could meaningfully benefit the broader open-source community whether or not developers adopt this specific model.
On benchmarks, Nemotron 3 Ultra outperforms significantly larger models, including GLM’s one-trillion-parameter variant, and achieves over 300 tokens per second according to Artificial Analysis, considerably faster than comparable Chinese open models like Kimi and GLM. Witteveen also live-tests the model. For engineering teams evaluating open-weights alternatives to proprietary APIs, this video offers a technically detailed, hands-on look at what Nemotron 3 Ultra actually delivers.
📺 Source: Sam Witteveen · Published June 04, 2026
🏷️ Format: Deep Dive
