Description:
NVIDIA’s Nemotron Cascade 30B-A3B is a mixture-of-experts model with 30 billion total parameters, of which only 3 billion are active on any single forward pass, making it far more compute-efficient than its total parameter count implies. In this hands-on walkthrough, Fahd Mirza installs and runs the model on an NVIDIA A100 80GB GPU, covering the full setup process: creating a conda virtual environment, installing PyTorch and Transformers, downloading the ~50–60GB model weights from Hugging Face, and loading them in a Jupyter notebook. At inference, the model consumes approximately 64GB of VRAM.
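The numbers above line up with simple arithmetic: 30B parameters stored in bf16 (2 bytes each) come to roughly 60GB of weights, which matches both the download size and the ~64GB VRAM observed once KV-cache and runtime overhead are added. A minimal sketch of that estimate, plus a hedged version of the loading step (the Hugging Face model id shown is an assumption, not confirmed in the video):

```python
def weight_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Rough weight footprint in GB; bf16/fp16 use 2 bytes per parameter."""
    return n_params * bytes_per_param / 1e9

# 30B total parameters in bf16 -> ~60 GB of weights on disk and in VRAM,
# consistent with the ~50-60 GB download and ~64 GB observed at inference
# (the remainder being KV cache and framework overhead).
print(round(weight_gb(30e9)))  # 60


def load_model(model_id: str):
    """Hedged loading sketch: the exact repo id and any custom-code flags
    are assumptions; check the model card before relying on them."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # halves memory vs fp32
        device_map="auto",           # place weights on the available GPU(s)
    )
    return tokenizer, model
```

On a single A100 80GB this fits comfortably in bf16; quantized variants would shrink the footprint further at some quality cost.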
A significant portion of the video is devoted to explaining Nemotron Cascade’s multi-stage post-training pipeline, which includes supervised fine-tuning on math, code, science, and tool-use data; instruction-following RL; multi-domain RL across STEM reasoning and structured outputs; and a novel step called Multi-domain On-Policy Distillation (MOPD). In MOPD, the model trains against its own strongest domain-specific checkpoints from earlier in the pipeline: the best math checkpoint teaches math, the best alignment checkpoint teaches alignment, and so on, recovering skills that tend to degrade during RL stages. The pipeline concludes with RLHF, long-context RL, competitive programming RL on 3,500 hard problems, and software engineering agent RL.
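On-policy distillation of this kind typically means the student samples its own outputs and is then pushed toward a teacher's token distribution on those samples, with the teacher chosen per domain. The video does not give the exact loss, so this is a hedged sketch of the usual per-token KL objective, with a hypothetical domain-to-teacher routing table standing in for MOPD's checkpoint selection:

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits):
    """Per-token forward KL(teacher || student): gradients pull the student
    toward the teacher's distribution on tokens the *student* sampled,
    which is the standard on-policy-distillation setup MOPD resembles."""
    q = softmax(student_logits)
    p = softmax(teacher_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical MOPD-style routing: each prompt's domain selects which
# earlier checkpoint acts as the teacher (names are illustrative only).
teacher_for_domain = {
    "math": "best_math_checkpoint",
    "alignment": "best_alignment_checkpoint",
    "code": "best_code_checkpoint",
}

# Identical distributions give zero loss; divergence gives a positive one.
print(distill_loss([1.0, 2.0, 0.5], [1.0, 2.0, 0.5]))  # 0.0 (up to rounding)
```

The key property, and the stated motivation for MOPD, is that the teacher signal is dense (a full distribution per token) rather than a sparse scalar reward, which is why it can restore skills that RL's reward-only signal let drift.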
The result is a model that achieved gold-medal performance on the International Mathematical Olympiad (IMO) and International Collegiate Programming Contest (ICPC) benchmarks, outperforming models two to four times its size on coding and math, despite being roughly 20 times smaller than the only other model previously reaching that level. Mirza tests it on a complex creative coding prompt and notes strong reasoning performance alongside weaker general knowledge, a known limitation of the base model rather than of the post-training pipeline.
📺 Source: Fahd Mirza · Published March 20, 2026
🏷️ Format: Tutorial Demo
