Description:
Fahd Mirza demonstrates how to locally install and run Qwen 3.6 35B-A3B, Alibaba's latest mixture-of-experts model released as fully open weights. The model has 35 billion total parameters but activates only about 3 billion per token through expert routing, delivering the knowledge base of a large dense model at a fraction of the compute cost. The tutorial runs on an Ubuntu system with a single NVIDIA H100 (80GB VRAM), using vLLM as the inference engine.
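To make the sparse-activation idea concrete, here is a toy top-k routing sketch. It is illustrative only: the hidden size, the number of experts run per token, and the router itself are assumptions, not the model's actual code.

```python
import numpy as np

def top_k_moe(x, gate_w, experts, k=8):
    """Toy mixture-of-experts layer: route one token to its top-k experts.

    x       : (hidden,) activation for a single token
    gate_w  : (num_experts, hidden) router weights
    experts : list of callables, one per expert FFN
    Only the k selected experts run, so compute scales with k,
    not with the total expert count.
    """
    logits = gate_w @ x                # score every expert
    top = np.argsort(logits)[-k:]      # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()           # softmax over the selected experts only
    # Weighted sum of the k expert outputs
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Tiny demo: 256 experts, as described for the model; k=8 is an assumption
rng = np.random.default_rng(0)
hidden, num_experts = 64, 256
gate_w = rng.normal(size=(num_experts, hidden))
experts = [lambda x, W=rng.normal(size=(hidden, hidden)) * 0.01: W @ x
           for _ in range(num_experts)]
out = top_k_moe(rng.normal(size=hidden), gate_w, experts)
print(out.shape)  # (64,)
```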
The video covers the full technical setup: downloading the weights via the Hugging Face CLI (26 shards), configuring vLLM with a 32K context cap to fit within the H100's remaining headroom after the model weights consume roughly 70GB, setting GPU memory utilization to 0.90, enabling the tool-calling flags needed by agentic frameworks, and handling the model's thinking blocks correctly with the reasoning parser. The architecture is explained throughout: 40 layers alternating gated delta attention with standard gated attention, each feeding into a 256-expert MoE block, with a native 262K-token context window extensible to 1 million. A notable new feature called thinking preservation retains the model's reasoning chain across multi-turn conversations, which Mirza flags as significant for long agentic sessions.
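The description's settings map onto vLLM roughly as follows. This is a sketch, not the video's exact commands: the Hugging Face repo id is a guess based on Qwen's naming scheme, and the tool-calling and reasoning-parser options mentioned above are flags on vLLM's OpenAI-compatible server (`vllm serve`) rather than the offline API shown here.

```python
# Sketch of the setup described in the video, using vLLM's offline Python API.
from huggingface_hub import snapshot_download
from vllm import LLM, SamplingParams

MODEL = "Qwen/Qwen3.6-35B-A3B"  # hypothetical repo id, guessed from Qwen's naming

# Equivalent of the Hugging Face CLI download step (pulls all 26 shards)
snapshot_download(repo_id=MODEL)

llm = LLM(
    model=MODEL,
    max_model_len=32_768,         # 32K context cap: weights eat ~70GB of the 80GB card
    gpu_memory_utilization=0.90,  # leave ~10% headroom, as configured in the video
)

out = llm.chat(
    [{"role": "user", "content": "Say hello in three languages."}],
    SamplingParams(max_tokens=128),
)
print(out[0].outputs[0].text)

# When serving instead, the tool-calling and reasoning settings become server
# flags, roughly like this (exact parser names are assumptions):
#   vllm serve Qwen/Qwen3.6-35B-A3B --max-model-len 32768 \
#       --gpu-memory-utilization 0.90 --enable-auto-tool-choice \
#       --tool-call-parser hermes --reasoning-parser qwen3
```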
Live tests include a tabbed CSS interface, a single-shot playable Flappy Bird-style HTML game (rendered with no external dependencies), and a multilingual announcement task. Benchmark comparisons against Qwen 3.5 and Claude Sonnet 4.5 are discussed, with the model showing competitive performance on agentic coding and multimodal understanding. A follow-up video covering OpenClaw integration for autonomous coding is referenced.
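For the thinking-preservation behavior flagged above, a minimal multi-turn client sketch against a locally served endpoint might look like the following. The `reasoning_content` field is what vLLM's reasoning parsers expose on responses; feeding it back into the message history is an assumption about how the preserved chain would be carried between turns, not something the description confirms.

```python
# Minimal multi-turn sketch against a local vLLM server started with a
# reasoning parser enabled. Endpoint and model name are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "Qwen/Qwen3.6-35B-A3B"  # hypothetical repo id

messages = [{"role": "user", "content": "Plan a tabbed CSS interface."}]
reply = client.chat.completions.create(model=MODEL, messages=messages)
msg = reply.choices[0].message

# With a reasoning parser, vLLM splits the thinking block out of the answer
print("thinking:", msg.reasoning_content)
print("answer:", msg.content)

# Assumed pattern for thinking preservation: carry the reasoning back into
# the history so later turns can see the earlier chain of thought.
messages.append({
    "role": "assistant",
    "content": msg.content,
    "reasoning_content": msg.reasoning_content,  # assumption, not a documented field
})
messages.append({"role": "user", "content": "Now add keyboard navigation."})
reply = client.chat.completions.create(model=MODEL, messages=messages)
print(reply.choices[0].message.content)
```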
📺 Source: Fahd Mirza · Published April 16, 2026
🏷️ Format: Hands On Build