Description:
Fahd Mirza demonstrates running the Qwen3.5-122B-A10B mixture-of-experts model on a single Nvidia H100 80GB GPU using llama.cpp — a feat he notes would have been impossible just months ago. The model packs 122 billion total parameters but activates only 10 billion per token through sparse expert routing across 256 specialists, making large-scale local inference feasible with the right quantization strategy.
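To make the sparse-activation idea concrete, here is a minimal NumPy sketch of top-k mixture-of-experts routing. It assumes a standard softmax router; the expert count (256) comes from the video, while TOP_K, the hidden size, and the weight shapes are illustrative placeholders, not Qwen3.5's real dimensions.

```python
# Minimal sketch of sparse MoE routing: only TOP_K of N_EXPERTS expert
# networks are evaluated per token, which is why a 122B-parameter model
# can run with only ~10B parameters active. TOP_K and D_MODEL here are
# toy values chosen for illustration.
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS = 256   # total expert subnetworks (per the video)
TOP_K = 8         # experts activated per token (assumed value)
D_MODEL = 64      # toy hidden size; the real model is far larger

# Each "expert" is a toy feed-forward weight matrix.
experts = rng.normal(size=(N_EXPERTS, D_MODEL, D_MODEL)) * 0.02
router_w = rng.normal(size=(D_MODEL, N_EXPERTS)) * 0.02

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token through only TOP_K of the N_EXPERTS experts."""
    logits = x @ router_w                 # router score per expert
    top = np.argsort(logits)[-TOP_K:]     # indices of the k highest scores
    gates = np.exp(logits[top])
    gates /= gates.sum()                  # softmax over the selected experts
    # Only TOP_K / N_EXPERTS of the expert weights are ever touched.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

token = rng.normal(size=D_MODEL)
print(moe_forward(token).shape)  # (64,)
```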
The video covers downloading the Q4_K_M quantized build (a 15-shard package that loads to roughly 73 GB of VRAM) from the Hugging Face Hub, then serving it through llama.cpp's local server on port 8080. Mirza explains his preference for Q4_K_M: it balances accuracy and performance well, while lower quantization levels introduce quality degradation he considers unacceptable for production use. The model supports a 262K-token context window, theoretically extensible to 1 million tokens, and uses a gated delta network architecture.
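A sketch of that download-and-serve workflow, using the huggingface_hub Python API and llama.cpp's llama-server binary. The repo id and GGUF filenames below are placeholders (the video names the actual Q4_K_M repository), and the context size is set conservatively rather than at the 262K maximum.

```python
# Sketch: pull the Q4_K_M shards, then launch llama.cpp's server on
# port 8080 as described in the video. Repo id and shard filename are
# hypothetical placeholders; substitute the repository used in the video.
import subprocess
from huggingface_hub import snapshot_download

# Download only the Q4_K_M GGUF shards (all 15 of them).
local_dir = snapshot_download(
    repo_id="Qwen/Qwen3.5-122B-A10B-GGUF",   # placeholder repo id
    allow_patterns=["*Q4_K_M*.gguf"],
)

# llama.cpp loads the remaining shards automatically when pointed at
# the first one.
subprocess.run([
    "llama-server",
    "-m", f"{local_dir}/Qwen3.5-122B-A10B-Q4_K_M-00001-of-00015.gguf",
    "--port", "8080",   # local endpoint used in the video
    "-ngl", "99",       # offload all layers to the H100
    "-c", "32768",      # working context; the model supports up to 262K
])
```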
In testing, Qwen3.5-122B generates a polished, fully self-contained HTML landing page with animated GPU cards and Egyptian-themed styling in a single pass — completing the task notably faster than the earlier 27B dense model covered on the same channel. Mirza also provides a practical mental model for choosing between dense and MoE architectures: opt for dense when VRAM is limited and consistency matters most; choose MoE when you have headroom and need specialized reasoning capacity. The session rounds out a three-video series progressing from 27B dense to 35B MoE to this 122B flagship.
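Since llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint, the landing-page test is easy to reproduce against the local server with a plain HTTP POST. The prompt below paraphrases the task from the video.

```python
# Reproduce the landing-page test against the local llama.cpp server.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{
            "role": "user",
            "content": ("Write a single self-contained HTML landing page "
                        "with animated GPU cards and Egyptian-themed styling."),
        }],
        "max_tokens": 4096,
    },
    timeout=600,  # large generations can take a while
)
print(resp.json()["choices"][0]["message"]["content"])
```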
📺 Source: Fahd Mirza · Published February 25, 2026
🏷️ Format: Tutorial Demo