Description:
Fahd Mirza demonstrates running the Qwen3.5-122B-A10B mixture-of-experts model on a single Nvidia H100 80GB GPU using llama.cpp — a feat he notes would have been impossible just months ago. The model packs 122 billion total parameters but activates only 10 billion per token through sparse expert routing across 256 specialists, making large-scale local inference feasible with the right quantization strategy.
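To make the sparse-activation idea concrete, here is a minimal NumPy sketch of top-k mixture-of-experts routing. It assumes a standard softmax router; the expert count (256) comes from the video, while TOP_K, the hidden size, and the weight shapes are illustrative placeholders, not Qwen3.5's real dimensions.

```python
# Minimal sketch of sparse MoE routing: only TOP_K of N_EXPERTS expert
# networks are evaluated per token, which is why a 122B-parameter model
# can run with only ~10B parameters active. TOP_K and D_MODEL here are
# toy values chosen for illustration.
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS = 256   # total expert subnetworks (per the video)
TOP_K = 8         # experts activated per token (assumed value)
D_MODEL = 64      # toy hidden size; the real model is far larger

# Each "expert" is a toy feed-forward weight matrix.
experts = rng.normal(size=(N_EXPERTS, D_MODEL, D_MODEL)) * 0.02
router_w = rng.normal(size=(D_MODEL, N_EXPERTS)) * 0.02

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token through only TOP_K of the N_EXPERTS experts."""
    logits = x @ router_w                 # router score per expert
    top = np.argsort(logits)[-TOP_K:]     # indices of the k highest scores
    gates = np.exp(logits[top])
    gates /= gates.sum()                  # softmax over the selected experts
    # Only TOP_K / N_EXPERTS of the expert weights are ever touched.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

token = rng.normal(size=D_MODEL)
print(moe_forward(token).shape)  # (64,)
```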
The video covers downloading the Q4_K_M quantized build (a 15-shard package that loads to roughly 73 GB of VRAM) from the Hugging Face Hub, then serving it through llama.cpp's local server on port 8080. Mirza explains his preference for Q4_K_M: it balances accuracy and performance well, while lower quantization levels introduce quality degradation he considers unacceptable for production use. The model supports a 262K-token context window, theoretically extensible to 1 million tokens, and uses a gated delta network architecture.
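A sketch of that download-and-serve workflow, using the huggingface_hub Python API and llama.cpp's llama-server binary. The repo id and GGUF filenames below are placeholders (the video names the actual Q4_K_M repository), and the context size is set conservatively rather than at the 262K maximum.

```python
# Sketch: pull the Q4_K_M shards, then launch llama.cpp's server on
# port 8080 as described in the video. Repo id and shard filename are
# hypothetical placeholders; substitute the repository used in the video.
import subprocess
from huggingface_hub import snapshot_download

# Download only the Q4_K_M GGUF shards (all 15 of them).
local_dir = snapshot_download(
    repo_id="Qwen/Qwen3.5-122B-A10B-GGUF",   # placeholder repo id
    allow_patterns=["*Q4_K_M*.gguf"],
)

# llama.cpp loads the remaining shards automatically when pointed at
# the first one.
subprocess.run([
    "llama-server",
    "-m", f"{local_dir}/Qwen3.5-122B-A10B-Q4_K_M-00001-of-00015.gguf",
    "--port", "8080",   # local endpoint used in the video
    "-ngl", "99",       # offload all layers to the H100
    "-c", "32768",      # working context; the model supports up to 262K
])
```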
In testing, Qwen3.5-122B generates a polished, fully self-contained HTML landing page with animated GPU cards and Egyptian-themed styling in a single pass — completing the task notably faster than the earlier 27B dense model covered on the same channel. Mirza also provides a practical mental model for choosing between dense and MoE architectures: opt for dense when VRAM is limited and consistency matters most; choose MoE when you have headroom and need specialized reasoning capacity. The session rounds out a three-video series progressing from 27B dense to 35B MoE to this 122B flagship.
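Since llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint, the landing-page test is easy to reproduce against the local server with a plain HTTP POST. The prompt below paraphrases the task from the video.

```python
# Reproduce the landing-page test against the local llama.cpp server.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{
            "role": "user",
            "content": ("Write a single self-contained HTML landing page "
                        "with animated GPU cards and Egyptian-themed styling."),
        }],
        "max_tokens": 4096,
    },
    timeout=600,  # large generations can take a while
)
print(resp.json()["choices"][0]["message"]["content"])
```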
📺 Source: Fahd Mirza · Published February 25, 2026
🏷️ Format: Tutorial Demo