Description:
Fahd Mirza delivers a same-day setup guide for Qwen3.5 35B-A3B, a Mixture of Experts (MoE) model from the Qwen series, running on an Ubuntu system with an Nvidia RTX 6000 (48GB VRAM). The model has 35 billion total parameters but activates only 3 billion per token, routing each token through 9 of its 256 experts, which gives it the knowledge breadth of a large model at roughly the inference cost of a 3B dense model.
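To make the routing arithmetic concrete, here is a minimal numpy sketch of top-k expert routing. The 9-of-256 figures come from the video; the hidden size, router, and expert weights are toy values invented for illustration, not the model's actual architecture.

```python
import numpy as np

NUM_EXPERTS = 256   # total experts (figure quoted in the video)
TOP_K = 9           # experts activated per token (ditto)
D_MODEL = 64        # toy hidden size, purely illustrative

rng = np.random.default_rng(0)
router_w = rng.standard_normal((D_MODEL, NUM_EXPERTS))
experts = rng.standard_normal((NUM_EXPERTS, D_MODEL, D_MODEL))  # toy expert weights

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token vector through its top-k experts only."""
    logits = x @ router_w                        # router score per expert
    top = np.argsort(logits)[-TOP_K:]            # pick the k highest-scoring experts
    w = np.exp(logits[top] - logits[top].max())  # stable softmax over the chosen k
    w /= w.sum()
    # Only k of the 256 expert matmuls execute, which is why a
    # 35B-total model can run at roughly a ~3B dense model's cost.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

print(moe_forward(rng.standard_normal(D_MODEL)).shape)  # -> (64,)
```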
The tutorial covers downloading the Q8 quantized build (approximately 37GB) from unsloth via Hugging Face Hub, serving it through llama.cpp with full GPU offloading (a download-and-serve sketch follows below), and verifying that VRAM consumption stays under 37GB at the full Q8 quant. Key benchmark results include GPQA Diamond at 84.2, instruction following at 91.9, and competitive scores on agentic coding and multilingual tasks. A direct comparison with the 27B dense model shows the MoE variant completing chain-of-thought reasoning significantly faster: where the dense model took 5–6 minutes to think, the MoE model finished its reasoning phase in notably less time.
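A minimal sketch of the download-and-serve flow described above, assuming Python with huggingface_hub installed and a llama.cpp `llama-server` binary on PATH. The repo id and GGUF filename below are placeholders inferred from the video's description, not verified identifiers:

```python
from huggingface_hub import hf_hub_download
import subprocess

# Assumed repo id / filename: substitute the actual unsloth GGUF
# repo and Q8 file shown in the video.
gguf_path = hf_hub_download(
    repo_id="unsloth/Qwen3.5-35B-A3B-GGUF",  # placeholder repo id
    filename="Qwen3.5-35B-A3B-Q8_0.gguf",    # placeholder ~37GB Q8 build
)

# Serve with llama.cpp. "-ngl 99" offloads every layer to the GPU,
# keeping the whole quant resident in the RTX 6000's 48GB of VRAM.
subprocess.run([
    "llama-server",
    "-m", gguf_path,
    "-ngl", "99",      # full GPU offload
    "--port", "8080",  # HTTP endpoint for chat completions
])
```

Once the server is up, VRAM usage can be checked with `nvidia-smi`, which is how the video verifies the sub-37GB footprint.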
Qualitative tests include generating a Mars electrical storm simulation as a single self-contained HTML file using vanilla JavaScript and CSS — producing a particle system with procedural lightning and physics — and a jailbreak safety test using an emotional manipulation prompt, which the model correctly refuses. Despite strong MoE performance, Mirza recommends the 27B dense model for production use where VRAM permits, citing consistently high output quality.
📺 Source: Fahd Mirza · Published February 25, 2026
🏷️ Format: Tutorial Demo