Descriptions:
Fahd Mirza tests Qwopus Coder, a 35-billion-parameter mixture-of-experts coding model built on the Qwen 3.6 architecture (3B parameters active per token), with a specific focus on its built-in Multi-Token Prediction (MTP) capability. The setup runs on a single Nvidia RTX 6000 Ada with 48 GB of VRAM, with just over 23 GB in use. The model is served through llama.cpp with speculative decoding enabled via the `--draft-mtp` flag and a maximum draft of three tokens ahead. No secondary draft model is required, since the draft heads are baked into the weights.
The practical test uses a deliberately broken full-stack call center dashboard (a FastAPI backend with SQLite and a plain HTML frontend) with planted bugs, including a port mismatch between frontend and backend and multiple backend logic errors. Qwopus Coder, orchestrated through the Hermes agent framework, autonomously identifies and fixes all bugs without hints in a single agentic loop, verifying each fix before moving to the next. Post-run llama.cpp server logs show a 98.7% draft token acceptance rate and throughput of approximately 160 tokens per second on the 35B model, a strong result for single-GPU local inference on a model of this size.
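To give a feel for what a 98.7% acceptance rate means with a maximum draft of three tokens, the sketch below estimates how many tokens one verification pass yields on average. It rests on a simplifying assumption that is not from the video: each drafted token is accepted independently with the same probability. The function name `expected_tokens_per_pass` is hypothetical, and the result is a back-of-the-envelope figure, not a measurement from the llama.cpp logs.

```python
def expected_tokens_per_pass(accept_rate: float, max_draft: int) -> float:
    """Estimate tokens emitted per verification pass.

    Assumption (illustrative only): every drafted token is accepted
    independently with probability `accept_rate`. The pass always yields
    at least one token, and the i-th extra token is only reached when the
    first i drafts were all accepted, which gives the sum p^0 + ... + p^k.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if max_draft < 0:
        raise ValueError("max_draft must be non-negative")
    return sum(accept_rate ** i for i in range(max_draft + 1))


if __name__ == "__main__":
    # Figures quoted in the description: 98.7% acceptance, draft of three tokens.
    print(round(expected_tokens_per_pass(0.987, 3), 2))
```

Under that assumption the estimate comes out at roughly 3.9 tokens per pass, close to the ceiling of 4 (three drafts plus the token the main model produces itself).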
Mirza explains MTP clearly: instead of one forward pass per token, the model’s integrated draft heads predict several tokens at once from already-computed hidden states, and drafts that match what the main model would have produced are kept, so a single verification pass can emit several tokens. The video concludes that Qwopus Coder’s combination of agentic coding capability and high local throughput makes it a compelling option for developers running inference on prosumer or workstation-class hardware.
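The accept-or-reject step described above can be sketched as a small function. This is a minimal illustration of greedy draft verification in general, not llama.cpp's actual implementation. The name `verify_draft` and the use of plain integer token IDs are assumptions made for the example.

```python
from typing import List, Sequence


def verify_draft(draft: Sequence[int], target: Sequence[int]) -> List[int]:
    """Keep drafted tokens for as long as they match the main model's choices.

    `draft`  : token IDs proposed by the draft heads.
    `target` : the token the main model picks at each drafted position,
               plus one more for the position after the last draft,
               so len(target) == len(draft) + 1.
    Returns the tokens emitted by this verification pass.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must have exactly one more entry than draft")

    emitted: List[int] = []
    for drafted, wanted in zip(draft, target):
        if drafted != wanted:
            # First mismatch: drop the rest of the draft and emit the
            # main model's token for this position instead.
            emitted.append(wanted)
            return emitted
        emitted.append(drafted)

    # Every draft matched, so the pass also yields one bonus token.
    emitted.append(target[-1])
    return emitted


if __name__ == "__main__":
    print(verify_draft([5, 6, 7], [5, 6, 7, 8]))  # all three drafts accepted
    print(verify_draft([5, 9, 7], [5, 6, 7, 8]))  # second draft rejected
```

Each pass emits at least one token, so a rejected draft costs nothing in output quality; the gain comes from the passes in which several drafts are accepted together.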
📺 Source: Fahd Mirza · Published July 01, 2026
🏷️ Format: Benchmark Test







