Comparing Full Precision vs Ollama Version of Qwen3.6-35B-A3B Locally

Description:

Fahd Mirza runs a direct head-to-head comparison of Qwen 3.6 35B-A3B (a 35-billion-parameter mixture-of-experts model whose "A3B" suffix denotes roughly 3 billion active parameters per token) in two configurations: full precision served via vLLM at 65 GB, and the Ollama Q4_K_M quantized version at 23 GB, both running on a single NVIDIA H100 with 80 GB of VRAM. The test methodology is concrete: both models receive identical prompts, and the generated code is compiled with GCC and executed to observe runtime behavior, not just syntactic correctness.

The key finding: both models produce compilable code for a Minesweeper game with a moving-mines twist, but the quality difference becomes visible at runtime. The full-precision version's code triggers the recursive flood-fill correctly, revealing 35 connected cells on a single click, a fundamental Minesweeper mechanic. The Q4_K_M Ollama version compiles with two minor warnings but reveals only the single clicked cell, missing the flood-fill logic entirely. Mirza explains Q4_K_M quantization in detail: 4-bit integer storage with K-means clustering for intelligent weight grouping cuts memory requirements by roughly 75%, while the "M" (medium) variant balances compression against quality preservation.

The practical takeaway is that Q4_K_M quantization handles simple tasks without visible degradation but introduces meaningful logic gaps in more complex code generation, a useful calibration point for developers choosing between full-precision and quantized local deployments of large open-weight models.
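The quantization scheme described above (4-bit codes chosen by K-means clustering over weight values) can be approximated in a few lines. This is a simplified sketch of the idea, not the actual llama.cpp Q4_K_M implementation, which adds block structure, scales, and other refinements.

```python
import random


def kmeans_1d(values, k=16, iters=20):
    """Cluster scalar weights into k centroids (k=16 fits a 4-bit code)."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda i: abs(v - centroids[i]))
            buckets[idx].append(v)
        centroids = [sum(b) / len(b) if b else centroids[i]
                     for i, b in enumerate(buckets)]
    return centroids


def quantize_block(weights, k=16):
    """Replace each weight with the index of its nearest centroid."""
    centroids = kmeans_1d(weights, k)
    codes = [min(range(k), key=lambda i: abs(w - centroids[i]))
             for w in weights]
    return codes, centroids


def dequantize(codes, centroids):
    """Recover approximate weights from 4-bit codes plus the centroid table."""
    return [centroids[c] for c in codes]


def memory_ratio(n_weights, k=16, code_bits=4, float_bits=32):
    """Fraction of fp32 storage used by codes plus the centroid table."""
    return (n_weights * code_bits + k * float_bits) / (n_weights * float_bits)
```

For a 256-weight block, `memory_ratio(256)` is about 0.19, in line with the roughly 75% memory reduction cited in the video; the residual reconstruction error is exactly the kind of small per-weight noise that accumulates into the logic gaps seen at runtime.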


📺 Source: Fahd Mirza · Published April 18, 2026
🏷️ Format: Benchmark Test
