Best Qwen3.6 Quant You Can Run Right Now Locally

Descriptions:

Fahd Mirza examines Nvidia’s official FP4 (NVFP4) quantization of Qwen3.6 35B A22B, a release validated with Nvidia’s own Model Optimizer tool against the original BF16 weights, with a reported accuracy delta of less than 0.5%. The video explains why this quantization level calls for Hopper or Blackwell GPUs (H100, H200, B200, RTX 50 series) and cannot run efficiently on older cards such as the RTX 3090 (Ampere) or the RTX 4090 (Ada Lovelace).

The technical core of the video is an explanation of NVFP4’s micro-block scaling: groups of 16 weight values share a high-precision FP8 scaling factor, which preserves accuracy far better than naive int4 rounding. Mirza then walks through pulling the model from Hugging Face (~23.5 GB), serving it with a nightly vLLM Docker build, and observing over 75 GB of VRAM consumption at a large context window. He also systematically compares the three main release variants: Nvidia’s official version (bundles the MTP draft model, drops the vision encoder), Red Hat AI’s release (separates the MTP file and retains vision inputs), and community repacks (essentially mirrors of the prior two).
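
The micro-block idea can be illustrated with a minimal Python sketch. It is not Nvidia’s implementation: the scale is kept as an ordinary float rather than being stored in FP8, any tensor-level scaling is left out, and the function names and synthetic weights are hypothetical. The sketch quantizes values to the eight magnitudes a 4-bit E2M1 float can represent, using one scale per group of 16, and compares the error against int4 rounding that uses a single scale for the whole tensor.

```python
import random

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # number of values sharing one scale, as described above


def quantize_block_fp4(block):
    """Quantize one block to the FP4 grid with a shared scale, then dequantize."""
    peak = max(abs(v) for v in block)
    if peak == 0.0:
        return [0.0] * len(block)
    # Assumption: the scale stays a Python float; a real format would store it in FP8.
    scale = peak / FP4_GRID[-1]
    out = []
    for v in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out


def quantize_microblock(values, block=BLOCK):
    """Apply per-block FP4 quantization across a flat list of weights."""
    out = []
    for i in range(0, len(values), block):
        out.extend(quantize_block_fp4(values[i:i + block]))
    return out


def quantize_int4_per_tensor(values):
    """Baseline: round to integers in [-8, 7] using one scale for all values."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return [0.0] * len(values)
    scale = peak / 7
    return [max(-8, min(7, round(v / scale))) * scale for v in values]


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def bits_per_weight(value_bits=4, scale_bits=8, block=BLOCK):
    """Storage cost per weight: the value itself plus its share of the block scale."""
    return value_bits + scale_bits / block


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical weights: many small values plus a single large outlier.
    weights = [rng.gauss(0.0, 0.02) for _ in range(256)]
    weights[5] = 1.0
    print("micro-block FP4 MSE:", mse(weights, quantize_microblock(weights)))
    print("per-tensor int4 MSE:", mse(weights, quantize_int4_per_tensor(weights)))
    print("bits per weight:", bits_per_weight())
```

On data shaped like this, the single int4 scale is set by the outlier and tends to flatten the small weights to zero, while per-block scales confine that damage to one group of 16. The storage cost is 4 bits per weight plus one 8-bit scale shared by 16 weights, which works out to 4.5 bits per weight.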

A practical coding benchmark rounds out the video: the model is asked to generate a complete, self-contained animated HTML5 canvas application (a multi-stage medieval castle construction animation) in a single pass, as a stress test of one-shot code generation. For teams evaluating quantized open-weight models for GPU-constrained inference workloads, the video is a useful reference on both capability and hardware requirements.


📺 Source: Fahd Mirza · Published May 31, 2026
🏷️ Format: Tutorial Demo
