Best Qwen3.6 Quant You Can Run Right Now Locally

Descriptions:

Fahd Mirza examines Nvidia’s official FP4 (NVFP4) quantization of Qwen3.6 35B-A3B, a release validated with Nvidia’s own Model Optimizer tool against the original BF16 weights, with a reported accuracy delta of less than 0.5%. The video explains why this quantization level demands Hopper or Blackwell GPU architecture (H100, H200, B200, RTX 50 series) and cannot run efficiently on earlier-generation cards such as the Ada Lovelace RTX 4090.

The technical core of the video is an explanation of NVFP4’s micro-block scaling: each group of 16 weight values shares one FP8 scaling factor, which preserves accuracy far better than naive INT4 rounding with a single coarse scale. Mirza then walks through pulling the model from Hugging Face (~23.5 GB), serving it with a nightly vLLM Docker build, and observing over 75 GB of VRAM in use at a large context window. That figure is well above the download size because vLLM reserves GPU memory for the KV cache in addition to the weights. He also compares the three main release variants: Nvidia’s official version (bundles the MTP draft model, drops the vision encoder), Red Hat AI’s release (ships the MTP file separately and retains vision inputs), and community repacks (essentially mirrors of the other two).
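The micro-block idea described above can be illustrated with a minimal pure-Python sketch. It is not Nvidia's implementation: the function names are hypothetical, the per-block scale is kept as an ordinary Python float instead of being rounded to FP8, and the additional per-tensor scale that NVFP4 uses is omitted. The only fixed inputs are the block size of 16 and the eight magnitudes a 4-bit E2M1 float can represent.

```python
# Illustrative sketch of block-scaled 4-bit float quantization.
# Assumptions: names are hypothetical; the scale stays a Python float
# (real NVFP4 stores it as FP8) and no per-tensor scale is applied.

# Magnitudes representable by a 4-bit E2M1 float (sign stored separately).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # number of values that share one scale


def quantize_block(block):
    """Return (scale, codes); codes are signed values from FP4_GRID."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = amax / FP4_GRID[-1]  # largest magnitude in the block maps to 6.0
    codes = []
    for x in block:
        target = abs(x) / scale
        mag = min(FP4_GRID, key=lambda g: abs(target - g))  # nearest grid point
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a scale and its FP4 codes."""
    return [scale * c for c in codes]


def roundtrip(values, block_size=BLOCK_SIZE):
    """Quantize and dequantize a flat list, one scale per block."""
    out = []
    for start in range(0, len(values), block_size):
        scale, codes = quantize_block(values[start:start + block_size])
        out.extend(dequantize_block(scale, codes))
    return out


def bits_per_weight(block_size=BLOCK_SIZE, code_bits=4, scale_bits=8):
    """Average storage per weight: the code plus its share of the scale."""
    return code_bits + scale_bits / block_size


if __name__ == "__main__":
    # One block of small weights followed by one block of large weights.
    weights = [0.01 * (i % 7 - 3) for i in range(16)]
    weights += [1.0 * (i % 7 - 3) for i in range(16)]

    per_block = roundtrip(weights, block_size=16)
    one_scale = roundtrip(weights, block_size=len(weights))

    def max_err(approx):
        return max(abs(a - b) for a, b in zip(weights, approx))

    print("max error, one scale per 16 values:", max_err(per_block))
    print("max error, one scale for everything:", max_err(one_scale))
    print("bits per weight:", bits_per_weight())
```

With a single shared scale, the small weights in the first block all collapse to zero because the large block dictates the scale; with one scale per 16 values, each block uses the full FP4 range. The same accounting gives 4 + 8/16 = 4.5 bits per weight, so 35 billion parameters come to roughly 19.7 GB. The description does not break down what makes up the rest of the ~23.5 GB download.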

A practical coding benchmark rounds out the video: to stress-test one-shot code generation, the model is asked to produce, in a single pass, a complete, self-contained animated HTML5 canvas application showing a multi-stage medieval castle construction. For teams evaluating quantized open-weight models for GPU-constrained inference workloads, the video is a useful reference on both capability and hardware requirements.

📺 Source: Fahd Mirza · Published May 31, 2026
🏷️ Format: Tutorial Demo

🏷️ Tags: Blackwell, H100, H200, NVFP4, Nvidia, vLLM
