Qwen3.5 9B at 4-Bit: Intel’s Quantized Model Runs Locally with 4x Less VRAM


Description:

Fahd Mirza demonstrates running Intel's AutoRound INT4 quantized version of Qwen3.5 9B locally using vLLM, covering both the practical setup and the underlying mechanism that lets this quantization method retain accuracy better than naive alternatives. The full BF16 model requires 18GB of VRAM; the INT4 version drops that to 4–6GB, roughly a fourfold reduction, while preserving reasoning quality for most tasks.
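
Assuming the quantized checkpoint is published in a vLLM-compatible format, loading it locally takes only a few lines with vLLM's offline Python API. The model ID below is a placeholder, not the actual repository name from the video, so substitute the real one:

```python
# Minimal sketch of loading an INT4 AutoRound checkpoint with vLLM's
# offline API. The model ID is hypothetical; use the repo shown in the
# video. vLLM detects the quantization scheme from the checkpoint's
# config, so no explicit quantization flag is usually needed.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Intel/Qwen3.5-9B-int4-AutoRound",  # placeholder model ID
    dtype="bfloat16",        # compute dtype; INT4 remains the storage format
    max_model_len=8192,      # cap the context to keep the KV cache small
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain INT4 quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)
```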

The technical explanation is a highlight: where naive INT4 quantization simply rounds each weight to the nearest 4-bit value, Intel's AutoRound uses signed gradient descent to learn the best rounding direction for each weight individually. This per-weight optimization is why the compressed model avoids the accuracy collapse that simpler quantization methods suffer. The video also clarifies why vLLM still loads weights in BF16 dtype at inference time: INT4 is the storage format, but matrix multiplications require higher precision to prevent compounding errors.
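
To make the mechanism concrete, here is a toy sketch of the core idea, not Intel's actual implementation: learn a per-weight rounding offset with signed gradient descent so the quantized layer reproduces the full-precision output, then dequantize to float for the matrix multiply. All tensor shapes, the synthetic calibration data, and the hyperparameters are illustrative:

```python
import torch

torch.manual_seed(0)
W = torch.randn(64, 64)        # full-precision weights of one linear layer
X = torch.randn(256, 64)       # calibration activations (synthetic here)
Y = X @ W.T                    # full-precision reference output

# Symmetric INT4: integer levels in [-8, 7], one scale per output channel.
scale = W.abs().amax(dim=1, keepdim=True) / 7.0

def dequant(W, offset):
    # Quantize with a (possibly learnable) rounding offset, then
    # dequantize back to float: matmuls run on the float tensor,
    # so INT4 is only the storage format.
    v = W / scale + offset
    q = v + (v.round() - v).detach()        # straight-through estimator
    return torch.clamp(q, -8, 7) * scale

# Baseline: plain round-to-nearest (offset fixed at zero).
rtn_mse = (X @ dequant(W, torch.zeros_like(W)).T - Y).pow(2).mean().item()

# AutoRound-style: tune the offset using only the *sign* of the gradient.
offset = torch.zeros_like(W, requires_grad=True)
steps, lr = 200, 1.0 / 200
for _ in range(steps):
    loss = (X @ dequant(W, offset).T - Y).pow(2).mean()
    loss.backward()
    with torch.no_grad():
        offset -= lr * offset.grad.sign()   # signed gradient descent step
        offset.clamp_(-0.5, 0.5)            # offset only nudges the rounding
        offset.grad.zero_()

print(f"round-to-nearest MSE: {rtn_mse:.6f}  learned rounding MSE: {loss.item():.6f}")
```

Because the offset is bounded to [-0.5, 0.5], it can only flip a weight's rounding decision up or down, which is why the method stays close to round-to-nearest while recovering most of the lost accuracy.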

Two capability tests are run via Open WebUI on an Nvidia RTX 6000 with 48GB of VRAM. The first asks the model to identify all downstream code changes required by a TypeScript function signature update across multiple files; it correctly lists controllers, routes, tests, and API clients, and adds backward-compatibility considerations. The second generates a complex self-contained HTML dashboard with real-time animated data visualization. Both outputs are evaluated candidly, making this a useful reference for developers considering INT4 quantization for production or edge deployments.
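
Open WebUI talks to vLLM through its OpenAI-compatible endpoint, so the same tests can be reproduced without the UI by querying the backend directly. This sketch assumes a local server started with `vllm serve <model>` on the default port 8000; the model ID and the prompt wording are placeholders, not the exact ones from the video:

```python
# Query the local vLLM OpenAI-compatible server directly, the same way
# Open WebUI does behind the scenes. Assumes `vllm serve <model>` is
# running on port 8000; the model name is a placeholder checkpoint ID.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Intel/Qwen3.5-9B-int4-AutoRound",  # placeholder model ID
    messages=[{
        "role": "user",
        "content": "A TypeScript function signature changed from "
                   "getUser(id: string) to getUser(id: string, opts?: FetchOpts). "
                   "List every downstream change needed across the codebase.",
    }],
    temperature=0.2,
)
print(response.choices[0].message.content)
```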


📺 Source: Fahd Mirza · Published March 06, 2026
🏷️ Format: Tutorial Demo
