Qwen3.5 9B at 4-Bit: Intel’s Quantized Model Runs Locally with 4x Less VRAM


Description:

Fahd Mirza demonstrates running Intel's AutoRound INT4 quantized version of Qwen3.5 9B locally using vLLM, covering both the practical setup and the underlying mechanism that lets this quantization method retain accuracy better than naive alternatives. The full BF16 model requires 18GB of VRAM; the INT4 version drops that to 4–6GB, roughly a fourfold reduction, while preserving reasoning quality for most tasks.
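
Assuming the quantized checkpoint is published in a vLLM-compatible format, loading it locally takes only a few lines with vLLM's offline Python API. The model ID below is a placeholder, not the actual repository name from the video, so substitute the real one:

```python
# Minimal sketch of loading an INT4 AutoRound checkpoint with vLLM's
# offline API. The model ID is hypothetical; use the repo shown in the
# video. vLLM detects the quantization scheme from the checkpoint's
# config, so no explicit quantization flag is usually needed.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Intel/Qwen3.5-9B-int4-AutoRound",  # placeholder model ID
    dtype="bfloat16",        # compute dtype; INT4 remains the storage format
    max_model_len=8192,      # cap the context to keep the KV cache small
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain INT4 quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)
```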

The technical explanation is a highlight: where naive INT4 quantization simply rounds each weight to the nearest 4-bit value, Intel's AutoRound uses signed gradient descent to learn the best rounding direction for each weight individually. This per-weight optimization is why the compressed model avoids the accuracy collapse that simpler quantization methods suffer. The video also clarifies why vLLM still loads weights in BF16 dtype at inference time: INT4 is the storage format, but matrix multiplications require higher precision to prevent compounding errors.
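
To make the mechanism concrete, here is a toy sketch of the core idea, not Intel's actual implementation: learn a per-weight rounding offset with signed gradient descent so the quantized layer reproduces the full-precision output, then dequantize to float for the matrix multiply. All tensor shapes, the synthetic calibration data, and the hyperparameters are illustrative:

```python
import torch

torch.manual_seed(0)
W = torch.randn(64, 64)        # full-precision weights of one linear layer
X = torch.randn(256, 64)       # calibration activations (synthetic here)
Y = X @ W.T                    # full-precision reference output

# Symmetric INT4: integer levels in [-8, 7], one scale per output channel.
scale = W.abs().amax(dim=1, keepdim=True) / 7.0

def dequant(W, offset):
    # Quantize with a (possibly learnable) rounding offset, then
    # dequantize back to float: matmuls run on the float tensor,
    # so INT4 is only the storage format.
    v = W / scale + offset
    q = v + (v.round() - v).detach()        # straight-through estimator
    return torch.clamp(q, -8, 7) * scale

# Baseline: plain round-to-nearest (offset fixed at zero).
rtn_mse = (X @ dequant(W, torch.zeros_like(W)).T - Y).pow(2).mean().item()

# AutoRound-style: tune the offset using only the *sign* of the gradient.
offset = torch.zeros_like(W, requires_grad=True)
steps, lr = 200, 1.0 / 200
for _ in range(steps):
    loss = (X @ dequant(W, offset).T - Y).pow(2).mean()
    loss.backward()
    with torch.no_grad():
        offset -= lr * offset.grad.sign()   # signed gradient descent step
        offset.clamp_(-0.5, 0.5)            # offset only nudges the rounding
        offset.grad.zero_()

print(f"round-to-nearest MSE: {rtn_mse:.6f}  learned rounding MSE: {loss.item():.6f}")
```

Because the offset is bounded to [-0.5, 0.5], it can only flip a weight's rounding decision up or down, which is why the method stays close to round-to-nearest while recovering most of the lost accuracy.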

Two capability tests are run via Open WebUI on an Nvidia RTX 6000 with 48GB of VRAM. The first asks the model to identify all downstream code changes required by a TypeScript function signature update across multiple files; it correctly lists controllers, routes, tests, and API clients, and adds backward-compatibility considerations. The second generates a complex self-contained HTML dashboard with real-time animated data visualization. Both outputs are evaluated candidly, making this a useful reference for developers considering INT4 quantization for production or edge deployments.
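
Open WebUI talks to vLLM through its OpenAI-compatible endpoint, so the same tests can be reproduced without the UI by querying the backend directly. This sketch assumes a local server started with `vllm serve <model>` on the default port 8000; the model ID and the prompt wording are placeholders, not the exact ones from the video:

```python
# Query the local vLLM OpenAI-compatible server directly, the same way
# Open WebUI does behind the scenes. Assumes `vllm serve <model>` is
# running on port 8000; the model name is a placeholder checkpoint ID.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Intel/Qwen3.5-9B-int4-AutoRound",  # placeholder model ID
    messages=[{
        "role": "user",
        "content": "A TypeScript function signature changed from "
                   "getUser(id: string) to getUser(id: string, opts?: FetchOpts). "
                   "List every downstream change needed across the codebase.",
    }],
    temperature=0.2,
)
print(response.choices[0].message.content)
```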


📺 Source: Fahd Mirza · Published March 06, 2026
🏷️ Format: Tutorial Demo
