Intel Squeezed Gemma-4 31B into INT4 – Run It Locally with Half the Memory


Description:

Fahd Mirza demonstrates how Intel’s AutoRound toolkit can quantize Google’s Gemma 4 31B multimodal model from FP16 down to INT4, cutting VRAM requirements from over 60GB to under 19GB while preserving the model’s reasoning and vision capabilities. The video walks through the full setup on a system in Sydney, Australia, covering Conda environment creation; installation of PyTorch, Transformers, and AutoRound; and the key quantization settings: 4-bit weights, a group size of 128, and symmetric quantization.
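Based on the settings cited in the video, a quantization run would look roughly like the following. This is a minimal sketch assuming the auto-round package’s Python API; the model id and output path are illustrative, and exact argument names can vary between versions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "google/gemma-4-31b"  # illustrative model id

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The settings mentioned in the video: 4-bit weights, group size 128, symmetric.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized("./gemma-4-31b-int4")
```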

A key portion of the video explains what differentiates AutoRound from naive quantization methods. Rather than accepting the accuracy loss that comes with rounding weights to the nearest INT4 value, AutoRound runs a calibration loop using sign gradient descent: it iteratively adjusts the quantization scaling factors (alpha and beta) to minimize the error between the dequantized INT4 weights and the original FP16 values. The result, Mirza argues, is compression without the typical accuracy cliff.
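To make the mechanism concrete, here is a toy, self-contained illustration of the idea (not Intel’s implementation): a learnable rounding offset is tuned with signed gradients, through a straight-through estimator, to shrink the reconstruction error of the fake-quantized weights:

```python
import torch

def toy_signed_round_calibration(w, bits=4, steps=200, lr=5e-3):
    """Toy stand-in for AutoRound-style calibration.

    Learns a per-weight rounding offset v (kept in [-0.5, 0.5]) with
    sign gradient descent so the dequantized INT4 weights reconstruct
    the FP16 originals more closely than plain round-to-nearest.
    """
    qmax = 2 ** (bits - 1) - 1            # +7 for signed INT4
    qmin = -(2 ** (bits - 1))             # -8
    scale = w.abs().max() / qmax          # per-tensor symmetric scale
    v = torch.zeros_like(w, requires_grad=True)

    for _ in range(steps):
        x = w / scale + v
        q = torch.clamp(torch.round(x), qmin, qmax)
        # Straight-through estimator: round() has zero gradient, so pass
        # gradients through x while keeping the quantized forward value.
        q_ste = x + (q - x).detach()
        loss = ((q_ste * scale - w) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            v -= lr * v.grad.sign()       # the "sign gradient descent" step
            v.clamp_(-0.5, 0.5)
            v.grad.zero_()

    return torch.clamp(torch.round(w / scale + v.detach()), qmin, qmax), scale

w = torch.randn(256, 256)
q, s = toy_signed_round_calibration(w)
print("reconstruction MSE:", ((q * s - w) ** 2).mean().item())
```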

The quantized model is tested on multimodal tasks: analyzing a heavily congested traffic image (correctly identifying vehicle types, lane count, and movement state) and translating sentences from approximately 30 languages into English while identifying each source language. CPU inference is also demonstrated, though Mirza notes the significant speed penalty even on a high-core-count system. For practitioners wanting to run a capable 31B multimodal model on mid-tier or consumer GPU hardware, the video provides a reproducible path using Intel’s open-source tooling.
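For reference, loading the quantized checkpoint for a text-generation test like the translation demo could look roughly like this. This sketch assumes the checkpoint was exported in a Transformers-loadable format with the auto-round runtime installed; the path and prompt are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical local path to the INT4 checkpoint produced earlier.
model_dir = "./gemma-4-31b-int4"

tok = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="auto",   # places layers on GPU(s); falls back to CPU if none
    torch_dtype="auto",
)

prompt = "Translate to English and name the source language: 'Wie spät ist es?'"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```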


📺 Source: Fahd Mirza · Published April 10, 2026
🏷️ Format: Tutorial Demo
