Intel Squeezed Gemma-4 31B into INT4 – Run It Locally with Half the Memory


Description:

Fahd Mirza demonstrates how Intel’s AutoRound toolkit can quantize Google’s Gemma 4 31B multimodal model from FP16 down to INT4, cutting VRAM requirements from over 60GB to under 19GB while preserving the model’s reasoning and vision capabilities. The video walks through the full setup on a system in Sydney, Australia, covering Conda environment creation; installation of PyTorch, Transformers, and AutoRound; and the key quantization settings: 4-bit weights, a group size of 128, and symmetric quantization.
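Based on the settings cited in the video, a quantization run would look roughly like the following. This is a minimal sketch assuming the auto-round package’s Python API; the model id and output path are illustrative, and exact argument names can vary between versions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "google/gemma-4-31b"  # illustrative model id

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The settings mentioned in the video: 4-bit weights, group size 128, symmetric.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized("./gemma-4-31b-int4")
```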

A key portion of the video explains what differentiates AutoRound from naive quantization methods. Rather than accepting the accuracy loss that comes with rounding weights to the nearest INT4 value, AutoRound runs a calibration loop using sign gradient descent: it iteratively adjusts the quantization scaling factors (alpha and beta) to minimize the error between the dequantized INT4 weights and the original FP16 values. The result, Mirza argues, is compression without the typical accuracy cliff.
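To make the mechanism concrete, here is a toy, self-contained illustration of the idea (not Intel’s implementation): a learnable rounding offset is tuned with signed gradients, through a straight-through estimator, to shrink the reconstruction error of the fake-quantized weights:

```python
import torch

def toy_signed_round_calibration(w, bits=4, steps=200, lr=5e-3):
    """Toy stand-in for AutoRound-style calibration.

    Learns a per-weight rounding offset v (kept in [-0.5, 0.5]) with
    sign gradient descent so the dequantized INT4 weights reconstruct
    the FP16 originals more closely than plain round-to-nearest.
    """
    qmax = 2 ** (bits - 1) - 1            # +7 for signed INT4
    qmin = -(2 ** (bits - 1))             # -8
    scale = w.abs().max() / qmax          # per-tensor symmetric scale
    v = torch.zeros_like(w, requires_grad=True)

    for _ in range(steps):
        x = w / scale + v
        q = torch.clamp(torch.round(x), qmin, qmax)
        # Straight-through estimator: round() has zero gradient, so pass
        # gradients through x while keeping the quantized forward value.
        q_ste = x + (q - x).detach()
        loss = ((q_ste * scale - w) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            v -= lr * v.grad.sign()       # the "sign gradient descent" step
            v.clamp_(-0.5, 0.5)
            v.grad.zero_()

    return torch.clamp(torch.round(w / scale + v.detach()), qmin, qmax), scale

w = torch.randn(256, 256)
q, s = toy_signed_round_calibration(w)
print("reconstruction MSE:", ((q * s - w) ** 2).mean().item())
```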

The quantized model is tested on multimodal tasks: analyzing a heavily congested traffic image (correctly identifying vehicle types, lane count, and movement state) and translating sentences from approximately 30 languages into English while identifying each source language. CPU inference is also demonstrated, though Mirza notes the significant speed penalty even on a high-core-count system. For practitioners wanting to run a capable 31B multimodal model on mid-tier or consumer GPU hardware, the video provides a reproducible path using Intel’s open-source tooling.
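For reference, loading the quantized checkpoint for a text-generation test like the translation demo could look roughly like this. This sketch assumes the checkpoint was exported in a Transformers-loadable format with the auto-round runtime installed; the path and prompt are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical local path to the INT4 checkpoint produced earlier.
model_dir = "./gemma-4-31b-int4"

tok = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="auto",   # places layers on GPU(s); falls back to CPU if none
    torch_dtype="auto",
)

prompt = "Translate to English and name the source language: 'Wie spät ist es?'"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```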


📺 Source: Fahd Mirza · Published April 10, 2026
🏷️ Format: Tutorial Demo
