Gemma 4 12B QAT + MTP on llama.cpp Locally – Twice the Speed, Same Quality?

Description:

This video by Fahd Mirza walks through running Google’s newly released Gemma 4 12B QAT (Quantization-Aware Training) model with llama.cpp’s freshly merged Multi-Token Prediction (MTP) support, both on a local Ubuntu machine equipped with an NVIDIA RTX 6000 GPU (48GB VRAM).

Mirza explains what makes each technique meaningful: QAT trains the model to tolerate 4-bit compression from the start rather than quantizing after the fact, shrinking the model from ~24GB to ~7GB while preserving quality closer to the full-precision version. MTP bakes extra prediction heads directly into the model weights so a single forward pass can draft multiple tokens at once, pushing throughput from roughly 60 to over 120 tokens per second on a 12GB GPU with no output-quality penalty. The video covers the exact huggingface-cli download commands, the llama.cpp build/update steps needed to include the MTP commit, and the server launch flags Google recommends for sampling.
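The size and speed claims above can be sanity-checked with a small Python sketch. It is a toy illustration, not llama.cpp's implementation: `weight_size_gb`, `draft_and_verify`, `toy_target`, and `toy_draft` are hypothetical names, the "models" are simple arithmetic functions, and the drafter is a separate callable standing in for the MTP heads. The size helper counts raw weights only, so 4-bit storage of 12B parameters comes out at 6GB; the gap to the ~7GB quoted in the video presumably reflects overhead such as quantization scales or tensors kept at higher precision, which is an assumption here.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes), ignoring any overhead."""
    return n_params * bits_per_weight / 8 / 1e9


def draft_and_verify(target: NextTokenFn, draft: NextTokenFn,
                     prompt: List[Token], n_new: int,
                     k: int = 3) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify decoding (toy version).

    Returns the generated tokens and the number of verification rounds.
    Every emitted token is the target's own choice, so the output matches
    plain greedy decoding with the target alone.
    """
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_new:
        rounds += 1
        # 1. The cheap drafter proposes k tokens in a row.
        proposal: List[Token] = []
        ctx = list(out)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. The target checks each proposed position. A real engine would do
        #    this in one batched forward pass; this sketch loops for clarity.
        for t in proposal:
            expected = target(out)
            out.append(expected)      # always keep the target's token
            if expected != t:
                break                 # mismatch: discard the rest of the draft
        else:
            # All k drafts accepted: the same pass yields one bonus token.
            out.append(target(out))
    return out[len(prompt):len(prompt) + n_new], rounds


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical 'full model': counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical drafter: agrees with the target except after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(weight_size_gb(12e9, 16), weight_size_gb(12e9, 4))
    tokens, rounds = draft_and_verify(toy_target, toy_draft, [0], n_new=12)
    print(tokens, "in", rounds, "verification rounds")
```

In this toy run the 12 tokens are produced in fewer verification rounds than tokens, while the sequence stays identical to decoding with the target alone. That is the intuition behind higher throughput with no quality penalty; the actual speedup depends on how often the drafts are accepted.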

Quality is tested live through a reasoning/joke prompt and a challenging Oracle 11g SQL debugging task, where the model correctly identifies a ROWNUM ordering trap and rewrites the query. Viewers comfortable with llama.cpp who want to push Gemma 4 performance on consumer or prosumer hardware will find specific, reproducible guidance here.


📺 Source: Fahd Mirza · Published June 07, 2026
🏷️ Format: Tutorial Demo
