Description:
Fahd Mirza demonstrates Google's newly released MTP (multi-token prediction) draft models for the Gemma 4 family, running live tests on an NVIDIA H100 80GB GPU. The video explains the speculative decoding mechanism: a small drafter model predicts several tokens ahead in parallel, and the full 31-billion-parameter Gemma 4 model verifies those predictions in a single forward pass — preserving output quality while dramatically cutting inference time.
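The draft-then-verify loop described above can be illustrated with a toy sketch. This is not the Gemma 4 or MTP drafter API; the "models" here are deterministic stand-in functions (an assumption for illustration) that show why verifying k drafted tokens per target pass cuts the number of expensive forward passes:

```python
# Toy sketch of speculative decoding. The drafter cheaply proposes k tokens;
# the "expensive" target model verifies them in one pass and accepts the
# longest prefix that matches what it would have generated itself.
# Both models below are deterministic stand-ins, not real LLMs.

def draft_propose(context, k):
    """Cheap drafter: guess the next k tokens (toy rule: count upward)."""
    return [(context[-1] + i + 1) % 100 for i in range(k)]

def target_next_token(context):
    """Expensive target model: the ground-truth next token (same toy rule,
    so this drafter happens to be perfect and every proposal is accepted)."""
    return (context[-1] + 1) % 100

def speculative_decode(context, total_tokens, k=4):
    """Generate total_tokens; each target 'forward pass' verifies k drafts."""
    out = list(context)
    target_calls = 0
    while len(out) - len(context) < total_tokens:
        proposed = draft_propose(out, k)
        target_calls += 1  # one verification pass covers all k proposals
        ctx = list(out)
        for tok in proposed:
            if target_next_token(ctx) == tok:
                ctx.append(tok)           # draft accepted
            else:
                ctx.append(target_next_token(ctx))  # first mismatch: take
                break                               # the target's token, stop
        out = ctx
    return out[len(context):][:total_tokens], target_calls

tokens, calls = speculative_decode([0], 8, k=4)
print(tokens, calls)  # 8 tokens generated with only 2 target passes
```

With an imperfect drafter some proposals would be rejected at the first mismatch, which is why real-world speedups depend on the drafter's acceptance rate, but output always matches what the target model alone would produce.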
In live benchmarks using a hospital-management-system design prompt that generates approximately 2,048 tokens, the MTP drafter configuration achieved 27.4 tokens per second, compared with just 8.8 tokens per second without the drafter — a roughly 3x speedup consistent with Google's published claims on A100 hardware, and likely conservative given the H100's greater throughput. Total VRAM consumption held at around 62–63GB for the full model stack including the drafter.
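A quick sanity check of those benchmark figures, using the numbers reported in the video:

```python
# Benchmark figures reported in the video (H100, ~2,048-token generation).
with_drafter = 27.4      # tokens/s with the MTP drafter
without_drafter = 8.8    # tokens/s, plain single-token decoding
tokens = 2048

speedup = with_drafter / without_drafter
time_with = tokens / with_drafter        # wall-clock seconds, drafter on
time_without = tokens / without_drafter  # wall-clock seconds, drafter off

print(f"speedup: {speedup:.2f}x")                              # 3.11x
print(f"{time_without:.0f}s -> {time_with:.0f}s per response")  # 233s -> 75s
```

So for this prompt the drafter turns a roughly four-minute generation into a little over one minute, matching the "roughly 3x" framing above.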
Mirza walks through the complete installation process — setting up a Python virtual environment, logging into Hugging Face, and downloading both the 31B primary model and the 939MB drafter — then runs the same inference script with and without the companion model to produce a direct comparison. The video also contextualizes MTP drafters within the broader inference optimization landscape by comparing the approach to D-Flash and P-Flash techniques. For developers who found Gemma 4’s original single-token generation painfully slow, the companion drafter effectively closes much of the latency gap with no change in output fidelity.
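The installation steps Mirza walks through can be sketched roughly as follows. The repository IDs and script name below are placeholders (assumptions for illustration), not the actual names used in the video:

```shell
# Sketch of the setup described in the video (repo IDs are hypothetical).
python3 -m venv .venv && source .venv/bin/activate
pip install -U transformers accelerate huggingface_hub

# Authenticate with Hugging Face for gated model access.
huggingface-cli login

# Download the 31B primary model and the ~939MB MTP drafter.
# NOTE: "google/gemma-4-31b" and "google/gemma-4-mtp-drafter" are
# placeholder names, not confirmed repository IDs.
huggingface-cli download google/gemma-4-31b
huggingface-cli download google/gemma-4-mtp-drafter

# Run the same inference script with and without the companion model
# (script and flags are illustrative) to reproduce the comparison.
python run_inference.py --model google/gemma-4-31b            # baseline
python run_inference.py --model google/gemma-4-31b \
    --drafter google/gemma-4-mtp-drafter                      # MTP-accelerated
```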
📺 Source: Fahd Mirza · Published May 05, 2026
🏷️ Format: Benchmark Test
