Description:
Fireship’s Code Report covers Google’s release of Gemma 4, a large language model published under the Apache 2.0 license, making it one of the few genuinely unrestricted open-source models from a major American tech company. Unlike Meta’s Llama, whose license carries a commercial-use carve-out, or OpenAI’s GPT OSS models (which are larger yet score lower on benchmarks), Gemma 4 can be used, modified, and commercialized with no strings attached.
The more surprising story is the size. The 31-billion-parameter Gemma 4 can run on a single RTX 4090 from a roughly 20 GB download at about 10 tokens per second, while reaching benchmark scores comparable to Kimi K2.5, a model that needs 600+ GB of weights, 256 GB of RAM, and multiple H100s to run. The video explains two key techniques behind this efficiency (both sketched below): TurboQuant, a Google research method that compresses model weights by converting Cartesian coordinates to polar form and applying a Johnson-Lindenstrauss transform, reducing each dimension to a single sign bit while approximately preserving relative distances; and per-layer embeddings, where each transformer layer gets its own token representation instead of carrying a single input embedding through the whole stack.
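To make the TurboQuant idea concrete, here is a minimal numpy sketch of the sign-bit step only: a random Gaussian (Johnson-Lindenstrauss-style) projection followed by keeping one sign bit per projected dimension, which approximately preserves angles between vectors. The polar-coordinate conversion is omitted, and all names and sizes are illustrative, not Google’s actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sign_bit_code(x: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Randomly project a weight vector, then keep only the sign of each coordinate."""
    return (x @ proj) >= 0  # one bit per projected dimension

d, k = 4096, 512                      # original dimension, number of sign bits
proj = rng.standard_normal((d, k))    # JL-style random Gaussian projection

a = rng.standard_normal(d)
b = a + 0.3 * rng.standard_normal(d)  # a nearby vector

code_a, code_b = sign_bit_code(a, proj), sign_bit_code(b, proj)

# For Gaussian projections, P(sign differs) = angle(a, b) / pi, so the
# fraction of differing bits estimates the angle between a and b.
hamming = np.mean(code_a != code_b)
true_angle = np.arccos(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"estimated angle: {np.pi * hamming:.3f} rad, true: {true_angle:.3f} rad")
```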
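And a toy PyTorch sketch of per-layer embeddings: each block keeps its own small embedding table and re-looks-up the token ids, rather than relying only on the single input embedding. Module names and dimensions here are invented for illustration and do not reflect Gemma’s actual architecture.

```python
import torch
import torch.nn as nn

class PerLayerEmbeddingBlock(nn.Module):
    """Toy transformer block that mixes in its own small token embedding."""
    def __init__(self, vocab: int, d_model: int, d_ple: int):
        super().__init__()
        self.ple = nn.Embedding(vocab, d_ple)         # this layer's own table
        self.mix = nn.Linear(d_model + d_ple, d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())

    def forward(self, h: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        e = self.ple(token_ids)                       # this layer's token view
        h = self.mix(torch.cat([h, e], dim=-1))       # inject it into the stream
        return h + self.ff(h)

vocab, d_model, d_ple, n_layers = 32_000, 256, 32, 4
layers = nn.ModuleList(PerLayerEmbeddingBlock(vocab, d_model, d_ple)
                       for _ in range(n_layers))
shared = nn.Embedding(vocab, d_model)                 # the usual input embedding

ids = torch.randint(0, vocab, (2, 16))                # (batch, seq)
h = shared(ids)
for layer in layers:                                  # each layer re-looks-up ids
    h = layer(h, ids)
print(h.shape)  # torch.Size([2, 16, 256])
```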
The video also briefly benchmarks Gemma 4 running locally with Ollama and notes its potential as a fine-tuning base using tools like Unsloth. For developers who want capable local inference without data-center hardware, Gemma 4 represents a meaningful shift in what is practically accessible.
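For reference, a minimal way to try the model from Python via the ollama package, assuming the Ollama server is running; the gemma4 tag is a guess, so check `ollama list` or the Ollama model library for the actual name.

```python
import ollama  # pip install ollama

# "gemma4" is a hypothetical tag; substitute whatever name the model
# is actually published under in the Ollama library.
response = ollama.generate(model="gemma4", prompt="Explain JL transforms briefly.")
print(response["response"])
```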
📺 Source: Fireship · Published April 08, 2026
🏷️ Format: News Analysis