Description:
Sam Witteveen breaks down IBM's Granite Speech 4.1 release: a suite of three ~2-billion-parameter ASR models designed for edge deployment that collectively challenge established leaders such as OpenAI's Whisper, the community WhisperX build, and NVIDIA's Parakeet. The video opens with context on IBM's broader Granite model family, which spans language, vision, speech, and embedding models, before zeroing in on what makes this speech release distinctly different: three specialized variants rather than a single general-purpose model.
The base model, Granite Speech 4.1 2B, currently sits atop the Hugging Face Open ASR Leaderboard with a word error rate of 5.33 and a real-time factor of roughly 231, meaning an hour of audio is transcribed in about 16 seconds. It supports seven languages plus bidirectional translation and includes keyword biasing, letting users inject domain-specific terms or names directly into the prompt. The Plus variant adds speaker diarization and word-level timestamps, beating customized Whisper builds on timestamp accuracy, which makes it the go-to choice for podcast and meeting transcription use cases.
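The throughput claim is easy to sanity-check. A quick sketch, treating the real-time factor (RTF) as audio duration divided by processing time (so higher means faster); the `transcription_time` helper name is ours, and the 231 figure comes from the video:

```python
def transcription_time(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock seconds to process a clip at a given RTF
    (RTF defined here as audio duration / processing time)."""
    return audio_seconds / rtf

hour = 3600.0
# RTF ~231 -> roughly 15.6 s for an hour of audio, matching the
# "about 16 seconds" figure quoted in the video.
print(f"{transcription_time(hour, 231):.1f} s")
```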
The most technically ambitious of the three is the non-autoregressive model (the "N" variant), which uses IBM's proprietary NLE (Non-autoregressive LLM-based Editing) technique. Rather than generating tokens sequentially, it edits a fast CTC draft transcript using bidirectional attention, achieving a real-time factor of 1,820 on an H100 GPU, or roughly one hour of audio in two seconds, with only a modest accuracy tradeoff. Witteveen walks through the architectural reasoning clearly, making this a strong resource for developers evaluating ASR infrastructure.
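To make the draft-then-edit idea concrete, here is a toy sketch of the concept, not IBM's actual NLE algorithm: a fast first pass yields a draft with per-token confidences, and a single parallel edit pass revises every low-confidence token independently (no left-to-right dependency), which is what makes the approach non-autoregressive. The draft tokens, confidences, and correction table below are all invented for illustration:

```python
# Hypothetical draft from a fast CTC-style decoder: (token, confidence) pairs.
draft = [("the", 0.98), ("granit", 0.42), ("models", 0.95), ("run", 0.91),
         ("on", 0.97), ("edge", 0.88), ("devises", 0.35)]

# Stand-in for the editor's output; a real system would predict corrections
# from bidirectional context in a single forward pass.
corrections = {"granit": "granite", "devises": "devices"}

THRESHOLD = 0.5

# One parallel edit pass: each token is decided independently, so the whole
# transcript can be revised at once instead of token by token.
edited = [corrections.get(tok, tok) if conf < THRESHOLD else tok
          for tok, conf in draft]

print(" ".join(edited))  # -> "the granite models run on edge devices"
```

The speedup in the real model comes from the same property the toy shows: every edit decision is made in parallel, so decoding cost no longer grows with sequential generation steps.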
📺 Source: Sam Witteveen · Published May 07, 2026
🏷️ Format: Deep Dive
