Description:
Fahd Mirza provides a hands-on walkthrough of NVIDIA’s newly released NeMo-Tron 3.5 ASR, a 600-million-parameter streaming speech recognition model that handles 40 language locales with a single unified architecture. In the demo, run on an NVIDIA RTX A6000 with 48GB of VRAM, the model consumes just 439MB during inference, a footprint small enough to make running it on CPU hardware practical. It delivers punctuated, capitalized transcriptions in real time with configurable chunk sizes as low as 80 milliseconds, and can sustain 70x more concurrent streams on a single GPU than its predecessors.
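To make the chunk sizes concrete, here is a minimal sketch of how a chunk duration maps to audio samples and how a waveform could be split for streaming. The 16 kHz sample rate and the helper names `samples_per_chunk` and `iter_chunks` are assumptions for illustration; the description does not state the model's input rate or its streaming API.

```python
# Assumed sample rate: 16 kHz is common for ASR, but the source does not state it.
SAMPLE_RATE_HZ = 16_000


def samples_per_chunk(chunk_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of audio samples covered by one chunk of chunk_ms milliseconds."""
    return sample_rate_hz * chunk_ms // 1000


def iter_chunks(samples, chunk_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Yield consecutive chunks of the waveform; the last one may be shorter."""
    step = samples_per_chunk(chunk_ms, sample_rate_hz)
    for start in range(0, len(samples), step):
        yield samples[start:start + step]


one_second = [0.0] * SAMPLE_RATE_HZ          # one second of silence
chunks_80ms = list(iter_chunks(one_second, 80))

print(samples_per_chunk(80))    # 1280 samples per 80 ms chunk
print(samples_per_chunk(320))   # 5120 samples per 320 ms chunk
print(len(chunks_80ms))         # 13 chunks: 12 full plus one partial
```

Smaller chunks mean the model is called more often per second of audio, so the chunk size is a direct trade-off between latency and per-stream compute.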
Mirza explains the model’s cache-aware Fast Conformer RNN-T architecture in detail, walking through how language identity is injected at every frame via a 128-dimensional one-hot vector concatenated with the acoustic embedding. This design eliminates the need for separate per-language models or a standalone language detection component. In auto-detect mode, the model identifies the language on the fly; explicit language hinting provides the greatest benefit for underrepresented languages like Ukrainian and Hindi, where auto-detect showed a few additional percentage points of error.
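The language-conditioning idea can be sketched in a few lines: build a 128-slot one-hot vector for the chosen locale and append it to every acoustic frame. This is an illustration only, not the model's actual code. The locale-to-index table `LANG_INDEX`, the helper names, and the 512-wide frame embedding are all hypothetical; only the 128-dimensional one-hot vector and the per-frame concatenation come from the description.

```python
NUM_LANG_SLOTS = 128  # from the description: a 128-dimensional one-hot vector

# Hypothetical locale-to-slot table; the real model's mapping is not given.
LANG_INDEX = {"en-US": 0, "de-DE": 1, "uk-UA": 2, "hi-IN": 3}


def one_hot_language(locale):
    """Return a 128-dim vector with a single 1.0 at the locale's slot."""
    vec = [0.0] * NUM_LANG_SLOTS
    vec[LANG_INDEX[locale]] = 1.0
    return vec


def condition_frames(frames, locale):
    """Concatenate the same language vector onto every acoustic frame."""
    lang = one_hot_language(locale)
    return [list(frame) + lang for frame in frames]


EMBED_DIM = 512  # assumed acoustic embedding width, for illustration only
frames = [[0.1] * EMBED_DIM for _ in range(4)]   # four dummy frames
conditioned = condition_frames(frames, "de-DE")

print(len(conditioned), len(conditioned[0]))  # 4 640
```

Because the language signal travels with each frame, a single set of weights can serve every locale, which is what removes the need for per-language models.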
Testing uses real human voice recordings from Google’s FLEURS dataset across multiple languages, with Mirza noting strong performance on English and German (8–9% word error rate at a 320ms chunk size) and generally acceptable quality across most supported locales. He flags Ukrainian and Hindi as showing the biggest performance gaps, and as the languages where specifying a language ID at inference time meaningfully improves results. He also invites native speakers in the comments to verify output quality for less familiar languages.
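For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, insertions) between the model's output and the reference transcript, divided by the number of reference words. The sketch below is a generic implementation of that definition, not the scoring script used in the video; the function name and the example sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


reference = "the quick brown fox jumps over the lazy dog today"
hypothesis = "the quick brown fox jumped over the lazy dog today"
print(word_error_rate(reference, hypothesis))  # 0.1, one substitution in ten words
```

An 8–9% WER therefore corresponds to roughly one wrong, missing, or extra word in every eleven to twelve reference words.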
📺 Source: Fahd Mirza · Published June 12, 2026
🏷️ Format: Tutorial Demo







