The NEW Best ASR – NVIDIA Nemotron 3.5 ASR

Description:

NVIDIA’s Nemotron 3.5 ASR is a 600-million-parameter streaming speech recognition model that transcribes 40 languages from a single checkpoint and can be fully self-hosted. In this detailed walkthrough, AI developer Sam Witteveen explains what sets Nemotron 3.5 apart technically from existing solutions like Whisper: a mechanism called cache-aware streaming. Rather than re-encoding overlapping audio chunks on every pass, the model caches encoder self-attention states and reuses them as new audio arrives, which is conceptually similar to KV caching in large language model decoding. NVIDIA reports up to 17x efficiency gains on H100 hardware, and Witteveen says he saw noticeably faster throughput when running the model on a DGX system.
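To make the caching idea concrete, here is a minimal toy sketch in plain Python. It is not NVIDIA's implementation: the "encoder" is a stand-in made of causal averaging layers, and `LEFT_CONTEXT`, `CachedEncoder`, and `encode_rebuffered` are hypothetical names chosen for illustration. It only shows the principle that keeping each layer's recent inputs lets a streaming encoder process every frame once, while a re-buffering approach re-encodes the overlap on every chunk.

```python
from collections import deque

LEFT_CONTEXT = 4  # assumption: frames of history each toy layer may look at


def layer(frames, history):
    """Toy causal layer: each output is the mean of a frame and its left context."""
    ctx = list(history)
    out = []
    for x in frames:
        window = (ctx + [x])[-(LEFT_CONTEXT + 1):]
        out.append(sum(window) / len(window))
        ctx.append(x)
    return out


def encode_offline(frames, num_layers=2):
    """Reference: encode the whole sequence at once."""
    x = frames
    for _ in range(num_layers):
        x = layer(x, [])
    return x


class CachedEncoder:
    """Streaming encoder that keeps a per-layer cache of recent layer inputs."""

    def __init__(self, num_layers=2):
        self.caches = [deque(maxlen=LEFT_CONTEXT) for _ in range(num_layers)]
        self.frames_computed = 0

    def step(self, chunk):
        x = chunk
        for cache in self.caches:
            y = layer(x, cache)       # reuse cached history, compute new frames only
            cache.extend(x)           # remember this layer's inputs for the next chunk
            self.frames_computed += len(x)
            x = y
        return x


def encode_rebuffered(frames, chunk_size, num_layers=2):
    """Baseline without caches: re-encode overlapping left context for every chunk."""
    overlap = LEFT_CONTEXT * num_layers  # receptive field of the stacked toy layers
    out, computed = [], 0
    for start in range(0, len(frames), chunk_size):
        window = frames[max(0, start - overlap): start + chunk_size]
        encoded = encode_offline(window, num_layers)
        computed += len(window) * num_layers
        n_new = min(chunk_size, len(frames) - start)
        out.extend(encoded[-n_new:])
    return out, computed


if __name__ == "__main__":
    audio = [float(i % 7) for i in range(40)]  # stand-in for feature frames
    enc = CachedEncoder()
    streamed = []
    for start in range(0, len(audio), 8):
        streamed.extend(enc.step(audio[start:start + 8]))
    _, rebuffered_cost = encode_rebuffered(audio, 8)
    print("cached frame computations:    ", enc.frames_computed)
    print("re-buffered frame computations:", rebuffered_cost)
```

Both streaming variants reproduce the offline output in this toy; the difference is only in how many frame computations they spend, which is the kind of saving the cached approach is meant to deliver.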

The video walks through configurable inference chunk sizes (80 ms, 160 ms, 320 ms, 560 ms, or roughly 1 second), which let developers trade latency for transcription granularity depending on their use case. Witteveen also demonstrates word boosting, a decode-time technique that steers the model toward user-supplied vocabulary (product names, proper nouns) using a scoring tree, with no retraining required. A third feature, diarization, enables speaker-level attribution for multi-speaker audio.
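The word boosting idea can be illustrated with a small toy sketch. This is an assumption-laden simplification, not the model's actual decoder: real systems typically apply the bonus incrementally over subword tokens during beam search, whereas this example rescores a handful of complete candidate transcripts. The function names (`build_trie`, `boost_score`, `rescore`), the `bonus_per_word` parameter, and the example log-probabilities are all hypothetical.

```python
def build_trie(phrases):
    """Build a word-level prefix tree from the boosted phrases."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.lower().split():
            node = node.setdefault(word, {})
        node[None] = True  # None marks the end of a complete phrase
    return root


def boost_score(words, trie, bonus_per_word=1.0):
    """Add a bonus for every word that belongs to a fully matched boosted phrase."""
    total = 0.0
    i = 0
    while i < len(words):
        node, j, matched = trie, i, 0
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if None in node:
                matched = j - i  # longest complete phrase starting at position i
        if matched:
            total += bonus_per_word * matched
            i += matched
        else:
            i += 1  # partial matches earn nothing
    return total


def rescore(hypotheses, trie, bonus_per_word=1.0):
    """Pick the best (text, log_prob) candidate after adding the boosting bonus."""
    def boosted(h):
        text, log_prob = h
        return log_prob + boost_score(text.lower().split(), trie, bonus_per_word)
    return max(hypotheses, key=boosted)[0]


if __name__ == "__main__":
    # Made-up candidates and scores, purely for illustration.
    candidates = [
        ("open the nemo tron model", -4.0),
        ("open the nemotron model", -4.5),
    ]
    trie = build_trie(["Nemotron", "Sam Witteveen"])
    print(rescore(candidates, trie))
```

Because the bonus is applied at decode time to scores the model already produces, the vocabulary list can be changed per request without touching the model weights, which matches the "no retraining" point above.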

Witteveen runs the full demo over a local network from an NVIDIA DGX to a Mac, showing real-time transcription at different latency settings. He notes that community members have already released quantized and MLX versions of the model. For developers currently running Whisper or similar batch-oriented ASR pipelines, this video serves as a practical evaluation guide for migrating to a production-ready, low-latency streaming alternative.


📺 Source: Sam Witteveen · Published June 07, 2026
🏷️ Format: Tutorial Demo
