Description:
Cohere has quietly released a new automatic speech recognition model called Cohere Transcribe, and in this video Fahd Mirza walks through a complete local installation and live test on Ubuntu using an Nvidia RTX 6000 GPU with 48GB VRAM.
The model is a 2-billion-parameter ASR system built on a conformer architecture, which interleaves transformer-style self-attention with convolutional layers; incoming waveforms are first converted into mel spectrograms before being fed to the network. Released under an Apache 2.0 license, it supports 14 languages, including Arabic, Japanese, Korean, Polish, Spanish, and Vietnamese, and is claimed to be up to three times faster than comparable dedicated ASR models at its parameter scale. Inference requires only around 5GB of VRAM. The model is gated on Hugging Face, so the video also covers the authentication and token setup process.
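The mel-spectrogram front end mentioned above can be sketched in a few lines of NumPy. This is a generic illustration of the technique, not Cohere's actual preprocessing: the frame size (400 samples), hop length (160), and 80-bin filterbank are common ASR defaults at 16 kHz, assumed here for demonstration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=400, sr=16000):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):          # rising edge of the triangle
            if c > l:
                fb[i - 1, k] = (k - l) / (c - l)
        for k in range(c, r):          # falling edge of the triangle
            if r > c:
                fb[i - 1, k] = (r - k) / (r - c)
    return fb

def mel_spectrogram(wave, sr=16000, n_fft=400, hop=160, n_mels=80):
    # Frame the signal, window each frame, take the power spectrum,
    # then project onto the mel filterbank and take the log.
    frames = []
    for start in range(0, len(wave) - n_fft + 1, hop):
        frame = wave[start:start + n_fft] * np.hanning(n_fft)
        frames.append(np.abs(np.fft.rfft(frame)) ** 2)
    power = np.array(frames).T                    # (n_fft//2 + 1, n_frames)
    return np.log(mel_filterbank(n_mels, n_fft, sr) @ power + 1e-10)

# 1 second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000.0
mels = mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(mels.shape)  # (80, 98)
```

The resulting (n_mels, n_frames) matrix is the kind of input a conformer encoder consumes in place of the raw waveform.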
Mirza runs the model against audio samples in all 14 supported languages and measures real-time VRAM usage during transcription. He also clearly outlines the model’s known limitations: no automatic language detection, no timestamp generation, no speaker diarization, and a tendency to hallucinate text when fed silence. The transparent treatment of both strengths and shortcomings — including praise for Cohere’s unusually candid model card — makes this a practical reference for developers evaluating local or self-hosted speech-to-text options.
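Because the model can hallucinate text when fed silence, a practical mitigation for self-hosted pipelines is to gate audio on signal energy before transcribing it at all. The sketch below is an illustrative pre-filter, not anything shown in the video; the RMS threshold and frame size are assumed values you would tune for your own recordings.

```python
import numpy as np

def is_mostly_silence(wave, threshold=1e-3, frame=1600):
    # Hypothetical guard: compute per-frame RMS energy and treat the clip
    # as silence when the median frame stays below the threshold.
    rms = [np.sqrt(np.mean(wave[i:i + frame] ** 2))
           for i in range(0, len(wave), frame)]
    return bool(np.median(rms) < threshold)

silence = np.zeros(16000)                                   # 1 s of silence
tone = 0.1 * np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0)
print(is_mostly_silence(silence), is_mostly_silence(tone))  # True False
```

Clips flagged as silence can then be skipped or returned as empty transcripts instead of being passed to the ASR model.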
📺 Source: Fahd Mirza · Published March 27, 2026
🏷️ Format: Tutorial Demo
