Tencent’s Covo-Audio: Local Install & Demo of a 7B End-to-End Voice AI Model

Descriptions:

Fahd Mirza walks through a complete local installation and live demonstration of Tencent’s Covo-Audio, a 7-billion-parameter end-to-end audio language model that takes raw audio as input and produces audio output within a single unified system. Conventional voice AI pipelines chain together a speech recognizer, a language model, and a text-to-speech component; Covo-Audio handles all three stages in one model.

The video explains the architecture in practical terms: audio enters through a Whisper-large-v3 encoder, is compressed by an adapter, and feeds into a Qwen2.5-7B language backbone alongside text tokens. The model generates a mixed sequence of text tokens and discrete audio tokens, with the audio tokens drawn from a WavLM-based speech tokenizer whose codebook has approximately 16,000 entries. The audio tokens then pass through a flow matching network that expands them into a richer acoustic representation, and a BigVGAN vocoder reconstructs a 24 kHz waveform from that representation. The model ships in two variants: Covo-Audio-Chat for standard half-duplex conversations and Covo-Audio-Chat-FD for full-duplex, real-time interaction with interruption handling.
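The stage order described above can be laid out as a small data-flow sketch. This is only an illustration of the description, not Covo-Audio's code: the `Stage` class, the output labels, and the exact codebook size of 16,384 are assumptions (the summary says only "approximately 16,000").

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str    # component named in the video summary
    output: str  # descriptive label for what it emits (not an API name)


# Order follows the description; labels are illustrative only.
PIPELINE = [
    Stage("Whisper-large-v3 encoder", "audio features"),
    Stage("adapter", "compressed audio embeddings"),
    Stage("Qwen2.5-7B backbone", "text tokens + discrete audio tokens"),
    Stage("flow matching network", "acoustic representation"),
    Stage("BigVGAN vocoder", "24 kHz waveform"),
]

# Assumption: the nearest power of two to "approximately 16,000" entries.
ASSUMED_CODEBOOK_SIZE = 16_384


def bits_per_token(codebook_size: int) -> float:
    """Information carried by one token from a codebook of this size."""
    return math.log2(codebook_size)


def describe(stages) -> str:
    """Render the pipeline as a single left-to-right chain."""
    return "raw audio -> " + " -> ".join(f"[{s.name}] {s.output}" for s in stages)


print(describe(PIPELINE))
print(f"{bits_per_token(ASSUMED_CODEBOOK_SIZE):.0f} bits per audio token (assumed codebook size)")
```

Under that assumed codebook size, each discrete audio token carries 14 bits, which helps explain why a flow matching stage and a vocoder are still needed to recover a full waveform.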

Mirza runs the full installation on Ubuntu with an NVIDIA RTX A6000 (48 GB VRAM), downloads the model from Hugging Face, and demos a two-turn spoken conversation about black holes and their scale. Once loaded, the model consumes under 28 GB of VRAM. Covo-Audio currently supports only English and Chinese, and is fully open source on GitHub and Hugging Face.
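For a rough sense of the memory figures, the back-of-the-envelope estimate below computes the weights-only footprint of a nominal 7-billion-parameter backbone at a few common precisions. It is a sketch under stated assumptions: the summary does not say which precision the demo used, and the real total also includes the audio encoder, flow matching network, vocoder, and runtime buffers.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}

NOMINAL_PARAMS = 7e9     # assumption: backbone only, rounded to 7B
GPU_VRAM_GB = 48         # RTX A6000, from the summary
REPORTED_USAGE_GB = 28   # upper bound reported in the summary


def weights_gb(num_params: float, dtype: str) -> float:
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def headroom_gb(total_gb: float, used_gb: float) -> float:
    """VRAM left over after the model is loaded."""
    return total_gb - used_gb


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weights_gb(NOMINAL_PARAMS, dtype):.0f} GB for weights alone")

print(f"headroom on the demo GPU: at least {headroom_gb(GPU_VRAM_GB, REPORTED_USAGE_GB)} GB")
```

By this estimate the reported usage leaves at least 20 GB free on the 48 GB card, so a smaller GPU may be enough, though the summary does not test that.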


📺 Source: Fahd Mirza · Published April 07, 2026
🏷️ Format: Tutorial Demo
