Why TTS Models Now Look Like LLMs — Samuel Humeau, Mistral


Description:

Samuel Humeau, AI scientist at Mistral and former researcher at Facebook FAIR, delivers a technical talk at the AI Engineer conference timed to the launch of Mistral’s first open-source text-to-speech model. The presentation covers why the TTS field has converged on large language model-style autoregressive architectures, explaining how audio frames (~80ms chunks at 12 frames per second) are compressed through neural codecs into discrete tokens that transformer decoders can process sequentially — analogous to next-token prediction in text models.
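The frame-to-token step described above can be sketched in a few lines. This is an illustrative toy, not Mistral's codec: the sample rate, codebook, and energy-based "quantizer" are assumptions standing in for a learned neural vector quantizer, but the shape of the output — a short stream of discrete ids per second of audio that a transformer can predict autoregressively — matches the talk's description.

```python
# Toy sketch (all parameters assumed, not Mistral's actual codec):
# a codec compresses audio into discrete tokens, one per ~80 ms frame,
# which a transformer decoder then predicts step by step like text.

FRAME_MS = 80                                # ~80 ms of audio per frame
SAMPLE_RATE = 24_000                         # assumed codec sample rate
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000   # samples per frame (1920)

def frames_per_second() -> int:
    # ~12 frames per second, as stated in the talk
    return 1000 // FRAME_MS

def tokenize(waveform: list[float], codebook: list[float]) -> list[int]:
    """Map each 80 ms frame to the id of its nearest codebook entry.

    A real neural codec learns multi-dimensional codebooks; here we
    quantize a single scalar (mean frame energy) purely to show the
    audio -> discrete-token-stream transformation.
    """
    tokens = []
    for start in range(0, len(waveform) - FRAME_LEN + 1, FRAME_LEN):
        frame = waveform[start:start + FRAME_LEN]
        energy = sum(x * x for x in frame) / FRAME_LEN
        token = min(range(len(codebook)), key=lambda i: abs(codebook[i] - energy))
        tokens.append(token)
    return tokens
```

One second of audio (24,000 samples at the assumed rate) collapses into just 12 token positions, which is what makes sequential, LLM-style decoding over audio tractable.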

Humeau walks through the core engineering tradeoff: encoding raw audio at roughly MP3 quality takes ~200 kbps, while the semantic content of human speech amounts to only ~15 bits per second. This gap explains why specialized neural audio codecs are essential for compressing audio into a manageable token stream without losing voice characteristics. He demonstrates Mistral’s TTS model performing real-time voice cloning from a few seconds of reference audio and shows how streaming the first audio packets before full generation completes dramatically reduces perceived latency in agent pipelines.
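The bitrate gap above can be made concrete with back-of-the-envelope arithmetic. The codec layout used here (8 codebooks of 1,024 entries per frame) is an assumed example, not Mistral's published configuration; the 200 kbps and 15 bits/s figures come from the talk.

```python
# Back-of-the-envelope sketch of the bitrate gap. The 8x1024 codebook
# layout is an assumption for illustration, not Mistral's configuration.
import math

RAW_BPS = 200_000       # ~MP3-quality audio, per the talk
SEMANTIC_BPS = 15       # semantic content of speech, per the talk
FRAMES_PER_SEC = 12     # ~80 ms frames

def token_stream_bps(codebooks: int = 8, codebook_size: int = 1024) -> float:
    """Bits per second of the discrete token stream a decoder must model."""
    bits_per_token = math.log2(codebook_size)   # 10 bits for 1,024 entries
    return FRAMES_PER_SEC * codebooks * bits_per_token

print(token_stream_bps())                 # 960.0 bits/s
print(RAW_BPS / token_stream_bps())       # ≈ 208x smaller than raw audio
print(token_stream_bps() / SEMANTIC_BPS)  # ≈ 64x the purely semantic rate
```

The token stream sits between the two extremes: hundreds of times smaller than raw audio, yet carrying enough headroom over the ~15 bits/s of pure semantics to preserve speaker identity, prosody, and timbre.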

The talk includes a live demo of a voice agent built with Mistral’s speech-to-text, an LLM, and the new TTS model, illustrating an end-to-end pipeline suited for conversational AI products. Engineers building voice interfaces or real-time agent systems will find the architecture walkthrough and latency optimization techniques directly applicable.
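The latency benefit of streaming described in the talk comes down to simple arithmetic: ship the first decoded frame immediately rather than waiting for the whole utterance. The sketch below assumes, for illustration only, that the decoder produces frames at real-time rate; the function names are hypothetical, not Mistral's API.

```python
# Hedged sketch of why streaming cuts perceived latency in a
# STT -> LLM -> TTS agent pipeline. Assumes (for illustration) that the
# decoder generates frames at real-time speed; names are hypothetical.

FRAME_MS = 80  # ~80 ms of audio per decoded frame

def generate_frames(n_frames: int):
    """Stand-in for an autoregressive TTS decoder yielding one frame at a time."""
    for i in range(n_frames):
        yield f"frame-{i}"   # in reality: ~80 ms of decoded audio samples

def first_packet_latency_ms(n_frames: int, streaming: bool = True) -> int:
    """Time until the listener hears any audio at all."""
    if streaming:
        return FRAME_MS            # first frame ships as soon as it decodes
    return n_frames * FRAME_MS     # otherwise: wait for the full utterance

# A ~5-second utterance is ~60 frames: streaming gets audio to the user
# in ~80 ms instead of ~4800 ms.
```

In an agent pipeline this compounds: the LLM can also stream tokens into the TTS, so the user hears the start of a reply while the rest is still being generated.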


📺 Source: AI Engineer · Published May 09, 2026
🏷️ Format: Deep Dive
