Description:
Latent Space hosts Guillaume Lample, Mistral’s Chief Scientist, and Pavan Kumar Reddy, Head of Audio Research, for a first-party announcement of Voxtral TTS, Mistral’s first text-to-speech model. The 3-billion-parameter model supports nine languages and is built on top of the Ministral base. It introduces a novel autoregressive flow matching architecture paired with an in-house neural audio codec that converts audio into latent semantic and acoustic tokens, delivering speech at a fraction of the cost of competing TTS services while matching their quality at the base model level.
The conversation traces Mistral’s full audio model lineage: the original Voxtral ASR model released in summer 2024, a multilingual transcription update in January 2025 adding context biasing and real-time streaming, and now the generation side with Voxtral TTS. Pavan explains the architectural distinction between understanding models (audio encoder feeding continuous embeddings into a transformer decoder) and generation models (requiring the neural codec on the output side), and why the autoregressive flow matching approach was selected after iterating through several internal architectures.
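The understanding-vs-generation split Pavan describes can be sketched as two mirrored pipelines: the ASR side encodes audio into continuous embeddings consumed by a transformer decoder, while the TTS side autoregressively emits discrete latent tokens that a neural codec decodes back into a waveform. The sketch below is a hedged illustration of those data flows only; every function name, frame size, and token rate is an assumption for demonstration, not Mistral's actual implementation.

```python
import random

random.seed(0)

# Understanding (ASR) path: audio encoder -> continuous embeddings -> decoder.
# Generation (TTS) path: AR model -> discrete latent tokens -> codec decoder.
# All shapes and rates below are illustrative assumptions.

def audio_encoder(waveform, frame=160, dim=64):
    """Stand-in encoder: raw samples -> one continuous embedding per frame
    (160 samples ~ 10 ms at 16 kHz, chosen for illustration)."""
    num_frames = len(waveform) // frame
    return [[random.random() for _ in range(dim)] for _ in range(num_frames)]

def transformer_decoder(embeddings):
    """Stand-in decoder: consumes continuous embeddings, emits text."""
    return "<transcript>"  # placeholder transcript

def ar_model(text, num_tokens=8, vocab=1024):
    """Stand-in autoregressive model: text -> discrete latent token ids
    (the 'semantic and acoustic tokens' the codec operates on)."""
    return [random.randrange(vocab) for _ in range(num_tokens)]

def codec_decode(tokens, samples_per_token=320):
    """Stand-in neural codec decoder: latent tokens -> waveform samples."""
    return [random.random() for _ in range(len(tokens) * samples_per_token)]

audio = [random.random() for _ in range(16000)]   # 1 s of fake 16 kHz audio
text = transformer_decoder(audio_encoder(audio))  # understanding: audio -> text
wave = codec_decode(ar_model(text))               # generation: text -> audio
print(len(audio_encoder(audio)), len(wave))       # 100 embedding frames, 2560 samples
```

The asymmetry is the point of the episode's distinction: the input side can stay continuous because the transformer only reads the embeddings, but the output side needs a discrete token interface so the autoregressive model has something to predict, which is why the codec only appears in the generation pipeline.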
The second half covers Forge, Mistral’s enterprise training platform announced at GTC — the same internal tooling Mistral’s science team uses for continued pretraining, SFT, and RLHF, now offered to customers to fine-tune models on proprietary data. Guillaume argues that enterprise customers using closed-source models are leaving enormous value on the table by not leveraging domain-specific datasets they have accumulated for years, and that fine-tuning on that data can produce models that dramatically outperform general-purpose alternatives for specialized tasks, including building models where a target language represents 50% of the training mix rather than under 1%.
📺 Source: Latent Space · Published March 30, 2026
🏷️ Format: Interview
