NVIDIA’s MagpieTTS Multilingual: One AI Voice, 9 Languages: Run Locally


Description:

Fahd Mirza installs and demonstrates MagpieTTS, NVIDIA’s new multilingual text-to-speech model built on the NeMo framework, testing it live across nine languages: English, German, French, Spanish, Italian, Vietnamese, Mandarin, Hindi, and Japanese. At roughly 357 million parameters, MagpieTTS is notably compact, requiring just over 3 GB of VRAM on the H100 used in the video and light enough to run on CPU without a GPU at all.
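A quick back-of-envelope check makes the "compact" claim concrete. The arithmetic below is illustrative, not from the video: it assumes fp16/bf16 weights (2 bytes per parameter) and attributes the gap up to the observed ~3 GB to activations, decoder state, and the NanoCodec vocoder.

```python
# Rough memory footprint of a ~357M-parameter model.
PARAMS = 357e6

# fp16/bf16: 2 bytes per parameter (assumption, not stated in the video)
weights_fp16_gb = PARAMS * 2 / 1024**3
# fp32: 4 bytes per parameter
weights_fp32_gb = PARAMS * 4 / 1024**3

print(f"fp16 weights alone: {weights_fp16_gb:.2f} GB")  # ~0.66 GB
print(f"fp32 weights alone: {weights_fp32_gb:.2f} GB")  # ~1.33 GB
```

Even in full fp32 the weights stay well under 2 GB, which is why CPU-only inference is plausible; the ~3 GB VRAM figure observed on the H100 would include runtime buffers on top of the weights.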

The architecture combines a non-autoregressive transformer text encoder with an autoregressive decoder that predicts discrete audio codec tokens across eight parallel codebooks. Those tokens are converted to a waveform at 22kHz via NanoCodec, NVIDIA’s neural audio codec. Quality is improved through attention priors, classifier-free guidance (CFG), and reinforcement learning via Group Relative Policy Optimization (GRPO), with an optional local transformer refinement stage layered on top of the primary decoder. The Gradio-based demo interface is launched locally at port 7860, with setup taking only a few minutes via a Conda environment and the NeMo GitHub repository.
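The decoding scheme described above can be sketched in miniature: at each autoregressive step the decoder emits one discrete token per codebook, so a frame is a row of eight tokens, and the full output is a frames-by-codebooks grid that a neural codec turns into audio. Everything below is a toy illustration; the function names, codebook vocabulary size, and random "predictor" are assumptions, not the NeMo implementation.

```python
import random

N_CODEBOOKS = 8        # parallel codebooks per frame (from the video)
CODEBOOK_SIZE = 1024   # illustrative per-codebook vocabulary size (assumption)

def predict_frame(history, rng):
    """Toy stand-in for the autoregressive decoder: given all previously
    generated frames, emit one token per codebook for the next frame."""
    return [rng.randrange(CODEBOOK_SIZE) for _ in range(N_CODEBOOKS)]

def decode(n_frames, seed=0):
    rng = random.Random(seed)
    frames = []
    for _ in range(n_frames):
        # One autoregressive step produces 8 tokens at once, one per codebook.
        frames.append(predict_frame(frames, rng))
    return frames

tokens = decode(n_frames=5)
# A neural codec (NanoCodec in MagpieTTS) would map this token grid
# back to a 22 kHz waveform.
print(len(tokens), len(tokens[0]))  # 5 frames x 8 codebooks
```

The point of the parallel-codebook layout is that the sequence length grows with audio frames, not frames times codebooks, which keeps autoregressive generation short; the optional local transformer mentioned above then refines tokens within each frame.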

Mirza notes an important caveat: all speakers in the training dataset are English-native, which can introduce noticeable accents when synthesizing lower-resource languages like Vietnamese. Each language output is played back live, with Mirza inviting native speakers to evaluate quality in the comments. The video is a practical introduction to NVIDIA’s growing open-source speech synthesis capabilities and a useful starting point for developers building multilingual voice applications.


📺 Source: Fahd Mirza · Published March 07, 2026
🏷️ Format: Hands On Build
