VoxCPM2 – Free TTS Model That Clones Voices, Designs New Ones & Speaks 30 Languages Locally



Descriptions:

VoxCPM2 is an open-source text-to-speech model that can clone voices, synthesize entirely new voices from plain-text descriptions, and generate speech in 30 languages, all running locally on consumer or prosumer hardware. In this hands-on walkthrough, Fahd Mirza installs the model on an Ubuntu server equipped with an NVIDIA A6000 GPU (48GB VRAM) and walks through the full setup: creating the conda environment, cloning the repo, and launching the Gradio web UI.

The video covers three test scenarios: a zero-configuration TTS pass; a voice design test in which a “deep, dramatic movie trailer voice” is synthesized from a text prompt alone, with no reference audio; and a voice cloning test that uses a short personal recording together with an emotion-control instruction (cheerful and energetic). Results are reported candidly. The voice design output is striking and the basic TTS is clean and fast, but the emotion-guided cloning test produces flat, monotonous output that falls short of the advertised control. VRAM consumption sits at roughly 45GB during inference, and generation is noticeably faster than with the previous VoxCPM release.
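To put the reported memory figure in context, here is a minimal sketch of the headroom arithmetic, assuming the roughly 45GB observed in the video and the 48GB capacity of the A6000. The function name `vram_headroom` is hypothetical and is not part of VoxCPM2 or any library. The 45GB figure is what the video observed, which is not necessarily the minimum the model needs.

```python
def vram_headroom(total_gb: float, used_gb: float) -> tuple[float, float]:
    """Return (free GB, utilization fraction) for a GPU.

    Hypothetical helper for illustration only; the inputs below are the
    figures reported in the video, not measurements made by this code.
    """
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    free_gb = total_gb - used_gb
    utilization = used_gb / total_gb
    return free_gb, utilization


# Figures from the walkthrough: 48 GB card, roughly 45 GB used during inference.
free_gb, utilization = vram_headroom(total_gb=48.0, used_gb=45.0)
print(f"{free_gb:.1f} GB free, {utilization:.1%} of VRAM in use")
```

On those numbers the card is left with about 3GB free, so the test setup in the video was running close to its memory limit.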

For developers building multilingual voice pipelines or exploring local TTS alternatives, VoxCPM2 stands out for its ability to generate voices without reference audio. The model handles languages ranging from Arabic, Japanese, and Hindi to multiple Chinese dialects without requiring language tags; it infers the target language automatically. The video includes the GitHub repo link and enough setup detail to reproduce the results independently.


📺 Source: Fahd Mirza · Published April 08, 2026
🏷️ Format: Tutorial Demo
