Scenema Audio: AI Voice That Actually Performs – Rage, Grief, Joy in One Generation Locally

Fahd Mirza installs and tests Scenema Audio, a new expressive text-to-speech model extracted from LTX Video 2.3’s 22-billion-parameter audio-visual model, which was trained on real film footage rather than studio recordings. Unlike conventional TTS systems that produce smooth but emotionally flat speech, Scenema Audio accepts XML-style action tags embedded directly in a script to shift emotional delivery mid-generation, so a single uninterrupted audio pass can move from rage to grief to a forced laugh.
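To make the idea of tags embedded in a script concrete, here is a minimal Python sketch that splits a tagged script into (direction, spoken text) pairs. The `<action>` tag name, the example directions, and the `split_tagged_script` helper are all assumptions for illustration; the model's real tag vocabulary and parsing may differ, and the model itself consumes the whole script in one pass rather than pre-split segments.

```python
import re

# Hypothetical tag name, chosen for illustration only.
ACTION_TAG = re.compile(r"<action>(.*?)</action>", re.DOTALL)


def split_tagged_script(script):
    """Return (direction, spoken_text) pairs in script order.

    Text that appears before the first tag gets a direction of None.
    """
    segments = []
    direction = None
    pos = 0
    for match in ACTION_TAG.finditer(script):
        # Everything between the previous tag and this one is spoken text
        # delivered under the previous direction.
        spoken = script[pos:match.start()].strip()
        if spoken:
            segments.append((direction, spoken))
        direction = match.group(1).strip()
        pos = match.end()
    tail = script[pos:].strip()
    if tail:
        segments.append((direction, tail))
    return segments


script = (
    "<action>shouting in rage</action> You lied to me! "
    "<action>voice breaking with grief</action> I trusted you. "
    "<action>forced laugh</action> Ha. Of course."
)
segments = split_tagged_script(script)
for direction, spoken in segments:
    print(f"[{direction}] {spoken}")
```

A helper like this is mainly useful for previewing or validating a script before generation, for example to confirm that every line has a direction attached.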

Mirza runs the system locally on an Ubuntu server equipped with an NVIDIA RTX 6000 (48GB VRAM) using Docker Compose, noting the full-precision model consumes approximately 21GB of VRAM at runtime. The underlying pipeline chains five specialized models in sequence: Google’s Gemma 3 12B instruction-tuned language model for prompt conditioning, an audio diffusion transformer (the core generative engine), a Mel-Band RoFormer for stripping environmental sound from the vocal track, Seed-VC for voice identity transfer, and Kokoro for sentence-boundary splitting on long-form generation. Final output is 48 kHz stereo audio.
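The chaining described above can be sketched as an ordered list of stages that each transform a shared state. This is only a structural illustration: the stage identifiers, the `Stage` class, and `run_pipeline` are hypothetical names, each stage is a placeholder that records that it ran rather than calling a model, and the order simply follows the list in the description, which may not match the actual execution order.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str                      # hypothetical identifier, not an official model ID
    role: str                      # role as given in the description
    run: Callable[[dict], dict]    # takes the pipeline state, returns a new state


def make_placeholder(name: str) -> Callable[[dict], dict]:
    """Build a stand-in stage that only appends its name to the trace."""
    def run(state: dict) -> dict:
        return {**state, "trace": state["trace"] + [name]}
    return run


STAGE_SPECS = [
    ("gemma-3-12b-it", "prompt conditioning"),
    ("audio-diffusion-transformer", "core audio generation"),
    ("mel-band-roformer", "strip environmental sound from the vocal track"),
    ("seed-vc", "voice identity transfer"),
    ("kokoro", "sentence-boundary splitting for long-form text"),
]
STAGES: List[Stage] = [
    Stage(name, role, make_placeholder(name)) for name, role in STAGE_SPECS
]


def run_pipeline(script: str, stages: List[Stage]) -> dict:
    """Pass the state through every stage in order and return the final state."""
    state = {"script": script, "trace": []}
    for stage in stages:
        state = stage.run(state)
    return state


result = run_pipeline("You lied to me!", STAGES)
for stage in STAGES:
    print(f"{stage.name}: {stage.role}")
```

Keeping each stage behind the same callable interface is one way to swap a component, such as a quantized variant, without touching the rest of the chain.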

The video includes live generation tests in English, Arabic (Egyptian accent), and Polish, plus a voice cloning test using a provided reference audio file. Mirza walks through the Gradio interface, generation parameters, quantization options for lower-VRAM setups, and commentary on output quality across languages. It serves as a practical guide for developers and researchers looking to run emotion-aware, locally hosted speech synthesis without relying on external APIs.


📺 Source: Fahd Mirza · Published May 18, 2026
🏷️ Format: Tutorial Demo