DramaBox – Run Most Expressive TTS with Voice Cloning Locally

Descriptions:

Fahd Mirza takes a hands-on look at DramaBox, a newly released expressive text-to-speech model that can be run locally on consumer-grade hardware. DramaBox is a fine-tune of Lyra X LTX 2, built on a 3.3-billion-parameter audio-only diffusion transformer trained with flow matching and conditioned on text embeddings from the 12-billion-parameter Gemma 3 model. The architecture pairs the diffusion transformer backbone with an audio variational autoencoder and a vocoder, enabling nuanced control over delivery, including pauses, emotional shifts, and mid-sentence tonal changes.

What sets DramaBox apart from standard TTS systems is its treatment of prompts as performance scripts. Dialogue goes inside double quotes and is spoken literally, while everything outside the quotes functions as a stage direction: instructions like ‘his voice fills with genuine indignation’ or ‘she pauses, exhausted’ shape the delivery without being spoken aloud. Mirza demonstrates this across multiple test cases, including a male overconfidence monologue, a female voice clone reacting to durian fruit, and a Freudian-style female character.
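
To make the quoting convention concrete, here is a minimal Python sketch that splits a prompt into spoken dialogue and stage directions. It only illustrates the convention described above and is not DramaBox's own parser. The function name `split_prompt` and the dialogue lines are hypothetical, and the sketch assumes straight double quotes with no nesting.

```python
import re

def split_prompt(prompt: str):
    """Split a performance-script prompt into (kind, text) segments.

    Hypothetical helper: text inside straight double quotes is tagged
    "spoken"; everything outside the quotes is tagged "direction".
    """
    segments = []
    pos = 0
    for match in re.finditer(r'"([^"]*)"', prompt):
        # Text between the previous quote and this one is a stage direction.
        direction = prompt[pos:match.start()].strip()
        if direction:
            segments.append(("direction", direction))
        segments.append(("spoken", match.group(1).strip()))
        pos = match.end()
    # Anything after the last closing quote is also a direction.
    tail = prompt[pos:].strip()
    if tail:
        segments.append(("direction", tail))
    return segments

# Illustrative prompt; the quoted dialogue is invented for this example.
example = (
    'His voice fills with genuine indignation. '
    '"I have never been wrong about this." '
    'She pauses, exhausted. "Fine."'
)
segments = split_prompt(example)
for kind, text in segments:
    print(f"{kind:>9}: {text}")
```

Running a prompt through a check like this before synthesis is one way to confirm that a stage direction has not slipped inside the quotes, where it would be read aloud.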

The video covers installation on Ubuntu with an NVIDIA RTX 6000 (48GB VRAM), with the model consuming just over 16GB of VRAM at runtime and the full weights coming in at roughly 26GB. Mirza’s honest assessment: expressive range is noticeably improved over older TTS models, but voice cloning fidelity still falls short of the best alternatives, and output retains a slightly synthetic quality under close listening.


📺 Source: Fahd Mirza · Published May 13, 2026
🏷️ Format: Tutorial Demo
