Engineering voice agents: Latency, quality, and scale — Rishabh Bhargava, Together AI

Foundation Models · 2 months ago

Description:

Rishabh Bhargava, who leads the voice AI team at Together AI and previously co-founded Refuel (acquired by Together), delivers a practical engineering breakdown of production voice agents at the AI Engineer conference. The talk targets teams moving beyond demos toward reliable, scalable voice deployments handling hundreds or thousands of concurrent calls.

The architecture covered is the dominant production pattern: a cascading pipeline of speech-to-text (STT), an LLM, and a text-to-speech (TTS) engine, coordinated by an orchestrator (such as Pipecat), with audio streamed in chunks. Bhargava defines the key metrics engineers must track: time to first audio (TTFA), which is how quickly the first audio chunk begins streaming after transcript receipt, and real-time factor (RTF), the ratio of generation time to audio duration, where a value below 1.0 means audio is produced faster than it plays. Practical thresholds are grounded in human conversation data: people respond in roughly 300 milliseconds, begin noticing delays beyond 500 milliseconds, and tend to hang up once delays reach one to two seconds. The LLM stage consumes the majority of both the latency and cost budgets, followed by TTS, then STT.
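As a rough illustration of the two metrics, the sketch below computes TTFA and RTF from timestamps recorded around a single TTS request and buckets a response delay using the thresholds quoted above. The `SynthesisTiming` record, the function names, and the bucket labels are assumptions made for this example; they do not come from the talk or from any particular library.

```python
from dataclasses import dataclass


@dataclass
class SynthesisTiming:
    """Timestamps in seconds for one TTS request (hypothetical record)."""
    text_received_at: float   # when the TTS stage received its input text
    first_chunk_at: float     # when the first audio chunk was emitted
    finished_at: float        # when the last audio chunk was emitted
    audio_duration_s: float   # playback length of the generated audio


def time_to_first_audio(t: SynthesisTiming) -> float:
    """TTFA: delay between receiving input and emitting the first audio chunk."""
    return t.first_chunk_at - t.text_received_at


def real_time_factor(t: SynthesisTiming) -> float:
    """RTF: generation time divided by playback time; below 1.0 beats real time."""
    if t.audio_duration_s <= 0:
        raise ValueError("audio_duration_s must be positive")
    return (t.finished_at - t.text_received_at) / t.audio_duration_s


def classify_response_delay(delay_s: float) -> str:
    """Bucket an end-to-end response delay using the thresholds quoted above."""
    if delay_s <= 0.5:
        return "natural"        # at or under the ~500 ms point where delays get noticed
    if delay_s < 1.0:
        return "noticeable"     # callers perceive the pause
    return "hang-up risk"       # one second or more


if __name__ == "__main__":
    timing = SynthesisTiming(
        text_received_at=0.0,
        first_chunk_at=0.18,
        finished_at=1.5,
        audio_duration_s=3.0,
    )
    print(f"TTFA: {time_to_first_audio(timing) * 1000:.0f} ms")
    print(f"RTF:  {real_time_factor(timing):.2f}")
    print(f"0.7 s delay is: {classify_response_delay(0.7)}")
```

In a streaming pipeline an RTF below 1.0 matters because playback can begin as soon as the first chunk arrives without later chunks falling behind the listener.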

The talk also covers TTS quality dimensions — voice naturalness, emotional control via prosody tags, name pronunciation, and multilingual coverage — as well as auto-scaling strategy, where teams typically bias aggressively toward scale-up to avoid request queuing. A section on next-generation architectures previews end-to-end speech models that may eventually collapse the pipeline into a single model.
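One way to picture the scale-up bias is an asymmetric policy: add replicas immediately when demand rises, but remove them slowly and only after a cooldown. The sketch below is a minimal illustration under stated assumptions, not the policy described in the talk; the `ScalePolicy` fields, their default values, and the one-replica-at-a-time scale-down rule are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalePolicy:
    """Hypothetical tuning knobs; every default here is an assumed value."""
    calls_per_replica: int = 20            # assumed capacity of one replica
    headroom: float = 0.3                  # assumed spare capacity for bursts
    scale_down_cooldown_s: float = 300.0   # assumed wait before removing replicas
    min_replicas: int = 1


def desired_replicas(active_calls: int, policy: ScalePolicy) -> int:
    """Replicas needed to serve the current calls plus the configured headroom."""
    needed = math.ceil(active_calls * (1 + policy.headroom) / policy.calls_per_replica)
    return max(policy.min_replicas, needed)


def next_replica_count(
    current: int,
    active_calls: int,
    seconds_since_last_scale_up: float,
    policy: ScalePolicy,
) -> int:
    """Scale up at once; scale down one replica at a time after the cooldown."""
    target = desired_replicas(active_calls, policy)
    if target > current:
        return target          # jump straight to the target to avoid queuing
    if target < current and seconds_since_last_scale_up >= policy.scale_down_cooldown_s:
        return current - 1     # shed capacity gradually
    return current


if __name__ == "__main__":
    policy = ScalePolicy()
    print(next_replica_count(5, 100, 0.0, policy))    # demand spike: scale up now
    print(next_replica_count(7, 10, 10.0, policy))    # demand dropped, cooldown active
    print(next_replica_count(7, 10, 600.0, policy))   # cooldown elapsed: remove one
```

The asymmetry trades some idle capacity for protection against queued requests, which in a voice call show up directly as added response delay.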

📺 Source: AI Engineer · Published May 31, 2026
🏷️ Format: Deep Dive

Tags: Cursor, Nvidia, OpenAI, Together AI, Whisper
