Engineering voice agents: Latency, quality, and scale — Rishabh Bhargava, Together AI

Descriptions:

Rishabh Bhargava, who leads the voice AI team at Together AI and previously co-founded Refuel (acquired by Together), delivers a practical engineering breakdown of production voice agents at the AI Engineer conference. The talk targets teams moving beyond demos toward reliable, scalable voice deployments handling hundreds or thousands of concurrent calls.

The architecture covered is the dominant production pattern: a cascading pipeline of speech-to-text (STT), an LLM, and a text-to-speech (TTS) engine, coordinated by an orchestrator such as Pipecat, with audio streamed in chunks. Bhargava defines the key metrics engineers must track: time to first audio (TTFA), which measures how quickly the first audio chunk begins streaming after the transcript is received, and real-time factor (RTF), the ratio of generation time to audio duration, where a value below 1.0 means audio is produced faster than it plays. Practical thresholds are grounded in human conversation data: people respond in roughly 300 milliseconds, begin noticing delays beyond 500 milliseconds, and hang up once delays reach one to two seconds. The LLM stage consumes the majority of both the latency and cost budgets, followed by TTS, then STT.
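As a rough illustration of the two metrics above, the following Python sketch computes TTFA and RTF from timestamps recorded around a single TTS request and buckets a response delay against the human-conversation thresholds. The `TtsTiming` class, its field names, and the bucket labels are hypothetical and not taken from the talk; the exact cutoffs are one possible reading of the 300 ms, 500 ms, and one-to-two-second figures.

```python
from dataclasses import dataclass


@dataclass
class TtsTiming:
    """Timestamps (in seconds) for one TTS request. Hypothetical structure."""
    request_sent_s: float    # when the text was handed to the TTS engine
    first_chunk_s: float     # when the first audio chunk started streaming
    last_chunk_s: float      # when the final audio chunk arrived
    audio_duration_s: float  # playback length of the generated audio


def time_to_first_audio(t: TtsTiming) -> float:
    """TTFA: delay between sending the request and the first audio chunk."""
    return t.first_chunk_s - t.request_sent_s


def real_time_factor(t: TtsTiming) -> float:
    """RTF: generation time divided by audio duration (below 1.0 is faster than playback)."""
    if t.audio_duration_s <= 0:
        raise ValueError("audio_duration_s must be positive")
    return (t.last_chunk_s - t.request_sent_s) / t.audio_duration_s


def classify_response_delay(delay_s: float) -> str:
    """Bucket an end-to-end response delay. Cutoffs are assumptions, see lead-in."""
    if delay_s <= 0.3:
        return "human-like"      # roughly how fast people respond
    if delay_s <= 0.5:
        return "acceptable"      # below the point where delays get noticed
    if delay_s < 1.0:
        return "noticeable"      # callers begin to notice the pause
    return "hang-up risk"        # one to two seconds and beyond
```

In practice the timestamps would come from the streaming client, and the delay passed to `classify_response_delay` would be the full turn latency (STT plus LLM plus TTS), not the TTS stage alone.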

The talk also covers TTS quality dimensions — voice naturalness, emotional control via prosody tags, name pronunciation, and multilingual coverage — as well as auto-scaling strategy, where teams typically bias aggressively toward scale-up to avoid request queuing. A section on next-generation architectures previews end-to-end speech models that may eventually collapse the pipeline into a single model.
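One way to picture the scale-up bias mentioned above is an asymmetric replica policy: jump straight to the required capacity (plus headroom) when load rises, but release capacity only one step at a time when it falls. The sketch below is an assumption for illustration, not the policy described in the talk; `desired_replicas`, `headroom_pct`, and `scale_down_step` are hypothetical names and defaults.

```python
def desired_replicas(
    current: int,
    concurrent_calls: int,
    calls_per_replica: int,
    headroom_pct: int = 30,    # extra capacity kept free to absorb bursts (assumed)
    scale_down_step: int = 1,  # max replicas removed per evaluation (assumed)
    min_replicas: int = 1,
) -> int:
    """Return the replica count for the next interval, biased toward scale-up."""
    if calls_per_replica <= 0:
        raise ValueError("calls_per_replica must be positive")

    # Ceiling division in integers: replicas needed to serve the load plus headroom.
    numerator = concurrent_calls * (100 + headroom_pct)
    denominator = 100 * calls_per_replica
    needed = max(-(-numerator // denominator), min_replicas)

    if needed > current:
        # Scale up in a single jump so requests do not queue.
        return needed

    # Scale down slowly, never dropping below what the load requires.
    return max(needed, current - scale_down_step)
```

With these defaults, 100 concurrent calls at 10 calls per replica moves a 4-replica deployment directly to 13, while a drop to 20 calls shrinks a 13-replica deployment by only one replica per evaluation.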


📺 Source: AI Engineer · Published May 31, 2026
🏷️ Format: Deep Dive
