Voice In, Visuals Out: The Agony and the Ecstasy – Allen Pike, Forestwalk Labs

Descriptions:

Allen Pike of Forestwalk Labs delivers a practical engineering talk on building what Andrej Karpathy has called “voice in, visuals out” experiences — AI interfaces where users speak naturally and receive visual responses rather than text. Pike argues this design pattern resolves a fundamental mismatch: voice carries far higher bandwidth than typing (more words per minute, plus tone and emphasis), while visual output is far more information-dense than synthesized speech.

The core technical challenge Pike addresses is latency. Full voice-to-voice conversation requires sub-200ms round-trip response to feel natural, an essentially impossible bar given current network, speech-to-text, inference, and text-to-speech pipeline costs. Visual output, however, has a much more forgiving envelope — responses appearing within one second still feel responsive. Forestwalk’s in-call voice agent, which files Linear issues and takes action on incidental speech during meetings, exploits this asymmetry. Pike shares a counterintuitive finding from production: GPT-4o mini, despite being a small model, showed P95 latencies of 5,000–10,000ms on standard OpenAI endpoints, making inference platform selection as important as model selection when optimizing for latency.
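The budget arithmetic behind this asymmetry can be sketched in a few lines. This is an illustrative sketch only: the 200 ms and 1,000 ms budgets come from the talk, but the per-stage latencies and the names `stage_ms`, `fits`, and `p95` are hypothetical placeholders, not measurements or code from Forestwalk.

```python
VOICE_BUDGET_MS = 200     # voice-to-voice bar cited in the talk
VISUAL_BUDGET_MS = 1000   # visual response window cited in the talk

# Hypothetical per-stage latencies in milliseconds (assumed, not measured).
stage_ms = {"network": 60, "speech_to_text": 150, "inference": 400}

# A voice-to-voice pipeline adds a text-to-speech stage on top (also assumed).
voice_stage_ms = {**stage_ms, "text_to_speech": 120}


def fits(budget_ms: int, stages: dict) -> bool:
    """Return True if the summed stage latencies stay within the budget."""
    return sum(stages.values()) <= budget_ms


def p95(samples: list) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latency samples."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


visual_ok = fits(VISUAL_BUDGET_MS, stage_ms)       # 610 ms against 1,000 ms
voice_ok = fits(VOICE_BUDGET_MS, voice_stage_ms)   # 730 ms against 200 ms
```

Tracking `p95` rather than the mean matters here because a tail in the 5,000–10,000 ms range, like the one Pike reports, can sit behind an acceptable-looking average.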

Three practical lessons close the talk: use inference providers that optimize for latency over throughput, target the forgiving visual response window instead of chasing voice-to-voice, and stream early, rendering the start of a visual response before the full answer is generated so that something appears within the user’s attention span. Pike also references Thinking Machines and Neolab’s 200ms time-sliced continuous inference architecture as a promising voice-to-voice approach for teams that need it.
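The "stream early" lesson can be illustrated with a minimal sketch: repaint the visual as each chunk arrives instead of waiting for the complete answer. Everything here is assumed for illustration; `fake_model_stream` stands in for a streaming inference response, and `render` stands in for whatever UI update an application would perform. Neither is from the talk.

```python
from typing import Callable, Iterable, Iterator, List


def stream_render(chunks: Iterable[str], render: Callable[[str], None]) -> str:
    """Call render with the accumulated text each time a new chunk arrives."""
    shown = ""
    for chunk in chunks:
        shown += chunk
        render(shown)  # repaint with everything received so far
    return shown


def fake_model_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streamed model response."""
    yield "Filing issue: "
    yield "Fix login bug "
    yield "(priority: high)"


frames: List[str] = []  # each entry is one repaint of the visual
final = stream_render(fake_model_stream(), frames.append)
```

With this shape, the first repaint happens as soon as the first chunk lands, so perceived latency is the time to first chunk rather than the time to the full response.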


📺 Source: AI Engineer · Published June 28, 2026
🏷️ Format: Hands On Build