Description:
Neil Zeghidour, co-founder of Gradian AI and creator of Moshi (the first full-duplex speech-to-speech model), gives a technically grounded assessment of where voice AI actually stands relative to the ‘Her moment’ — the benchmark of genuinely natural, human-feeling conversational voice set by the 2013 film. Gradian spun out of a non-profit lab funded by Eric Schmidt, Rodolphe Saadé, and Xavier Niel, and its portfolio includes Moshi, Pocket TTS (a CPU-optimized TTS model), and voice cloning from as little as 10 seconds of audio.
The core technical argument is architectural: virtually every deployed voice AI system — including the best offerings from ElevenLabs and OpenAI’s Advanced Voice Mode — is half-duplex. The model is either listening or speaking; it cannot handle simultaneous speech, natural interruptions, or back-channeling. In Japanese conversation, up to 20% of dialogue involves overlapping speech, and back-channeling (continuous ‘mhm’ affirmations) is a sign of active listening. Half-duplex systems break on all of these, making them feel robotic regardless of voice quality. Moshi is presented as the only production model that has crossed into full-duplex territory.
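To make the architectural distinction concrete, here is a minimal Python sketch of the two control flows. All names (`asr`, `llm`, `tts`, `mic`, `speaker`, `model.step`) are hypothetical illustrations, not Moshi's or any vendor's actual API; the point is the shape of the loop, not the implementation.

```python
def half_duplex_loop(asr, llm, tts, mic, speaker):
    """Half-duplex: strict turn-taking. The system is either listening
    or speaking, never both, so overlap and interruptions are lost."""
    while True:
        # Listen until an endpointing heuristic decides the user is done.
        audio = mic.record_until_silence()
        reply = llm.generate(asr.transcribe(audio))
        # Speak. A "mhm" or an interruption arriving now is dropped
        # (or worse, mistaken for the start of a new turn).
        speaker.play(tts.synthesize(reply))


def full_duplex_loop(model, mic, speaker, frame_ms=80):
    """Full-duplex (Moshi-style): a single model ingests the user's
    audio stream and emits its own stream at every frame, so overlapping
    speech and back-channels are just part of the modeled sequence."""
    state = model.init_state()
    while True:
        user_frame = mic.read_frame(frame_ms)           # always listening
        model_frame, state = model.step(user_frame, state)
        speaker.play_frame(model_frame)                 # always speaking;
                                                        # silence is an explicit output
```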
Zeghidour also dissects latency strategies in cascaded systems (STT → LLM → TTS pipelines), including the technique of generating filler speech while the LLM processes to mask response delays. His broader point: speech-to-speech architecture reduces latency but does not solve the human-conversation problem alone — the underlying model intelligence must also be sufficient for the voice interface to feel genuinely useful rather than just smoother-sounding.
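The filler technique is easy to see in code. Below is a hedged sketch of one way to implement it with `asyncio`; `llm`, `tts`, and `speaker` are async interfaces invented for illustration, and the 200 ms threshold is an arbitrary choice, not a figure from the talk.

```python
import asyncio

async def cascaded_turn(user_text, llm, tts, speaker,
                        filler="Hmm, let me think."):
    """One turn of an STT -> LLM -> TTS pipeline that masks LLM latency
    by speaking a canned filler phrase while the answer is generated."""
    # Start the slow LLM call in the background.
    llm_task = asyncio.create_task(llm.generate(user_text))

    # If the answer is not back almost immediately, play filler speech
    # (in practice the filler audio would be pre-synthesized and cached).
    done, _ = await asyncio.wait({llm_task}, timeout=0.2)
    if not done:
        await speaker.play(await tts.synthesize(filler))

    # Speak the real answer as soon as it arrives.
    reply = await llm_task
    await speaker.play(await tts.synthesize(reply))
```

Note that this masks perceived latency rather than removing it, which is exactly the limitation the paragraph above describes: smoother turn handoffs alone do not make the conversation feel human.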
📺 Source: AI Engineer · Published May 09, 2026
🏷️ Format: Deep Dive