Description:
Fahd Mirza covers KAME (Japanese for “turtle”), a new speech-to-speech model from Sakana AI that introduces a tandem architecture designed to resolve the fundamental tradeoff between latency and intelligence in conversational AI. Direct speech-to-speech models like Moshi respond instantly but lack deep knowledge; cascaded systems (speech-to-text → LLM → text-to-speech) are knowledgeable but slow. KAME runs both approaches in parallel: a fast frontend model handles the immediate response while a streaming speech-to-text component simultaneously feeds a backend LLM, configurable as GPT, Gemini, Claude, or others. The backend’s outputs flow back into the frontend transformer as “Oracle signals” in real time, progressively improving response quality without adding perceptible latency.
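The tandem pattern itself is easy to sketch: two concurrent tasks share a channel, the fast path answers immediately, and the slow path’s output is folded in when it arrives. The snippet below is a minimal asyncio illustration of that idea, not KAME’s actual implementation; every name in it (frontend_model, backend_llm, oracle_q) is hypothetical.

```python
# Minimal sketch of the tandem pattern: a fast frontend answers at once,
# while a slower backend "oracle" runs in parallel and refines the answer.
# All names here are illustrative; this is not KAME's API.
import asyncio

async def backend_llm(transcript: str, oracle_q: asyncio.Queue) -> None:
    """Slow but knowledgeable path, standing in for a GPT/Gemini/Claude call."""
    await asyncio.sleep(1.2)  # simulated network + inference latency
    await oracle_q.put(f"[oracle] grounded answer for {transcript!r}")

async def frontend_model(transcript: str, oracle_q: asyncio.Queue) -> None:
    """Fast path: respond instantly, then fold in the oracle signal if one arrives."""
    print("frontend (instant):", f"Let me think about {transcript!r}...")
    try:
        signal = await asyncio.wait_for(oracle_q.get(), timeout=2.0)
        print("frontend (refined):", signal)
    except asyncio.TimeoutError:
        print("frontend: no oracle signal arrived; keeping the fast answer")

async def main() -> None:
    transcript = "what is the capital of Australia?"
    oracle_q: asyncio.Queue = asyncio.Queue()
    # Both paths start together; the listener hears the frontend immediately.
    await asyncio.gather(
        frontend_model(transcript, oracle_q),
        backend_llm(transcript, oracle_q),
    )

asyncio.run(main())
```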
The video walks through a full installation on Ubuntu using the uv package manager. The model runs locally but currently requires both an OpenAI API key and a Google Cloud service account with the Speech-to-Text API enabled, a setup friction Mirza flags as a meaningful barrier. On an Nvidia RTX A6000 with 48 GB of VRAM, the model loads and consumes approximately 18 GB. Live testing shows conversational capability comparable to Moshi, with some responsiveness and turn-taking issues observed during the demo.
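Because the credential setup is the main friction point, a small pre-flight check can save a failed first run. The sketch below assumes the conventional OPENAI_API_KEY and GOOGLE_APPLICATION_CREDENTIALS environment variables; the KAME repo may wire credentials differently, so treat the variable names as assumptions.

```python
# Pre-flight check for the two credentials the demo requires.
# Env var names follow OpenAI / Google Cloud conventions and are an
# assumption here; KAME's own config may differ.
import json
import os
import sys

def check_setup() -> bool:
    ok = True
    if not os.environ.get("OPENAI_API_KEY"):
        print("missing OPENAI_API_KEY (needed for the backend LLM)")
        ok = False
    sa_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    if not sa_path or not os.path.isfile(sa_path):
        print("GOOGLE_APPLICATION_CREDENTIALS must point to a service-account "
              "JSON key for a project with the Speech-to-Text API enabled")
        ok = False
    else:
        try:
            with open(sa_path) as f:
                key = json.load(f)
            if key.get("type") != "service_account":
                print(f"{sa_path} does not look like a service-account key")
                ok = False
        except (OSError, json.JSONDecodeError):
            print(f"could not parse {sa_path} as JSON")
            ok = False
    return ok

if __name__ == "__main__":
    sys.exit(0 if check_setup() else 1)
```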
For developers evaluating real-time voice AI options, the architectural explanation, covering the transformer’s four concurrent streams (audio in, audio out, inner monologue, and Oracle channel), is the clearest technical treatment of the KAME design currently available in video format.
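For intuition, those four streams can be modeled as channels that advance in lockstep at each decoding step, which is what lets a fresh Oracle token condition the very next audio frame out. The sketch below is purely illustrative; the field names and step logic are assumptions, not taken from the KAME paper or code.

```python
# Illustrative-only model of the four concurrent streams; names and the
# step logic are assumptions, not drawn from the KAME paper or code.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class KameStreams:
    audio_in: deque = field(default_factory=deque)         # incoming user speech frames
    audio_out: deque = field(default_factory=deque)        # synthesized reply frames
    inner_monologue: deque = field(default_factory=deque)  # frontend's running text plan
    oracle: deque = field(default_factory=deque)           # backend-LLM guidance tokens

def decode_step(s: KameStreams) -> None:
    """One hypothetical decoding step: all four channels advance together,
    so an oracle token received now can shape the very next output frame."""
    frame = s.audio_in.popleft() if s.audio_in else b""
    hint = s.oracle.popleft() if s.oracle else ""
    s.inner_monologue.append(f"heard {len(frame)} bytes, oracle hint {hint!r}")
    s.audio_out.append(b"\x00" * 640)  # placeholder 20 ms frame (16 kHz, 16-bit mono PCM)

# usage: one step where an oracle token has already arrived
streams = KameStreams(audio_in=deque([b"\x01" * 640]), oracle=deque(["Canberra"]))
decode_step(streams)
print(streams.inner_monologue[-1])  # -> heard 640 bytes, oracle hint 'Canberra'
```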
📺 Source: Fahd Mirza · Published May 12, 2026
🏷️ Format: Tutorial Demo
