From Transcription to Live Music: Gemini’s Audio Stack — Thor Schaeff, Google DeepMind

Descriptions:

Thor Schaeff, developer experience lead for the Gemini API and Google AI Studio at Google DeepMind, walks through the current state of Gemini’s audio stack in a session at AI Engineer. He starts with audio understanding in Gemini 3, which goes beyond transcription to extract speaker identity, emotional tone, language, and timestamps in a single API call. From there he moves to Gemma 4’s audio support on edge devices and to the recently launched Gemini 3.1 Flash Live, a full-duplex, real-time conversational model that handles voice, text, and vision input simultaneously.
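To make the single-call extraction concrete, here is a minimal sketch of how an application might handle the result. It assumes the model is asked to reply with a JSON array; the `Segment` fields, the `EXTRACTION_PROMPT` wording, and the `MM:SS` timestamp format are illustrative assumptions rather than anything shown in the talk, and the actual Gemini API request is left out.

```python
import json
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcript segment (hypothetical field names)."""
    speaker: str
    language: str
    emotion: str
    start: str  # timestamp as "MM:SS" (assumed format)
    end: str
    text: str


# Hypothetical instruction text; the talk does not show the exact prompt.
EXTRACTION_PROMPT = (
    "Transcribe the attached audio. Return a JSON array in which each item "
    "has the keys speaker, language, emotion, start, end, and text. "
    "Use MM:SS for start and end."
)


def parse_segments(response_text: str) -> list[Segment]:
    """Turn the model's JSON reply into typed Segment records."""
    items = json.loads(response_text)
    return [Segment(**item) for item in items]


def to_seconds(timestamp: str) -> int:
    """Convert an "MM:SS" timestamp to whole seconds."""
    minutes, seconds = timestamp.split(":")
    return int(minutes) * 60 + int(seconds)
```

Keeping the reply in one typed structure means speaker, language, emotion, and timing stay aligned per segment, with no need to merge the output of separate transcription, diarization, and sentiment passes.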

The talk includes a live demo of Echo Script, a Gemini 3 Flash Preview application in the Google AI Studio gallery that performs rich audio extraction in one request: it labels speakers by name, identifies languages, flags emotional register, and generates English translations from multilingual audio. A second demo covers Gemini’s approach to speech generation. Rather than selecting from a large library of static voices, developers direct roughly 30 base voices with “director’s notes,” scene-setting and performance instructions that shape accent, tone, and delivery precisely without a separate hard-coded voice for each variation.
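The director’s-notes idea can be sketched as plain prompt construction. The example below assumes the notes are sent as text ahead of the line to be spoken; the `DirectorsNotes` fields and the labeled layout are hypothetical illustrations, not a format confirmed by the talk or the Gemini API.

```python
from dataclasses import dataclass


@dataclass
class DirectorsNotes:
    """Performance instructions for one spoken line (hypothetical fields)."""
    scene: str     # where and when the line is spoken
    accent: str
    tone: str
    delivery: str  # pacing, emphasis, pauses


def build_tts_prompt(notes: DirectorsNotes, line: str) -> str:
    """Prepend scene and performance instructions to the line to be spoken."""
    return (
        f"Scene: {notes.scene}\n"
        f"Accent: {notes.accent}\n"
        f"Tone: {notes.tone}\n"
        f"Delivery: {notes.delivery}\n"
        f"Say: {line}"
    )


notes = DirectorsNotes(
    scene="A late-night radio show, close to the microphone",
    accent="Soft Scottish",
    tone="Warm and unhurried",
    delivery="Slow pace, brief pause before the final word",
)
prompt = build_tts_prompt(notes, "Welcome back, night owls.")
```

The same base voice can then be reused with different notes, which is the point of directing a small set of voices instead of choosing among many fixed ones.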

Schaeff also references Veo 3.1 Lite on the generative media side and explains that all dedicated audio models are now built on top of Gemini 3’s foundational research. The session is aimed at developers building voice and multimodal applications on the Gemini API. It offers a practical map of what is available today in Google AI Studio and of the capabilities of the underlying models.


📺 Source: AI Engineer · Published June 09, 2026
🏷️ Format: Tutorial Demo
