Any-to-Any: Building Native Multimodal Agents – Patrick Löber, Google DeepMind


Descriptions:

Patrick Löber, a member of the technical staff at Google DeepMind working on the Gemini API and AI Studio, delivers a hands-on conference session on building native multimodal agents. He walks through the full capability surface of the Gemini API: text, image, audio, video, code, URLs, and Google Search as inputs; text, images, speech, video, and function calls as outputs. He is also candid about the current architecture. Gemini 3 handles multimodal understanding and outputs text, while specialized native generation models handle the other output types: Nano Banana 2 (also surfaced in the API as Gemini 3.5 Flash Image Preview) creates images and infographics, and a separate text-to-speech model produces audio.

Löber shares specific practical details rarely consolidated in documentation: 1 minute of audio consumes 1,920 tokens against a roughly 1M-token limit (allowing around 9 hours of audio input per call), context caching cuts costs by roughly 90% on repeated queries against the same long file, and YouTube URLs can be passed directly into the API. Code snippets demonstrate the Google AI SDK setup in under ten lines, with a pointer to the Gemini API skill as a shortcut for agent-based workflows.
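The audio figure can be checked with a few lines of arithmetic. The sketch below uses the 1,920 tokens-per-minute rate quoted in the session. The two limit values are assumptions: a round 1,000,000 tokens yields about 8.7 hours, while a limit of 1,048,576 tokens (2^20, assumed here to be what "1M" abbreviates) yields just over 9 hours. The function names are illustrative and not part of any SDK.

```python
# Token-budget arithmetic for audio input, using the rate quoted in the session.
TOKENS_PER_AUDIO_MINUTE = 1_920      # from the talk: 1 minute of audio
ROUND_LIMIT_TOKENS = 1_000_000       # "1M" read as a round number (assumption)
BINARY_LIMIT_TOKENS = 1_048_576      # "1M" read as 2**20 (assumption)


def audio_tokens(minutes: float) -> int:
    """Tokens consumed by a clip of the given length in minutes."""
    return round(minutes * TOKENS_PER_AUDIO_MINUTE)


def max_audio_hours(limit_tokens: int) -> float:
    """Hours of audio that fit in the limit if nothing else uses tokens."""
    return limit_tokens / TOKENS_PER_AUDIO_MINUTE / 60


if __name__ == "__main__":
    print(f"60-minute clip: {audio_tokens(60):,} tokens")
    print(f"Round 1M limit: {max_audio_hours(ROUND_LIMIT_TOKENS):.2f} hours")
    print(f"2**20 limit:    {max_audio_hours(BINARY_LIMIT_TOKENS):.2f} hours")
```

In practice the prompt text and the model's response also draw on the same budget, so the usable audio length is somewhat lower than either ceiling.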

The session builds toward a live NotebookLM clone. The architecture places Gemini 3 in an agentic reasoning loop that calls specialized generation models via function calls, producing text summaries, infographics, and two-speaker podcast audio from mixed inputs such as PDFs, video lectures, and voice memos. The full loop logic, including how the model assesses whether it needs more assets before terminating, gives developers a concrete blueprint for multimodal agent construction across the Gemini ecosystem.
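The control flow of such a loop can be sketched without any API calls. Everything in the block below is a hypothetical stand-in rather than the code shown in the session: `plan_next_call` plays the role of the reasoning model deciding which function to call next, and the three `make_*` functions stand in for the specialized generation models. A real implementation would replace each stub with a model request.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class AgentState:
    sources: List[str]                                  # e.g. PDFs, lectures, voice memos
    assets: Dict[str, str] = field(default_factory=dict)  # generated outputs so far


# Hypothetical stubs for the specialized generation models.
def make_summary(sources: List[str]) -> str:
    return f"summary of {len(sources)} sources"


def make_infographic(sources: List[str]) -> str:
    return "infographic.png"


def make_podcast(sources: List[str]) -> str:
    return "podcast.wav"


TOOLS: Dict[str, Callable[[List[str]], str]] = {
    "summary": make_summary,
    "infographic": make_infographic,
    "podcast": make_podcast,
}


def plan_next_call(state: AgentState) -> Optional[str]:
    """Stub for the reasoning model: pick the next missing asset, or None to stop."""
    for name in TOOLS:
        if name not in state.assets:
            return name
    return None  # self-assessment: nothing more is needed


def run_agent(sources: List[str], max_steps: int = 10) -> Dict[str, str]:
    """Loop until the planner stops asking for assets or the step cap is hit."""
    state = AgentState(sources=list(sources))
    for _ in range(max_steps):
        tool_name = plan_next_call(state)
        if tool_name is None:
            break
        state.assets[tool_name] = TOOLS[tool_name](state.sources)
    return state.assets


if __name__ == "__main__":
    print(run_agent(["paper.pdf", "lecture.mp4", "memo.m4a"]))
```

The `max_steps` cap is a design choice of this sketch: it bounds the loop even if the planner never signals completion.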


📺 Source: AI Engineer · Published May 20, 2026
🏷️ Format: Tutorial Demo
