Any-to-Any: Building Native Multimodal Agents – Patrick Löber, Google DeepMind

Tutorials · 2 months ago

Description:

Patrick Löber, a member of the technical staff at Google DeepMind working on the Gemini API and AI Studio, delivers a hands-on conference session on building native multimodal agents. He walks through the full capability surface of the Gemini API: text, image, audio, video, code, URLs, and Google Search as inputs, and text, images, speech, video, and function calls as outputs. He is also candid about how the work is currently split across models: Gemini 3 handles multimodal understanding and outputs text, specialized native generation models such as Nano Banana 2 (also surfaced in the API as Gemini 3.5 Flash Image Preview) handle image and infographic creation, and a separate text-to-speech model produces speech.

Löber shares practical details that are rarely consolidated in the documentation: 1 minute of audio consumes 1,920 tokens against a 1M-token limit (allowing over 9 hours of audio input per call), context caching cuts costs by roughly 90% on repeated queries against the same long file, and YouTube URLs can be passed directly into the API. Code snippets show the Google AI SDK setup in under ten lines, with a pointer to the Gemini API skill as a shortcut for agent-based workflows.
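The audio and caching figures above can be checked with a little arithmetic. The sketch below is not from the session. It assumes that "1M" means 2**20 = 1,048,576 tokens (a flat 1,000,000 gives a little under 9 hours), and it models the roughly 90% saving as a 90% discount on the file's tokens for every query after the first, ignoring any cache storage fees. The function names are hypothetical.

```python
# Figure quoted in the session summary: tokens consumed per minute of audio.
TOKENS_PER_AUDIO_MINUTE = 1_920

# Assumption: the "1M" limit is 2**20 tokens; the decimal value is kept for comparison.
CONTEXT_LIMIT_BINARY = 2**20        # 1,048,576
CONTEXT_LIMIT_DECIMAL = 1_000_000


def max_audio_hours(context_limit: int, reserved_tokens: int = 0) -> float:
    """Hours of audio that fit in the window after reserving tokens for other inputs."""
    usable = max(context_limit - reserved_tokens, 0)
    return usable / TOKENS_PER_AUDIO_MINUTE / 60


def repeated_query_cost(file_tokens: int, queries: int, cached_discount: float = 0.90):
    """Return (uncached, cached) input cost, measured in full-price token units.

    Assumption: the first query pays full price for the file, and each later
    query pays (1 - cached_discount) of it. Cache storage fees are ignored.
    """
    uncached = float(file_tokens * queries)
    cached = file_tokens + file_tokens * (1 - cached_discount) * (queries - 1)
    return uncached, cached


if __name__ == "__main__":
    print(f"binary 1M:  {max_audio_hours(CONTEXT_LIMIT_BINARY):.2f} hours")
    print(f"decimal 1M: {max_audio_hours(CONTEXT_LIMIT_DECIMAL):.2f} hours")
    uncached, cached = repeated_query_cost(file_tokens=100_000, queries=10)
    print(f"10 queries: {1 - cached / uncached:.0%} saved with caching")
```

Under these assumptions the saving only approaches 90% as the number of repeated queries grows, because the first query is always billed in full.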

The session builds toward a live NotebookLM clone. The architecture places Gemini 3 in an agentic reasoning loop that uses function calls to invoke specialized generation models, producing text summaries, infographics, and two-speaker podcast audio from mixed inputs such as PDFs, video lectures, and voice memos. The full loop logic, including how the model self-assesses whether it needs more assets before terminating, gives developers a concrete blueprint for multimodal agent construction across the Gemini ecosystem.
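The shape of such a loop can be sketched without any API calls. The example below is an illustration only, not the code shown in the session. Every name in it is hypothetical, the generation tools are local stubs standing in for the specialized models, and `plan_next_call` stands in for the reasoning model's decision about whether more assets are needed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class NotebookState:
    """Inputs plus whatever assets the agent has produced so far."""
    sources: List[str]
    assets: Dict[str, str] = field(default_factory=dict)


# Hypothetical stubs: in a real agent each would call a generation model.
def make_summary(state: NotebookState) -> str:
    return f"summary of {len(state.sources)} sources"


def make_infographic(state: NotebookState) -> str:
    return "infographic.png"


def make_podcast(state: NotebookState) -> str:
    return "podcast.wav"


TOOLS: Dict[str, Callable[[NotebookState], str]] = {
    "summary": make_summary,
    "infographic": make_infographic,
    "podcast": make_podcast,
}


def plan_next_call(state: NotebookState) -> Optional[str]:
    """Stand-in for the reasoning model's self-assessment.

    Returns the name of the next tool to call, or None when every asset
    exists. A real agent would get this decision from a function-call response.
    """
    for name in TOOLS:
        if name not in state.assets:
            return name
    return None


def run_agent(sources: List[str], max_steps: int = 10) -> NotebookState:
    """Loop until the planner reports that no further assets are needed."""
    state = NotebookState(sources=list(sources))
    for _ in range(max_steps):  # hard cap so a confused planner cannot loop forever
        tool_name = plan_next_call(state)
        if tool_name is None:
            break
        state.assets[tool_name] = TOOLS[tool_name](state)
    return state


if __name__ == "__main__":
    result = run_agent(["paper.pdf", "lecture.mp4", "memo.m4a"])
    print(result.assets)
```

Keeping the planner separate from the tool registry means the stub can later be swapped for a real model call without touching the loop, and the step cap bounds cost if the termination check never fires.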

📺 Source: AI Engineer · Published May 20, 2026
🏷️ Format: Tutorial Demo

Channels

AI Engineer

Companies

DeepMind

Tags

DeepMind, Gemini 3 Flash, Gemini 3 Pro, Gemini 3.1 Flash Live, Gemini API, Gemma 4, Google AI Studio, Nano Banana, NotebookLM
