Accelerating AI on Edge — Chintan Parikh and Weiyi Wang, Google DeepMind

Description:

Chintan Parikh, product manager for LiteRT at Google AI Edge, and Weiyi Wang from Google DeepMind present at AI Engineer on deploying Gemma 4 models directly on device. The session introduces two new edge-optimized variants: Gemma 4 E2B, requiring roughly 1–2 GB of RAM post-quantization and suited for voice interfaces and low-latency local summarization, and Gemma 4 E4B, targeting laptops and IoT devices that can handle a higher RAM footprint. Both models ship with capabilities previously absent from edge deployments: native function calling for local API interactions, structured JSON output built into the model architecture (rather than achieved via prompt engineering), and a chain-of-thought thinking mode for more complex reasoning tasks.
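Concretely, the structured-output capability means an on-device call can ask for JSON directly instead of coaxing it out with prompt tricks. Below is a minimal sketch assuming the MediaPipe LLM Inference API, a common Android entry point for Google AI Edge models; the model file path and the prompt are illustrative, not taken from the talk.

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Minimal sketch: load a quantized Gemma 4 E2B model from local storage and
// request structured JSON output. The file path and prompt are hypothetical.
fun summarizeAsJson(context: Context, note: String): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/gemma-4-e2b.task") // assumed model file
        .setMaxTokens(256)
        .build()
    val llm = LlmInference.createFromOptions(context, options)
    // With JSON output built into the model, a plain instruction should
    // come back as parseable JSON without elaborate prompt engineering.
    return llm.generateResponse(
        "Summarize the following note as JSON with keys \"title\" and " +
            "\"summary\":\n$note"
    )
}
```

The entire round trip stays on the phone, which is what the 1–2 GB E2B footprint is sized for.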

Parikh outlines the core case for edge deployment across four axes: latency (real-time camera and AR use cases where cloud round-trips are prohibitive), privacy (sensitive document processing that should never leave the device), offline reliability, and cost reduction relative to cloud API token consumption. A demo of the Gallery app shows multi-skill orchestration, prompt-to-audio generation, and dynamic accelerator switching between CPU, GPU, and (upcoming) NPU — all running locally.
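The accelerator switching shown in the Gallery demo corresponds, in recent MediaPipe releases, to a preferred-backend option set at configuration time. A hedged sketch follows; NPU is omitted because the talk describes it as upcoming, and the model path is again illustrative.

```kotlin
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Sketch of choosing the accelerator when building inference options.
// Only CPU and GPU are shown; NPU support is described as upcoming.
fun buildOptions(useGpu: Boolean): LlmInference.LlmInferenceOptions =
    LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/gemma-4-e2b.task") // assumed model file
        .setPreferredBackend(
            if (useGpu) LlmInference.Backend.GPU else LlmInference.Backend.CPU
        )
        .build()
```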

The underlying runtime is LiteRT, Google’s on-device inference framework and the successor to TensorFlow Lite, which the team reports is deployed across more than 100,000 apps with billions of active users. The model format is cross-platform, supporting Android, iOS, macOS, Linux, Windows, and web from a single model file, with the sample app open-sourced on GitHub for developers to fork and extend.
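For developers starting from the open-sourced sample, the on-device stack is typically pulled in as a single Gradle dependency; the artifact coordinate below is the real MediaPipe GenAI package, but the version number is illustrative.

```kotlin
// Module-level build.gradle.kts; pin whichever version the sample app uses.
dependencies {
    implementation("com.google.mediapipe:tasks-genai:0.10.24")
}
```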


📺 Source: AI Engineer · Published May 05, 2026
🏷️ Format: Keynote Launch
