Running LLMs on your iPhone: 40 tok/s Gemma 4 with MLX — Adrien Grondin, Locally AI

Description:

Adrien Grondin, developer of the Locally AI app, delivers a technical walkthrough of running Google’s Gemma 4 model directly on iPhone using Apple’s MLX framework. Presented at the AI Engineer conference, the talk covers how MLX Swift LM — Apple’s open-source machine learning library optimized for Apple Silicon — enables on-device inference reaching 40 tokens per second on the latest iPhones when using an 8-bit quantized Gemma 4 model.
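
The library ships from Apple's ml-explore GitHub organization. As a minimal sketch (assuming the ml-explore/mlx-swift-examples repository, which publishes the MLXLLM and MLXLMCommon products, and with assumed platform minimums), a SwiftPM manifest pulling it into a project might look like this; in a regular Xcode app you would instead use File > Add Package Dependencies with the same URL:

```swift
// swift-tools-version: 5.9
// Minimal sketch of a manifest that depends on the MLX Swift LM libraries.
// Platform minimums below are assumptions; check the repository for the
// currently required versions.
import PackageDescription

let package = Package(
    name: "OnDeviceLLM",
    platforms: [.iOS(.v17), .macOS(.v14)],
    dependencies: [
        // Repository hosting MLXLLM, MLXVLM, MLXLMCommon, and related products.
        .package(url: "https://github.com/ml-explore/mlx-swift-examples", branch: "main")
    ],
    targets: [
        .executableTarget(
            name: "OnDeviceLLM",
            dependencies: [
                .product(name: "MLXLLM", package: "mlx-swift-examples")
            ]
        )
    ]
)
```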

Grondin explains the complete setup pipeline: integrating the MLX Swift LM GitHub repository into an iOS or macOS app, sourcing quantized model weights from the MLX Community on Hugging Face (which now hosts roughly 4,000 to 5,000 models), and choosing the right quantization level. He recommends staying between 4-bit and 8-bit, noting that dropping below 4-bit meaningfully degrades output quality. The talk also highlights the broader MLX ecosystem, including MLX VLM for vision-language models, MLX Audio for speech processing, and MLX Video for generative video.
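
As a hedged illustration of those steps, the sketch below uses the MLXLLM/MLXLMCommon API from mlx-swift-examples; the exact model repo id is an assumption (browse huggingface.co/mlx-community for real alternatives), and names may vary between library versions:

```swift
import MLXLLM
import MLXLMCommon

// Point at an 8-bit quantized model hosted by the mlx-community
// organization on Hugging Face. The repo id is an assumption; prefer
// 4-bit variants on older, memory-constrained devices.
let configuration = ModelConfiguration(id: "mlx-community/gemma-3-4b-it-8bit")

// Downloads the weights on first run, then loads them for inference.
let container = try await LLMModelFactory.shared.loadContainer(configuration: configuration)

// Run generation inside the container so the model context is accessed safely.
let result = try await container.perform { context in
    let input = try await context.processor.prepare(
        input: .init(prompt: "Explain quantization in one paragraph."))
    return try MLXLMCommon.generate(
        input: input,
        parameters: GenerateParameters(temperature: 0.6),
        context: context
    ) { tokens in
        // Called as tokens stream in; stop after a fixed budget.
        tokens.count >= 256 ? .stop : .more
    }
}
print(result.output)
```

On first launch the container fetches the weights from Hugging Face and caches them locally, so later runs load from disk without network access.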

Practical notes: older iPhones can still achieve usable speeds of around 20 tokens per second, and very small models in the 300–350 million parameter range can run inside iOS Shortcuts for lightweight automation tasks. The Locally AI app, which supports Gemma 4, Qwen, and other MLX-compatible models, is available for free on the App Store.
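
Exposing such a small model to Shortcuts would typically go through App Intents. The sketch below is a hypothetical intent, not the approach demonstrated in the talk; the model repo id and token budget are assumptions:

```swift
import AppIntents
import MLXLLM
import MLXLMCommon

// Hypothetical App Intent that makes a small on-device model callable
// from the Shortcuts app. The model id is an assumption; any model in
// the ~300M-parameter class from mlx-community would fit here.
struct AskOnDeviceModel: AppIntent {
    static var title: LocalizedStringResource = "Ask On-Device Model"

    @Parameter(title: "Prompt")
    var prompt: String

    func perform() async throws -> some IntentResult & ReturnsValue<String> {
        let config = ModelConfiguration(id: "mlx-community/SmolLM2-360M-Instruct-8bit")
        let container = try await LLMModelFactory.shared.loadContainer(configuration: config)
        let text = try await container.perform { context in
            let input = try await context.processor.prepare(input: .init(prompt: prompt))
            let result = try MLXLMCommon.generate(
                input: input,
                parameters: GenerateParameters(),
                context: context
            ) { tokens in
                // Keep answers short so the Shortcut returns quickly.
                tokens.count >= 128 ? .stop : .more
            }
            return result.output
        }
        return .result(value: text)
    }
}
```

In Shortcuts, the intent then appears as an action that takes a prompt string and returns the generated text for use in later steps.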


📺 Source: AI Engineer · Published April 20, 2026
🏷️ Format: Tutorial Demo
