Running LLMs on your iPhone: 40 tok/s Gemma 4 with MLX — Adrien Grondin, Locally AI

Description:

Adrien Grondin, developer of the Locally AI app, delivers a technical walkthrough of running Google’s Gemma 4 model directly on iPhone using Apple’s MLX framework. Presented at the AI Engineer conference, the talk covers how MLX Swift LM — Apple’s open-source machine learning library optimized for Apple Silicon — enables on-device inference reaching 40 tokens per second on the latest iPhones when using an 8-bit quantized Gemma 4 model.
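
The library ships from Apple's ml-explore GitHub organization. As a minimal sketch (assuming the ml-explore/mlx-swift-examples repository, which publishes the MLXLLM and MLXLMCommon products, and with assumed platform minimums), a SwiftPM manifest pulling it into a project might look like this; in a regular Xcode app you would instead use File > Add Package Dependencies with the same URL:

```swift
// swift-tools-version: 5.9
// Minimal sketch of a manifest that depends on the MLX Swift LM libraries.
// Platform minimums below are assumptions; check the repository for the
// currently required versions.
import PackageDescription

let package = Package(
    name: "OnDeviceLLM",
    platforms: [.iOS(.v17), .macOS(.v14)],
    dependencies: [
        // Repository hosting MLXLLM, MLXVLM, MLXLMCommon, and related products.
        .package(url: "https://github.com/ml-explore/mlx-swift-examples", branch: "main")
    ],
    targets: [
        .executableTarget(
            name: "OnDeviceLLM",
            dependencies: [
                .product(name: "MLXLLM", package: "mlx-swift-examples")
            ]
        )
    ]
)
```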

Grondin explains the complete setup pipeline: integrating the MLX Swift LM GitHub repository into an iOS or macOS app, sourcing quantized model weights from the MLX Community on Hugging Face (which now hosts roughly 4,000 to 5,000 models), and choosing the right quantization level. He recommends staying between 4-bit and 8-bit, noting that dropping below 4-bit meaningfully degrades output quality. The talk also highlights the broader MLX ecosystem, including MLX VLM for vision-language models, MLX Audio for speech processing, and MLX Video for generative video.
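
As a hedged illustration of those steps, the sketch below uses the MLXLLM/MLXLMCommon API from mlx-swift-examples; the exact model repo id is an assumption (browse huggingface.co/mlx-community for real alternatives), and names may vary between library versions:

```swift
import MLXLLM
import MLXLMCommon

// Point at an 8-bit quantized model hosted by the mlx-community
// organization on Hugging Face. The repo id is an assumption; prefer
// 4-bit variants on older, memory-constrained devices.
let configuration = ModelConfiguration(id: "mlx-community/gemma-3-4b-it-8bit")

// Downloads the weights on first run, then loads them for inference.
let container = try await LLMModelFactory.shared.loadContainer(configuration: configuration)

// Run generation inside the container so the model context is accessed safely.
let result = try await container.perform { context in
    let input = try await context.processor.prepare(
        input: .init(prompt: "Explain quantization in one paragraph."))
    return try MLXLMCommon.generate(
        input: input,
        parameters: GenerateParameters(temperature: 0.6),
        context: context
    ) { tokens in
        // Called as tokens stream in; stop after a fixed budget.
        tokens.count >= 256 ? .stop : .more
    }
}
print(result.output)
```

On first launch the container fetches the weights from Hugging Face and caches them locally, so later runs load from disk without network access.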

Practical notes: older iPhones can still achieve usable speeds of around 20 tokens per second, and very small models in the 300–350 million parameter range can run inside iOS Shortcuts for lightweight automation tasks. The Locally AI app, which supports Gemma 4, Qwen, and other MLX-compatible models, is available for free on the App Store.
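
Exposing such a small model to Shortcuts would typically go through App Intents. The sketch below is a hypothetical intent, not the approach demonstrated in the talk; the model repo id and token budget are assumptions:

```swift
import AppIntents
import MLXLLM
import MLXLMCommon

// Hypothetical App Intent that makes a small on-device model callable
// from the Shortcuts app. The model id is an assumption; any model in
// the ~300M-parameter class from mlx-community would fit here.
struct AskOnDeviceModel: AppIntent {
    static var title: LocalizedStringResource = "Ask On-Device Model"

    @Parameter(title: "Prompt")
    var prompt: String

    func perform() async throws -> some IntentResult & ReturnsValue<String> {
        let config = ModelConfiguration(id: "mlx-community/SmolLM2-360M-Instruct-8bit")
        let container = try await LLMModelFactory.shared.loadContainer(configuration: config)
        let text = try await container.perform { context in
            let input = try await context.processor.prepare(input: .init(prompt: prompt))
            let result = try MLXLMCommon.generate(
                input: input,
                parameters: GenerateParameters(),
                context: context
            ) { tokens in
                // Keep answers short so the Shortcut returns quickly.
                tokens.count >= 128 ? .stop : .more
            }
            return result.output
        }
        return .result(value: text)
    }
}
```

In Shortcuts, the intent then appears as an action that takes a prompt string and returns the generated text for use in later steps.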


📺 Source: AI Engineer · Published April 20, 2026
🏷️ Format: Tutorial Demo
