Descriptions:
The Alphastack channel walks through a custom cross-platform app that runs Google’s Gemma 4 12B multimodal model entirely on-device — no cloud connection required — on both Windows and Android. The architecture splits into three layers: a single HTML file for the UI (rendered via Edge WebView on Windows and Android System WebView on Android), a Python Flask server that manages the inference engine as a subprocess, and llama.cpp build B9512 as the inference backend. The model runs in GGUF format using Unslaught’s dynamic quantization, bringing the full 12B parameter model down to roughly 7GB.
The video covers the complete build pipeline, including how PyInstaller bundles Python, Flask, and the full llama.cpp CUDA build into a self-contained Windows EXE, and how the Android APK compiles llama.cpp directly via the Android NDK as a native library running in a foreground service. A smaller E2B model ships with the app by default; the full 12B can be downloaded through an in-app settings menu. Multimodal vision support works by loading an additional “mm projector” file alongside the base model, converting images into tokens the model can process.
Live inference demos show the app streaming chain-of-thought reasoning separately from the final answer, collapsing the thinking section once the response is complete. The creator notes real-world memory pressure — the 12B model consumed nearly all available GPU VRAM during recording alongside other processes — but emphasizes that on a dedicated machine performance should be significantly better. Both the Windows EXE and Android APK are available for download via links in the video description.
📺 Source: Alphastack · Published June 05, 2026
🏷️ Format: Hands On Build







