Description:
Microsoft’s Phi-4-Reasoning-Vision is a newly released 15-billion-parameter open-weight multimodal model built for two focused tasks: mathematical and scientific reasoning over visual inputs such as charts, equations, and diagrams, and computer-use agent tasks that require reading a screen and identifying click targets. Fahd Mirza provides a full local deployment walkthrough on Ubuntu using an NVIDIA RTX 6000 GPU with 48GB VRAM.
Installation is more involved than a typical Hugging Face pull: Phi-4-Reasoning-Vision uses a mid-fusion architecture in which a SigLIP vision encoder (a Google-developed image-to-token converter) feeds visual tokens through an MLP projection layer into the middle of the Phi-4-Reasoning language backbone, rather than at the input or output stage. Because vLLM does not natively support this architecture, a custom plugin must be installed first. Mirza walks through each step, including the plugin install, the model serving command, and a Python client for submitting image-plus-text prompts. Once loaded, the model consumes approximately 44 GB of VRAM. He then demos the model via Open WebUI, testing it on handwritten math equations and complex algebra problems, where the model spends two to three minutes in chain-of-thought reasoning before returning a correct answer.
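The video's Python client isn't reproduced here, but vLLM serves an OpenAI-compatible `/v1/chat/completions` endpoint, so an image-plus-text request can be sketched with the standard library alone. This is a minimal sketch, not Mirza's exact script: the served model name, port, and image filename below are assumptions to adjust to your own serve command.

```python
import base64
import json
import urllib.request

def build_messages(image_bytes: bytes, question: str) -> list:
    """Build an OpenAI-style chat payload embedding the image as a base64 data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            # Inline image part: data URL avoids hosting the file anywhere.
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
            # Text part: the question about the image.
            {"type": "text", "text": question},
        ],
    }]

if __name__ == "__main__":
    # Assumed local endpoint, model id, and image file -- match them to
    # whatever `vllm serve` command and files you actually used.
    with open("equation.png", "rb") as f:
        payload = {
            "model": "phi-4-reasoning-vision",
            "messages": build_messages(f.read(), "Solve the equation shown."),
        }
    req = urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        answer = json.loads(resp.read())["choices"][0]["message"]["content"]
        print(answer)
```

Because the model spends minutes in chain-of-thought, a real client should raise the HTTP timeout well above the default.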
Benchmark highlights include an 88.2% score on ScreenSpot v2 for GUI grounding and competitive MathVista results. Mirza notes that 3VL 32B outperforms it on most reasoning benchmarks, but frames Phi-4-Reasoning-Vision as a model optimized for local use rather than leaderboard dominance, a design tradeoff he views positively.
📺 Source: Fahd Mirza · Published March 05, 2026
🏷️ Format: Tutorial Demo
