Description:
Microsoft’s Phi-4-Reasoning-Vision is a newly released 15-billion-parameter open-weight multimodal model built for two focused tasks: mathematical and scientific reasoning over visual inputs such as charts, equations, and diagrams, and computer-use agent tasks that require reading a screen and identifying click targets. Fahd Mirza provides a full local deployment walkthrough on Ubuntu using an NVIDIA RTX 6000 GPU with 48GB VRAM.
Installation is more involved than a typical Hugging Face pull: Phi-4-Reasoning-Vision uses a mid-fusion architecture in which a SigLIP vision encoder (a Google-developed image-to-token converter) feeds visual tokens through an MLP projection layer into the middle of the Phi-4-Reasoning language backbone, rather than at the input or output stage. Because vLLM does not natively support this architecture, a custom plugin must be installed first. Mirza walks through each step, including the plugin install, the model serving command, and a Python client for submitting image-plus-text prompts. Once loaded, the model consumes approximately 44 GB of VRAM. He then demos the model via Open WebUI, testing it on handwritten math equations and complex algebra problems, where the model spends two to three minutes in chain-of-thought reasoning before returning a correct answer.
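The video's Python client isn't reproduced here, but vLLM serves an OpenAI-compatible `/v1/chat/completions` endpoint, so an image-plus-text request can be sketched with the standard library alone. This is a minimal sketch, not Mirza's exact script: the served model name, port, and image filename below are assumptions to adjust to your own serve command.

```python
import base64
import json
import urllib.request

def build_messages(image_bytes: bytes, question: str) -> list:
    """Build an OpenAI-style chat payload embedding the image as a base64 data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            # Inline image part: data URL avoids hosting the file anywhere.
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
            # Text part: the question about the image.
            {"type": "text", "text": question},
        ],
    }]

if __name__ == "__main__":
    # Assumed local endpoint, model id, and image file -- match them to
    # whatever `vllm serve` command and files you actually used.
    with open("equation.png", "rb") as f:
        payload = {
            "model": "phi-4-reasoning-vision",
            "messages": build_messages(f.read(), "Solve the equation shown."),
        }
    req = urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        answer = json.loads(resp.read())["choices"][0]["message"]["content"]
        print(answer)
```

Because the model spends minutes in chain-of-thought, a real client should raise the HTTP timeout well above the default.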
Benchmark highlights include an 88.2% score on ScreenSpot v2 for GUI grounding and competitive MathVista results. Mirza notes that 3VL 32B outperforms it on most reasoning benchmarks, but frames Phi-4-Reasoning-Vision as a model optimized for local use rather than leaderboard dominance, a design tradeoff he views positively.
📺 Source: Fahd Mirza · Published March 05, 2026
🏷️ Format: Tutorial Demo
