Description:
Fahd Mirza walks through a complete local installation of UI-Venus-1.5, a GUI agent model from Inclusion AI designed to navigate websites and applications autonomously by analyzing screenshots. Built on top of Qwen3-VL, the model accepts a screenshot plus a plain-English instruction and outputs the exact action to perform (click, type, or scroll), making it suitable for building screen-control automation pipelines without hardcoded UI selectors.
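The video does not pin down the model's exact output grammar, so as a rough sketch only: assuming the model emits action strings like `click(812, 430)` or `type("hello")` (a hypothetical format, not UI-Venus's documented one), a driver in an automation pipeline might parse them into structured actions like this:

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class UIAction:
    kind: str                   # "click", "type", or "scroll"
    x: Optional[int] = None     # screen coordinates for click/scroll
    y: Optional[int] = None
    text: Optional[str] = None  # payload for type actions

def parse_action(raw: str) -> UIAction:
    """Parse a hypothetical action string, e.g.:
    click(812, 430) | type("hello") | scroll(0, -300)
    The grammar here is an assumption for illustration only."""
    raw = raw.strip()
    m = re.match(r'(click|scroll)\((-?\d+),\s*(-?\d+)\)$', raw)
    if m:
        return UIAction(kind=m.group(1), x=int(m.group(2)), y=int(m.group(3)))
    m = re.match(r'type\("(.*)"\)$', raw)
    if m:
        return UIAction(kind="type", text=m.group(1))
    raise ValueError(f"unrecognized action: {raw!r}")
```

A structured action like this can then be handed to any screen-control backend (e.g. a browser driver or ADB) without the model knowing anything about UI selectors.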
Mirza runs the installation on Ubuntu with an RTX 6000 GPU (48GB VRAM) using vLLM and Hugging Face Transformers, demonstrating the 2-billion-parameter variant, which comes in under 5GB on disk and occupies under 25GB of VRAM when loaded. The model's development followed four stages: base Qwen3-VL pretraining, large-scale GUI data pretraining, reinforcement learning across separate mobile and web grounding tasks, and final merging of the specialized checkpoints into one unified model. Benchmark results show 77.6% on AndroidWorld and 69.6% on ScreenSpot Pro, outperforming GPT-4 on both evaluations. Model sizes span 2B, 8B, and 30B parameters.
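The disk and VRAM figures quoted above are consistent with bf16 weights (2 bytes per parameter) plus runtime overhead; a back-of-envelope estimate, assuming bf16 storage:

```python
def bf16_weight_gb(num_params: float) -> float:
    """Approximate weight size in GB at bf16 precision (2 bytes/param)."""
    return num_params * 2 / 1e9

# 2B parameters at bf16 -> roughly 4 GB of weights, matching the
# under-5GB on-disk figure. The gap up to the ~25GB VRAM footprint
# is activations, the KV cache, and vLLM's preallocated GPU blocks.
print(round(bf16_weight_gb(2e9), 1))   # 4.0
print(round(bf16_weight_gb(30e9), 1))  # 60.0 -> the 30B flagship
                                       # exceeds a single 48GB card at bf16
```

This is why the 2B variant fits comfortably on the demo's RTX 6000, while the larger variants would need quantization or multiple GPUs.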
The live demo covers coordinate-based output for precise UI element location, handling both mobile and desktop screenshots, and navigating a real YouTube channel page. Mirza notes the model currently performs strongest on Chinese-language apps and recommends the mixture-of-experts flagship variant for production or customer-facing deployments, while the 2B model is sufficient for evaluation and prototyping.
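Coordinate-grounded GUI models often emit positions normalized to a fixed range rather than raw pixels, so the same output works across screenshot resolutions. Assuming a 0-1000 normalized space (a common convention for such models, not confirmed for UI-Venus here), mapping a predicted point back onto the actual screenshot is a one-liner:

```python
def to_pixels(nx: int, ny: int, width: int, height: int,
              scale: int = 1000) -> tuple[int, int]:
    """Map a normalized point (nx, ny) in [0, scale] to pixel
    coordinates for a screenshot of the given width and height."""
    return round(nx * width / scale), round(ny * height / scale)

# A predicted point at (500, 250) on a 1920x1080 desktop screenshot
# lands at the horizontal center, a quarter of the way down:
print(to_pixels(500, 250, 1920, 1080))  # (960, 270)
```

The same normalized point resolves correctly on a mobile screenshot too, which is what lets one model handle both the mobile and desktop cases shown in the demo.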
📺 Source: Fahd Mirza · Published March 01, 2026
🏷️ Format: Tutorial Demo
