MolmoWeb – Fully Open Multimodal Web Agents – Control Browser Locally


Description:

MolmoWeb is a fully open-source visual web agent released by the Allen Institute for AI (AI2) that autonomously controls a real web browser using only screenshot-based visual perception — no HTML parsing, no structured page data, just pixels, exactly as a human would navigate. In this hands-on walkthrough, Fahd Mirza installs and runs MolmoWeb locally on an Ubuntu machine equipped with an NVIDIA RTX 6000 GPU (48GB VRAM), consuming approximately 17GB of VRAM during inference with the 8-billion-parameter variant.
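The screenshot-only control style described above can be sketched as a simple observe-decide-act cycle. This is a minimal illustration, not MolmoWeb's actual API: the `Action` type and `decide` function are hypothetical stand-ins, and a real run would send the PNG bytes to the locally served model rather than use the stub logic shown here.

```python
from dataclasses import dataclass

@dataclass
class Action:
    """A single browser action the agent chooses from pixels alone."""
    kind: str          # e.g. "click", "type", "done"
    x: int = 0
    y: int = 0
    text: str = ""

def decide(screenshot_png: bytes) -> Action:
    """Stand-in for the vision-language model: it sees ONLY screenshot
    bytes (no HTML, no DOM) and returns the next action.

    Hypothetical logic for illustration; a real agent would POST the PNG
    to the locally served model (e.g. on port 8001, as in the video) and
    parse the model's response into an Action.
    """
    if len(screenshot_png) == 0:
        return Action(kind="done")
    return Action(kind="click", x=100, y=200)

action = decide(b"\x89PNG...fake screenshot bytes")
print(action.kind, action.x, action.y)  # → click 100 200
```

The key point the sketch captures is the interface: the model's only input is the rendered screenshot, exactly the information a human user would have.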

The installation process is demonstrated step-by-step using UV (a Python package manager), Playwright, and headless Chromium to download weights from Hugging Face and serve the model locally on port 8001. A live test task — finding the cheapest non-stop flight from Sydney to Jakarta in May 2026 — shows the agent opening a browser, entering values, navigating search results, and returning a detailed answer after 25 steps, with a full HTML trajectory log capturing every screenshot, thought, and action taken.
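The stepwise run and its trajectory log can be sketched as follows. This is a hypothetical mock of the loop's shape, assuming a per-step record of screenshot, thought, and action as described above; the function and field names are illustrative and do not reflect MolmoWeb's real implementation.

```python
def run_agent(task: str, max_steps: int = 25) -> list[dict]:
    """Hypothetical observe-think-act loop that accumulates a trajectory
    log, mirroring the run shown in the video: one screenshot, one thought,
    and one action recorded per step, ending with an answer.
    """
    trajectory = []
    for step in range(1, max_steps + 1):
        # Placeholder values: a real run would save an actual PNG and
        # record the model's generated reasoning and chosen action.
        screenshot = f"step_{step}.png"
        thought = f"Deciding next action for task: {task}"
        action = "answer" if step == max_steps else "navigate"
        trajectory.append({"step": step, "screenshot": screenshot,
                           "thought": thought, "action": action})
        if action == "answer":
            break
    return trajectory

log = run_agent("cheapest non-stop flight Sydney -> Jakarta, May 2026")
print(len(log), log[-1]["action"])  # → 25 answer
```

A trajectory log structured this way is straightforward to render as the kind of HTML report the video shows, with each step's screenshot alongside the model's reasoning.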

What distinguishes MolmoWeb from most web agents is its complete openness: model weights, training datasets, and evaluation tools are all publicly released, which remains rare in this space. Despite its relatively compact size, AI2 reports that MolmoWeb outperforms agents built on top of much larger closed models on several benchmarks. The video serves as a practical guide for developers interested in running a capable, locally hosted, open-weight browser agent without depending on proprietary APIs.


📺 Source: Fahd Mirza · Published March 25, 2026
🏷️ Format: Hands On Build
