Pi Coding Agent Observability: HTML Specs with Gemini 3.5 Flash and GPT Image 2

Description:

IndyDevDan, an engineer with 15 years of experience, runs a structured comparison of three specification formats for AI coding agents: markdown, HTML, and an enhanced ‘VSpec’ that incorporates visual components. He runs three parallel Gemini 3.5 Flash agents, each instrumented with a custom Pi observability dashboard. The experiment is motivated by Anthropic’s viral post on the ‘unreasonable effectiveness of HTML’ and by OpenAI’s GPT Image 2, and it asks whether denser, richer specs actually produce better agent behavior once cost and speed are factored in.

Each agent receives the same underlying task: build a planning spec for a ‘Steelman’ product agent that generates UI-backed counter-arguments for investment theses. The observability dashboard streams every event, tool call, and turn from all three agents in real time, making it possible to compare not just final output quality but also each agent’s internal reasoning process. The markdown agent completed its planning phase in 29 turns, the HTML agent in 25 turns, and the enhanced HTML (VSpec) agent in just 17 turns. Token totals, however, were counterintuitively higher for the markdown agent, which suggests it explored the codebase more thoroughly.
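As a rough illustration of how a per-agent comparison of turns, tool calls, and tokens could be derived from a shared event stream, here is a minimal Python sketch. The event shape, the field names, the `AgentStats` class, and the `summarize` function are all assumptions made for this example; the video's actual dashboard schema is not described here, and the sample events are invented rather than taken from the experiment.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AgentStats:
    """Running totals for one agent (field names are illustrative)."""
    turns: int = 0
    tool_calls: int = 0
    tokens: int = 0


def summarize(events):
    """Fold a stream of event dicts into per-agent totals.

    Assumed event shape (hypothetical, not the dashboard's real schema):
    {"agent": str, "type": "turn" | "tool_call", "tokens": int}
    """
    stats = defaultdict(AgentStats)
    for event in events:
        s = stats[event["agent"]]
        if event["type"] == "turn":
            s.turns += 1
        elif event["type"] == "tool_call":
            s.tool_calls += 1
        # Every event may carry a token cost; a missing field counts as zero.
        s.tokens += event.get("tokens", 0)
    return dict(stats)


# Made-up sample events, for illustration only.
sample_events = [
    {"agent": "markdown", "type": "turn", "tokens": 1200},
    {"agent": "markdown", "type": "tool_call", "tokens": 300},
    {"agent": "html", "type": "turn", "tokens": 900},
    {"agent": "markdown", "type": "turn", "tokens": 800},
]

summary = summarize(sample_events)
for agent, s in sorted(summary.items()):
    print(f"{agent}: turns={s.turns} tool_calls={s.tool_calls} tokens={s.tokens}")
```

Keeping turn counts and token totals as separate fields matters for a comparison like this one, since an agent can finish in fewer turns while still spending more tokens, or the reverse.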

The second half of the video demonstrates the product agent live, generating a bear-case steelman for Apple as an AI distribution play, complete with dynamically generated UI components, including a pie chart of Mac Mini revenue versus Apple’s broader product lines. IndyDevDan’s key thesis is that agent observability is not optional infrastructure: it is the only way to understand why different prompts produce different behaviors at scale.

📺 Source: IndyDevDan · Published June 01, 2026
🏷️ Format: Benchmark Test

Tags

Anthropic, Gemini Flash 3.5, Gemini Omni, Google, GPT Image 2, GPT-55, IndyDevDan, OpenAI, Pi Coding Agent
