Descriptions:
How I AI introduces the How I AI Bench, a repeatable, multi-dimensional evaluation framework built with Claude Code, and runs Claude Sonnet 5 through it on launch day. Tired of one-off vibe checks, the creator wanted a structured, blind test that could be reused as new models arrive, with scoring across PRD writing, UI prototype generation (wireframes and full-fidelity), agentic bug hunting, and voice personality.
The benchmark covers 64 total generations across five scenarios: a scheduling app, an editorial assignment desk, a creative marketplace, a habit coach mobile app, and a multi-step agentic codebase search. Every output is scored by hand using product design judgment, then re-judged by GPT-5.5 and Opus 4.8 acting as AI evaluators. The voice test is particularly distinctive: the creator rates whether each model’s conversational personality is one they’d actually want to interact with, using prompts like “deploys are red again” and “let’s just yolo post straight to prod.”
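As a rough illustration of what a blind, multi-dimensional scoring pass can look like, here is a minimal Python sketch. It is not the creator's actual harness: the function names, the "Entry A" label format, the placeholder model names, and the sample scores are all hypothetical and exist only to show the idea of hiding model identity before scoring and then averaging per dimension.

```python
import random
from statistics import mean


def blind_labels(model_names, seed=0):
    """Assign each model an anonymous label so the scorer cannot see its name.

    Hypothetical helper: the label format ("Entry A", "Entry B", ...) is an
    assumption, not something taken from the episode.
    """
    rng = random.Random(seed)
    shuffled = list(model_names)
    rng.shuffle(shuffled)  # shuffle so label order does not leak model order
    return {name: f"Entry {chr(ord('A') + i)}" for i, name in enumerate(shuffled)}


def average_by_dimension(scores):
    """Average scores per anonymous label and per dimension.

    `scores` is a list of (label, dimension, score) tuples, for example one
    tuple per judge per output.
    """
    grouped = {}
    for label, dimension, score in scores:
        grouped.setdefault(label, {}).setdefault(dimension, []).append(score)
    return {
        label: {dimension: mean(values) for dimension, values in dims.items()}
        for label, dims in grouped.items()
    }


# Placeholder model names and made-up scores, purely for illustration.
labels = blind_labels(["model-1", "model-2", "model-3"])
sample_scores = [
    ("Entry A", "prd_writing", 4),
    ("Entry A", "prd_writing", 2),
    ("Entry A", "voice", 5),
]
print(average_by_dimension(sample_scores))
```

The mapping from real model name to label would only be consulted after all scores are recorded, which is what keeps the test blind.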
On Anthropic’s official benchmarks, Sonnet 5 lands close to Opus 4.8 on SWE-Pro and Terminal Bench 2.1 at a lower cost, with pricing at $2 per million input tokens and $10 per million output tokens through summer 2026. The episode is structured so that results are revealed live, with the final scoring still running in a sub-agent as filming begins, which makes for an unusually transparent and unscripted model evaluation.
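To make the quoted pricing concrete, the arithmetic can be sketched in a few lines of Python. Only the two rates ($2 and $10 per million tokens) come from the description above; the function name and the token counts in the usage lines are hypothetical.

```python
# Rates quoted in the description, in US dollars per million tokens.
INPUT_PRICE_PER_MILLION = 2.00
OUTPUT_PRICE_PER_MILLION = 10.00


def run_cost(input_tokens, output_tokens):
    """Return the dollar cost of one run at the quoted per-million rates."""
    total = (
        input_tokens * INPUT_PRICE_PER_MILLION
        + output_tokens * OUTPUT_PRICE_PER_MILLION
    )
    return total / 1_000_000


# Hypothetical token counts, chosen only to show the calculation.
print(run_cost(1_000_000, 1_000_000))  # 2 + 10 = 12.0 dollars
print(run_cost(500_000, 100_000))      # 1 + 1 = 2.0 dollars
```

Because output tokens cost five times as much as input tokens at these rates, generation-heavy tasks such as full-fidelity prototypes would dominate the bill.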
📺 Source: How I AI · Published June 30, 2026
🏷️ Format: Benchmark Test







