I benchmarked the NEW Sonnet 5. The results shocked me.

Benchmarks · 4 days ago

Descriptions:

How I AI introduces the How I AI Bench, a repeatable, multi-dimensional evaluation framework built with Claude Code, and runs Claude Sonnet 5 through it on launch day. Tired of one-off vibe checks, the creator wanted a structured, blind test that could be reused as new models arrive, with scoring across PRD writing, UI prototype generation (wireframes and full-fidelity), agentic bug hunting, and voice personality.

The benchmark covers 64 total generations across five scenarios: a scheduling app, an editorial assignment desk, a creative marketplace, a habit coach mobile app, and a multi-step agentic codebase search. All outputs are scored by hand using product design judgment, then re-judged by GPT-5.5 and Opus 4.8 as AI evaluators. The voice test is particularly distinctive: the creator rates whether each model’s conversational personality is one they’d actually want to interact with, using prompts like “deploys are red again” and “let’s just yolo post straight to prod.”
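The description does not show how the blind scoring is implemented, so the following is only a minimal sketch of one common way to do it: hide model names behind shuffled anonymous labels while rating, then map the labels back and average the hand scores. The function names, the label scheme, and the score tuples are all hypothetical and are not taken from the video.

```python
import random
from collections import defaultdict
from statistics import mean


def blind_labels(model_names, seed=0):
    """Assign anonymous labels (A, B, ...) to model names in shuffled order.

    Hypothetical helper: the rater sees only the labels while scoring.
    """
    rng = random.Random(seed)  # seeded so the key can be reproduced later
    shuffled = list(model_names)
    rng.shuffle(shuffled)
    return {chr(ord("A") + i): name for i, name in enumerate(shuffled)}


def unblind_means(scores, key):
    """Average scores per model after rating is finished.

    scores: iterable of (label, task, score) tuples recorded by the rater.
    key:    mapping from anonymous label back to model name.
    """
    by_model = defaultdict(list)
    for label, _task, score in scores:
        by_model[key[label]].append(score)
    return {model: mean(values) for model, values in by_model.items()}
```

The key returned by `blind_labels` would be kept out of sight until every output has a score, and `unblind_means` applied only afterwards.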

On Anthropic’s official benchmarks, Sonnet 5 lands close to Opus 4.8 on SWE-Pro and Terminal Bench 2.1 at a lower cost, with pricing at $2 per million input tokens and $10 per million output tokens through summer 2026. The episode reveals its results live: the final scoring is still running in a sub-agent as filming begins, which makes for an unusually transparent and unscripted model evaluation.
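To make the quoted pricing concrete, here is a small sketch of the arithmetic at $2 per million input tokens and $10 per million output tokens. The function name and the token counts in the usage note are illustrative assumptions; the description gives no token figures for the benchmark runs.

```python
# Prices as quoted in the description (USD per million tokens).
INPUT_USD_PER_MTOK = 2.00
OUTPUT_USD_PER_MTOK = 10.00


def run_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD for the given token counts at the quoted rates."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000
```

For example, if each of the 64 generations used a hypothetical 3,000 input and 2,000 output tokens, `run_cost_usd(64 * 3_000, 64 * 2_000)` works out to about $1.66, with most of that coming from output tokens.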

📺 Source: How I AI · Published June 30, 2026
🏷️ Format: Benchmark Test

Channels

How I AI

Companies

Anthropic

Tags

Anthropic, Claude Code, Claude Opus 4.8, Claude Sonnet 4.6, Gemini 3 Pro, GPT-5.5
