Description:
In this conference talk, Peter Gostev — head of AI at Moonpig and contributor to Arena.ai — makes the case that benchmark leaderboards are hiding persistent, underappreciated failures in today’s frontier models. Drawing on two distinct data sources, he argues that the steady upward trend in aggregate scores obscures specific capability gaps that matter a lot in production.
The first lens is BullshitBench, Gostev’s own open-source benchmark comprising 155 deliberately nonsensical questions designed to test whether models will push back or simply generate confident-sounding answers to unanswerable prompts. Results show that GPT-4o and Gemini models accept the nonsense roughly 50% of the time, while Claude models and some Qwen variants fare considerably better. The second lens is previously unpublished Arena.ai data, derived from more than 5.5 million human preference votes collected since Q2 2023 across roughly 700 tracked models. Gostev highlights the “both bad” vote mechanic — where users can flag that neither model gave a good response — as an underused signal for detecting systematic failure modes rather than just relative quality.
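To make the failure mode concrete, here is a minimal sketch of the kind of check a nonsense-question benchmark implies: give a model a deliberately unanswerable prompt and score whether it pushes back or answers as if the question were valid. This is an illustrative approximation, not the actual BullshitBench harness; the refusal markers, function names, and sample outputs below are assumptions.

```python
# Sketch of a "does the model push back?" scorer for nonsense prompts.
# NOT the real BullshitBench code; REFUSAL_MARKERS and the sample
# answers are hypothetical placeholders for illustration only.

REFUSAL_MARKERS = [
    "doesn't make sense",
    "does not make sense",
    "not possible to answer",
    "no meaningful answer",
    "could you clarify",
]

def pushes_back(answer: str) -> bool:
    """Crude heuristic: did the model question the premise instead of answering?"""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def acceptance_rate(answers: list[str]) -> float:
    """Share of nonsense prompts the model answered as if they were valid."""
    accepted = sum(1 for a in answers if not pushes_back(a))
    return accepted / len(answers)

if __name__ == "__main__":
    # Hypothetical model outputs to two deliberately nonsensical prompts.
    sample = [
        "The square root of Tuesday is approximately 4.7.",
        "That question does not make sense as stated. Could you clarify?",
    ]
    print(f"acceptance rate: {acceptance_rate(sample):.0%}")  # prints 50%
```

In practice a keyword heuristic like this is only a first pass; a graded judgment (human or LLM-as-judge) would be needed to separate genuine pushback from hedged but still confabulated answers.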
The talk is particularly useful for teams building agent pipelines, where a model that blindly executes nonsensical instructions rather than questioning them can cause serious downstream errors. Gostev’s core thesis: model sycophancy and task-at-any-cost training create a fragility that aggregate benchmarks were never designed to catch.
📺 Source: AI Engineer · Published April 24, 2026
🏷️ Format: Benchmark Test
