Description:
In this conference talk, Peter Gostev — head of AI at Moonpig and contributor to Arena.ai — makes the case that benchmark leaderboards are hiding persistent, underappreciated failures in today’s frontier models. Drawing on two distinct data sources, he argues that the steady upward trend in aggregate scores obscures specific capability gaps that matter a lot in production.
The first lens is BullshitBench, Gostev’s own open-source benchmark comprising 155 deliberately nonsensical questions designed to test whether models will push back or simply generate confident-sounding answers to unanswerable prompts. Results show that GPT-4o and Gemini models accept the nonsense roughly 50% of the time, while Claude models and some Qwen variants fare considerably better. The second lens is previously unpublished Arena.ai data, derived from more than 5.5 million human preference votes collected since Q2 2023 across roughly 700 tracked models. Gostev highlights the “both bad” vote mechanic — where users can flag that neither model gave a good response — as an underused signal for detecting systematic failure modes rather than just relative quality.
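To make the failure mode concrete, here is a minimal sketch of the kind of check a nonsense-question benchmark implies: give a model a deliberately unanswerable prompt and score whether it pushes back or answers as if the question were valid. This is an illustrative approximation, not the actual BullshitBench harness; the refusal markers, function names, and sample outputs below are assumptions.

```python
# Sketch of a "does the model push back?" scorer for nonsense prompts.
# NOT the real BullshitBench code; REFUSAL_MARKERS and the sample
# answers are hypothetical placeholders for illustration only.

REFUSAL_MARKERS = [
    "doesn't make sense",
    "does not make sense",
    "not possible to answer",
    "no meaningful answer",
    "could you clarify",
]

def pushes_back(answer: str) -> bool:
    """Crude heuristic: did the model question the premise instead of answering?"""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def acceptance_rate(answers: list[str]) -> float:
    """Share of nonsense prompts the model answered as if they were valid."""
    accepted = sum(1 for a in answers if not pushes_back(a))
    return accepted / len(answers)

if __name__ == "__main__":
    # Hypothetical model outputs to two deliberately nonsensical prompts.
    sample = [
        "The square root of Tuesday is approximately 4.7.",
        "That question does not make sense as stated. Could you clarify?",
    ]
    print(f"acceptance rate: {acceptance_rate(sample):.0%}")  # prints 50%
```

In practice a keyword heuristic like this is only a first pass; a graded judgment (human or LLM-as-judge) would be needed to separate genuine pushback from hedged but still confabulated answers.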
The talk is particularly useful for teams building agent pipelines, where a model that blindly executes nonsensical instructions rather than questioning them can cause serious downstream errors. Gostev’s core thesis: model sycophancy and task-at-any-cost training create a fragility that aggregate benchmarks were never designed to catch.
📺 Source: AI Engineer · Published April 24, 2026
🏷️ Format: Benchmark Test
