Agentic Evaluations at Scale, For Everybody — Nicholas Kang & Michael Aaron, Google DeepMind

Description:

Nicholas Kang, product manager for Kaggle Benchmarks at Google DeepMind, and Michael Aaron, a Kaggle software engineer, argue that AI evaluations are systemically broken and describe what Kaggle is building to fix them. Speaking at the AI Engineer conference, they identify three structural failures. First, benchmarks are decentralized and go stale quickly: more than 10 new ones appear on arXiv every day. Second, eval results lack transparency and reproducibility: a competing lab re-ran a Kaggle benchmark using its own API’s prompt caching and published substantially better numbers. Third, the roughly 30,000 AI researchers who create all benchmarks cannot cover the full range of capabilities needed to serve 30 million technical users and, eventually, billions of AI consumers.

Kaggle’s answer is a two-part platform. The first part is a community hackathon system that draws on Kaggle’s 30+ million users to create novel evaluations, with infrastructure for dataset hosting, model API access, and collaborative writeups. The second and newer part is Standardized Agent Exams, described as an experimental MVP launched the week of the talk: developers pass a one-line prompt to their agent, the agent takes an exam, and it receives a score on a leaderboard. The team frames this as a lightweight baseline safety check to run before deploying agents on real-world tasks such as managing inboxes or e-commerce accounts.

The talk is candid about unsolved problems, including the difficulty of using AI judges to evaluate creativity and innovation, and the challenge of aligning human expert reviewers. For teams building or evaluating agents, the Standardized Agent Exams product is the most immediately actionable takeaway.

📺 Source: AI Engineer · Published May 25, 2026
🏷️ Format: Keynote Launch

Tags: DeepMind, Grok, SWE-bench
