Descriptions:
Nicholas Kang, product manager for Kaggle Benchmarks at Google DeepMind, and Michael Aaron, a Kaggle software engineer, argue that AI evaluations are systemically broken and describe what Kaggle is building to fix them. Speaking at the AI Engineer conference, they identify three structural failures. First, benchmarks are decentralized and go stale fast (over 10 new ones appear daily on arXiv). Second, eval results lack transparency and reproducibility (a competing lab re-ran a Kaggle benchmark using its own API’s prompt caching and published substantially better numbers). Third, the roughly 30,000 AI researchers who create all benchmarks cannot possibly cover the full surface area of capabilities needed to serve 30 million technical users and billions of eventual AI consumers.
Kaggle’s answer is a two-part platform. The first part is a community hackathon system that draws on Kaggle’s 30+ million users to create novel evaluations, with infrastructure for dataset hosting, model API access, and collaborative writeups. The second, newer part is Standardized Agent Exams, described as an experimental MVP launched the week of the talk: developers pass a one-line prompt to their agent, the agent takes an exam, and it receives a scored leaderboard result. The team frames this as a lightweight baseline safety check to run before deploying agents to manage real-world tasks like inboxes or e-commerce accounts.
The talk is candid about unsolved problems, including the difficulty of using AI judges to evaluate creativity and innovation, and the challenge of aligning human expert reviewers. For teams building or evaluating agents, the Standardized Agent Exams product is the most immediately actionable takeaway.
📺 Source: AI Engineer · Published May 25, 2026
🏷️ Format: Keynote Launch







