Descriptions:
Phil Hetzel, head of solutions engineering at Braintrust, walks through the maturity phases teams pass through as they build evaluation pipelines for AI agents. Drawing on his background at KPMG and Slalom Consulting, where he saw countless generative AI proofs of concept fail to reach production, Hetzel argues that the gap between a working demo and a production-ready agent almost always comes down to the quality of evals.
The talk outlines a progression that starts with pure human annotation (thumbs up/down with written justifications), moves into using those justifications to surface failure modes, and ultimately scales to LLM-as-judge techniques. A key insight is that evals are not unit tests: you can’t enumerate every possible input, so the goal is to identify the most important failure modes first and build targeted scorers around them rather than aiming for exhaustive coverage.
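As a rough illustration of that progression, the sketch below ranks failure modes from thumbs-down annotations and then applies one targeted scorer to the most common one. It is not taken from the talk or from the Braintrust SDK: the `Annotation` schema, the `failure_mode` tags, the sample data, and the `[source:` convention are all hypothetical, and a simple string check stands in for what would likely be an LLM-as-judge call in practice.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    """One human review of an agent output (hypothetical schema)."""
    output: str
    thumbs_up: bool
    justification: str
    # Tag a reviewer derived from the written justification (assumed step).
    failure_mode: Optional[str] = None


def rank_failure_modes(annotations):
    """Count failure-mode tags on thumbs-down annotations, most common first."""
    counts = Counter(
        a.failure_mode
        for a in annotations
        if not a.thumbs_up and a.failure_mode is not None
    )
    return counts.most_common()


def cites_source_scorer(output: str) -> float:
    """Targeted scorer for a single failure mode ('missing_citation').

    A string heuristic stands in here for an LLM judge.
    """
    return 1.0 if "[source:" in output else 0.0


# Illustrative data only, not from the talk.
annotations = [
    Annotation("Refunds take 5 days [source: policy.md]", True, "Correct and cited"),
    Annotation("Refunds take 5 days", False, "No citation given", "missing_citation"),
    Annotation("Refunds are instant", False, "Wrong and uncited", "missing_citation"),
    Annotation("I cannot help with that", False, "Refused a valid request", "over_refusal"),
]

ranked = rank_failure_modes(annotations)
scores = [cites_source_scorer(a.output) for a in annotations]
print(ranked)
print(scores)
```

The point of the ranking step is prioritization: rather than writing a scorer for every conceivable input, a team would build scorers for the failure modes at the top of the list first and revisit the ranking as new annotations arrive.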
Hetzel also introduces the Braintrust platform’s annotation tooling, including custom annotation views and the ability to live-code evaluation interfaces, and closes with a forward-looking take on where agentic observability is heading. The session is practical rather than product-focused, making it useful for any team trying to graduate from ad hoc testing to a repeatable eval discipline.
📺 Source: AI Engineer · Published May 27, 2026
🏷️ Format: Deep Dive







