The maturity phases of running evals — Phil Hetzel, Braintrust


Description:

Phil Hetzel, head of solutions engineering at Braintrust, walks through the maturity phases teams go through as they build out evaluation pipelines for AI agents. Drawing on his background at KPMG and Slalom Consulting — where he saw countless generative AI proofs of concept fail to reach production — Hetzel argues that the gap between a working demo and a production-ready agent almost always comes down to the quality of evals.

The talk outlines a progression that starts with pure human annotation (thumbs up/down with written justifications), moves into using those justifications to surface failure modes, and ultimately scales to LLM-as-judge techniques. A key insight is that evals are not unit tests: you can’t enumerate every possible input, so the goal is to identify the most important failure modes first and build targeted scorers around them rather than aiming for exhaustive coverage.
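
The progression described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the talk's or Braintrust's implementation: the `Annotation` record, its `failure_mode` label, `rank_failure_modes`, `make_judge_scorer`, and the `stub_judge` placeholder are all hypothetical names introduced here. The sketch tallies failure modes from thumbs-down annotations, then builds one targeted scorer for the most common one, with a stand-in where a real LLM judge call would go.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Annotation:
    """One human review of an agent output (hypothetical record shape)."""
    output: str                         # the agent response being reviewed
    thumbs_up: bool                     # reviewer verdict
    justification: str                  # reviewer's written reason
    failure_mode: Optional[str] = None  # label assigned while reading justifications


def rank_failure_modes(annotations: List[Annotation]) -> List[Tuple[str, int]]:
    """Count failure-mode labels on thumbs-down annotations, most common first."""
    counts = Counter(
        a.failure_mode for a in annotations if not a.thumbs_up and a.failure_mode
    )
    return counts.most_common()


def make_judge_scorer(
    failure_mode: str, judge: Callable[[str], str]
) -> Callable[[str], float]:
    """Build a scorer for one failure mode; `judge` is any prompt -> text callable."""
    def scorer(output: str) -> float:
        prompt = (
            f"Does this response exhibit the failure mode '{failure_mode}'? "
            f"Answer YES or NO.\n\nResponse:\n{output}"
        )
        verdict = judge(prompt).strip().upper()
        # 1.0 means the failure mode was not detected, 0.0 means it was.
        return 0.0 if verdict.startswith("YES") else 1.0
    return scorer


def stub_judge(prompt: str) -> str:
    """Placeholder for a real LLM call: flags any prompt that mentions a refund."""
    return "YES" if "refund" in prompt.lower() else "NO"


# Made-up example annotations, for illustration only.
annotations = [
    Annotation("Your refund is on its way.", False,
               "Promised a refund without checking policy", "unsupported promise"),
    Annotation("Sure, I cancelled it.", False,
               "Claimed an action it never took", "unsupported promise"),
    Annotation("I can't help with that.", False,
               "Refused a valid request", "over-refusal"),
    Annotation("Here is your order status.", True, "Correct and concise"),
]

ranked = rank_failure_modes(annotations)
top_mode = ranked[0][0]  # the failure mode to target first
scorer = make_judge_scorer(top_mode, stub_judge)
```

Ranking first and writing a scorer only for the top failure mode mirrors the point that evals are not unit tests: coverage grows one important failure mode at a time instead of trying to enumerate every input.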

Hetzel also introduces the Braintrust platform’s annotation tooling — including custom annotation views and the ability to live-code evaluation interfaces — and closes with a forward-looking take on where agentic observability is heading. The session is practical rather than product-focused, making it useful for any team trying to graduate from ad hoc testing to a repeatable eval discipline.

📺 Source: AI Engineer · Published May 27, 2026
🏷️ Format: Deep Dive

Tags: AI Engineer, Braintrust, Codex, Cursor, Databricks, KPMG
