Why building eval platforms is hard — Phil Hetzel, Braintrust

Description:

Phil Hetzel, head of solutions engineering at Braintrust, delivers a structured conference talk on the practical challenges of building evaluation platforms for AI agents, drawing on his vantage point across Braintrust’s entire customer base and twelve years of prior consulting work at KPMG and Slalom. The core argument: without rigorous evals, shipping AI agents to production carries serious brand, compliance, and operational risk, and most teams plateau at a stage of eval maturity that produces reports rather than enables real iteration.

Hetzel maps out a progression of eval platform stages: spreadsheet-based logging (fast to start, impossible to scale), custom-coded UIs backed by a lightweight hosted database such as Neon (more accessible, but still anchored in reporting), and finally true experimentation environments where both technical and non-technical stakeholders can compare agent configurations side by side and surface scored results. At each stage he identifies the specific bottleneck that prevents teams from moving faster, and he is candid about where the “vibe-coded eval UI” phase breaks down in practice.
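
To make the final “experimentation environment” stage concrete, here is a minimal sketch in plain Python of what it means to run two agent configurations over a shared eval dataset and surface scored results side by side. Every name here (run_agent, exact_match, the configs and dataset) is illustrative, not Braintrust’s API or anything from the talk:

```python
# Hypothetical sketch of the "experimentation" stage: run two agent
# configurations over the same eval dataset and compare scored results.
# All names are illustrative; this is not Braintrust's API.

from statistics import mean


def run_agent(config: dict, question: str) -> str:
    # Placeholder for a real agent call (e.g. an LLM request).
    return f"[{config['model']}] answer to: {question}"


def exact_match(output: str, expected: str) -> float:
    # Toy scorer: 1.0 on exact match, 0.0 otherwise.
    return 1.0 if output.strip() == expected.strip() else 0.0


dataset = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]

configs = {
    "baseline": {"model": "model-a", "temperature": 0.0},
    "candidate": {"model": "model-b", "temperature": 0.2},
}

# Score every configuration on every case so results line up row by row.
for name, config in configs.items():
    scores = [
        exact_match(run_agent(config, case["input"]), case["expected"])
        for case in dataset
    ]
    print(f"{name}: mean score = {mean(scores):.2f}")
```

A real experimentation environment adds exactly what this sketch lacks: persistence of every run, a UI over the row-by-row comparison, and scorers that non-engineers can author and review.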

A recurring theme is that evals are fundamentally a team sport. Domain experts and product stakeholders who are close to users carry knowledge that engineers can’t replicate, and the best eval platforms are designed to bring those voices in without requiring them to touch a spreadsheet or an SDK. The talk is a practical roadmap for any AI team navigating the gap between a working proof of concept and confident production deployment, with Braintrust’s platform serving as one concrete example of the destination.
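
As a point of reference for the SDK-driven destination the talk describes, Braintrust’s published quickstart pattern looks roughly like the following; the project name, data, and task are placeholders, and exact signatures should be checked against the current docs:

```python
# Sketch based on Braintrust's published quickstart pattern; verify
# details against current documentation before relying on it.

from braintrust import Eval
from autoevals import Levenshtein

Eval(
    "Say Hi Bot",  # placeholder project name
    data=lambda: [{"input": "Foo", "expected": "Hi Foo"}],  # eval cases
    task=lambda input: "Hi " + input,  # the "agent" under test
    scores=[Levenshtein],  # string-similarity scorer from autoevals
)
```

Per the docs, eval files like this are typically run through Braintrust’s CLI (`braintrust eval`), and each run lands as a scored experiment that non-technical stakeholders can compare in the UI rather than in a spreadsheet.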


📺 Source: AI Engineer · Published April 28, 2026
🏷️ Format: Deep Dive
