Description:
Aman Khan, AI Product Manager at Arize, delivers a workshop-style talk at AI Engineer on building a practical evaluation framework for AI product managers. Drawing on his background running evaluation systems for self-driving cars at Cruise, recommendation systems at Spotify, and now at Arize — a platform used by companies like Uber, Instacart, Reddit, and Duolingo — Khan guides attendees through the concepts and tooling needed to ship AI that actually works in production.
The session covers what evals are, why they matter, and how to move beyond manual spreadsheet-based annotation toward automated, scalable evaluation pipelines. Khan walks through building a multi-agent AI trip planner live, then demonstrates evaluating it using the Arize platform — running A/B prompt comparisons across a dataset of 12 examples using an LLM-as-judge approach. He introduces prompt versioning in Arize as a GitHub-style store for prompt iterations, allowing teams to track and deploy specific prompt versions through code.
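The A/B, LLM-as-judge comparison described above is straightforward to sketch. The snippet below is a minimal, generic illustration of the pattern, not Arize's SDK: the judge model, judge prompt, dataset fields, and outputs are all illustrative assumptions.

```python
# Minimal LLM-as-judge A/B comparison, assuming two prompt versions have
# already produced outputs for each example. NOT Arize's API; the model
# name, prompt, and dataset shape are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are judging two trip-planner responses to the same request.
Request: {request}
Response A: {response_a}
Response B: {response_b}
Reply with exactly one letter: A if Response A is better, B otherwise."""

def judge(request: str, response_a: str, response_b: str) -> str:
    """Ask the judge model which prompt version produced the better answer."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                request=request, response_a=response_a, response_b=response_b
            ),
        }],
    )
    return completion.choices[0].message.content.strip()

# A 12-example dataset like the one in the demo; outputs elided here.
dataset = [
    {"request": "Plan a 3-day food tour of Tokyo", "v1": "...", "v2": "..."},
    # ... 11 more examples
]

wins = {"A": 0, "B": 0}
for example in dataset:
    verdict = judge(example["request"], example["v1"], example["v2"])
    if verdict in wins:
        wins[verdict] += 1

print(f"prompt v1 wins: {wins['A']}  |  prompt v2 wins: {wins['B']}")
```

A production pipeline would also randomize which output is shown as Response A, since LLM judges are known to favor responses by position.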
Key takeaways include treating evals as human annotation scaled up rather than replaced, the importance of observability for agent-based applications, and a practical Dunning-Kruger framing for where AI PMs sit on the learning curve. The talk is especially relevant for product teams at companies beginning to move from prototype to production AI systems.
📺 Source: AI Engineer · Published December 26, 2025
🏷️ Format: Tutorial Demo
