Descriptions:
Dat Ngo, AI architect at Arize AI, presents a structured framework for making LLM systems observable, evaluable, and improvable through experimentation. He draws on experience with large enterprise deployments that collectively process between 100 billion and 1 trillion tokens annually. The session is organized around three layers: observability, signal derivation, and experimentation.
On observability, Ngo explains how Arize AX is built on OpenTelemetry as its foundational telemetry standard, with auto-instrumenters that take one added line of code to emit traces and spans from any supported framework or SDK. He distinguishes between trace-level visibility (individual tool calls and agent steps), session-level visibility (multi-turn conversation state), and run-level visibility (batch pipeline outcomes), referencing the Anthropic managed agents paper released two days prior as relevant context.
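To make the three visibility levels concrete, here is a minimal stdlib-only sketch of how spans might be rolled up by trace, session, and run. It is not the Arize AX or OpenTelemetry API: the `Span` fields, the `group_by` helper, and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One unit of work, such as a tool call or agent step (hypothetical schema)."""
    span_id: str
    trace_id: str    # assumed: groups the spans of one request
    session_id: str  # assumed: groups the traces of one multi-turn conversation
    run_id: str      # assumed: labels one batch pipeline execution
    name: str
    ok: bool


def group_by(spans, key):
    """Bucket spans by the named attribute, preserving input order."""
    buckets = defaultdict(list)
    for span in spans:
        buckets[getattr(span, key)].append(span)
    return dict(buckets)


# Illustrative data only: three traces, two sessions, one run.
spans = [
    Span("s1", "t1", "sess-a", "run-1", "retrieve_docs", True),
    Span("s2", "t1", "sess-a", "run-1", "llm_call", True),
    Span("s3", "t2", "sess-a", "run-1", "tool:search", False),
    Span("s4", "t3", "sess-b", "run-1", "llm_call", True),
]

by_trace = group_by(spans, "trace_id")      # trace-level view
by_session = group_by(spans, "session_id")  # session-level view
by_run = group_by(spans, "run_id")          # run-level view

# A session is flagged as failed if any span inside it failed.
failed_sessions = sorted(
    sid for sid, group in by_session.items() if not all(s.ok for s in group)
)
```

In this sketch the failed tool call in trace `t2` surfaces at the session level as a problem with `sess-a`, which is the kind of roll-up that per-span inspection alone would not show.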
For evaluation, Ngo outlines five signal types: LLM-as-judge, human annotation, golden datasets, deterministic logic checks (e.g., schema validation), and business metrics. A key practical insight is that fixing one failure in a non-deterministic system frequently introduces two or three regressions elsewhere, making regression suites and continuous eval harnesses essential. The talk also addresses organizational dynamics: how to divide prompt engineering and eval definition between AI engineers and non-technical domain experts within enterprise teams.
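The regression point can be illustrated with a small sketch that pairs a deterministic schema check with a before/after comparison over a fixed set of cases. This is an assumption-laden illustration, not the harness described in the talk: `REQUIRED_FIELDS`, `passes_schema`, `compare_runs`, and the sample outputs are hypothetical.

```python
import json

# Hypothetical output contract: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "confidence": float}


def passes_schema(raw_output: str) -> bool:
    """Deterministic check: output must be a JSON object with the required typed fields."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(k), t) for k, t in REQUIRED_FIELDS.items())


def compare_runs(baseline: dict, candidate: dict):
    """Return (fixed, regressed) case IDs between two runs over the same cases.

    A case missing from the candidate run is treated as failing.
    """
    base = {cid: passes_schema(out) for cid, out in baseline.items()}
    cand = {cid: passes_schema(candidate.get(cid, "")) for cid in baseline}
    fixed = sorted(cid for cid in base if not base[cid] and cand[cid])
    regressed = sorted(cid for cid in base if base[cid] and not cand[cid])
    return fixed, regressed


# Illustrative outputs before and after a prompt change.
baseline = {
    "case-1": '{"answer": "42", "confidence": 0.9}',
    "case-2": "not json",
    "case-3": '{"answer": "Paris", "confidence": 0.8}',
}
candidate = {
    "case-1": '{"answer": "42"}',                     # confidence field dropped
    "case-2": '{"answer": "ok", "confidence": 0.7}',  # the targeted fix
    "case-3": '{"answer": "Paris", "confidence": 1}', # integer, not float
}

fixed, regressed = compare_runs(baseline, candidate)
```

Here the change repairs `case-2` but breaks `case-1` and `case-3`, so a suite that only re-tested the original failure would report success while the overall pass rate dropped.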
📺 Source: AI Engineer · Published June 07, 2026
🏷️ Format: Deep Dive







