Descriptions:
Or Dagan from AI21 Labs walks through the practical challenge every production agent team faces: accuracy, cost, and latency cannot all be optimized at once, and improving one typically degrades another. His talk at AI Dev SF 2026 presents a structured, data-driven methodology for navigating this tradeoff space, using BrowseComp+ (a deep research benchmark) as the running example, where AI21 achieved a new state-of-the-art result.
The framework divides optimization into two buckets. Configuration-level work covers model selection, prompt tuning (including automated methods like DSPy), and tool composition, and quickly surfaces a Pareto frontier of best-performing configurations plotted against cost or latency. Scaling techniques go further: vertical scaling via longer reasoning chains or critique-repair loops, and horizontal scaling through best-of-N sampling and multi-model ensembles. The ensemble approach is particularly striking—using a diverse set of models of different sizes reduces cost by more than half and latency by 20% compared to running the best single model multiple times, because different models solve different subsets of tasks.
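To make the Pareto frontier idea concrete, here is a minimal Python sketch of how non-dominated configurations could be picked out of a set of evaluation results. It is not taken from the talk: the `Config` type, the `pareto_frontier` helper, and every configuration name and number below are hypothetical placeholders chosen only for illustration.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    accuracy: float  # higher is better
    cost: float      # lower is better (could equally be latency)


def pareto_frontier(configs):
    """Return the configs that no other config beats on both axes.

    A config is kept only if it is strictly more accurate than every
    config that costs the same or less. Exact duplicates are collapsed.
    """
    frontier = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, try the most
    # accurate config first so weaker ties are dropped.
    for c in sorted(configs, key=lambda c: (c.cost, -c.accuracy)):
        if c.accuracy > best_accuracy:
            frontier.append(c)
            best_accuracy = c.accuracy
    return frontier


# Hypothetical evaluation results (illustrative numbers, not from the talk).
configs = [
    Config("small-model", 0.52, 0.10),
    Config("small-model+tuned-prompt", 0.58, 0.12),
    Config("mid-model", 0.55, 0.40),
    Config("large-model", 0.70, 1.00),
    Config("large-model+extra-tool", 0.69, 1.30),
]

for c in pareto_frontier(configs):
    print(f"{c.name}: accuracy={c.accuracy:.2f}, cost={c.cost:.2f}")
```

In this made-up data, the mid-size model and the large model with an extra tool fall off the frontier because a cheaper configuration is at least as accurate. Swapping `cost` for latency gives the corresponding accuracy-versus-latency frontier.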
Dagan closes by acknowledging the combinatorial explosion that results from mixing models, tools, prompts, ensemble sizes, and execution strategies. He frames this as an automated optimization problem rather than a manual search, positioning AI21’s tooling as a solution to what currently costs teams months and thousands of dollars in manual experimentation. The session is dense with specific numbers and is one of the more technically rigorous agent optimization talks at the conference.
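A rough sense of that combinatorial explosion can be had by counting the configurations in a small search space. The sketch below is illustrative only: the dimension names follow the summary above, but every option listed is a hypothetical placeholder and none of the counts come from the talk.

```python
from itertools import product
from math import prod

# Hypothetical search space; each list holds placeholder options.
search_space = {
    "model": ["model-a", "model-b", "model-c", "model-d"],
    "prompt": ["baseline", "tuned", "auto-optimized"],
    "tools": ["search", "search+browse", "search+code", "all"],
    "ensemble_size": [1, 2, 4, 8],
    "execution": ["sequential", "parallel"],
}

# The number of distinct configurations is the product of the option counts.
total = prod(len(options) for options in search_space.values())

# Enumerating them explicitly gives the same count.
all_configs = list(product(*search_space.values()))

print(f"{total} configurations to evaluate")
```

Even with only a handful of options per dimension, the count reaches the hundreds, and each configuration needs a full benchmark run to score, which is why an exhaustive manual search quickly becomes impractical.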
📺 Source: DeepLearningAI · Published May 22, 2026
🏷️ Format: Deep Dive