Description:
Yegor Denisov-Blanch from Stanford presents findings from a two-year, large-scale study tracking AI’s impact on software engineering productivity across real enterprise teams. The research uses a machine learning model trained to replicate panels of 10–15 independent human expert evaluators — scoring code commits on implementation time, maintainability, and complexity — enabling measurement at scale without manual review bottlenecks.
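For a concrete sense of that methodology, here is a minimal Python sketch of panel-style commit scoring. The three dimensions mirror those named in the talk; the field scales and the median-based aggregation are illustrative assumptions, not the study's actual model.

```python
# Minimal sketch of panel-style commit scoring (assumptions, not the study's model).
from dataclasses import dataclass
from statistics import median


@dataclass
class CommitScore:
    implementation_time_hours: float  # estimated effort to implement the change
    maintainability: float            # assumed 0-10 scale, higher is better
    complexity: float                 # assumed 0-10 scale, higher is more complex


def panel_estimate(evaluator_scores: list[CommitScore]) -> CommitScore:
    """Aggregate 10-15 independent evaluations into one panel-level estimate
    by taking the per-dimension median (a robustness assumption)."""
    return CommitScore(
        implementation_time_hours=median(s.implementation_time_hours for s in evaluator_scores),
        maintainability=median(s.maintainability for s in evaluator_scores),
        complexity=median(s.complexity for s in evaluator_scores),
    )
```

The trained model stands in for this panel at scale, so every commit can be scored without routing it through human reviewers.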
Key findings: a matched comparison of 46 AI-using teams against 46 non-AI teams shows a median 10% productivity lift as of mid-2025, with a widening gap between top and bottom performers. Critically, raw AI token usage correlates weakly with outcomes (R²=0.20), while a composite “environment cleanliness index” measuring test coverage, type annotations, documentation, modularity, and code quality correlates far more strongly (R²=0.40). Teams with messy codebases see poor returns even with heavy AI usage — and unchecked AI usage can accelerate codebase entropy.
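A rough illustration of the two measurements above, assuming the cleanliness index is an equal-weighted mean of its five components and that R² comes from a simple one-variable linear fit; both are assumptions for illustration, since the talk does not spell out the exact formulas.

```python
# Illustrative composite index and R-squared comparison (synthetic setup, not study data).
import numpy as np


def cleanliness_index(test_coverage: float, type_annotations: float,
                      documentation: float, modularity: float,
                      code_quality: float) -> float:
    """Each component is assumed normalized to [0, 1]; equal weights are assumed."""
    return float(np.mean([test_coverage, type_annotations, documentation,
                          modularity, code_quality]))


def r_squared(x: np.ndarray, y: np.ndarray) -> float:
    """R-squared of a one-variable least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Usage with hypothetical per-team arrays:
#   r_squared(token_usage, productivity_lift)        # weak fit; talk reports ~0.20
#   r_squared(cleanliness_scores, productivity_lift) # stronger; talk reports ~0.40
```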
Denisov-Blanch also introduces an open-source AI practices benchmark with five maturity levels ranging from zero AI use to full agentic orchestration, and illustrates how two business units with identical AI tool access and licensing showed dramatically different adoption rates and outcomes. The talk offers a data-driven framework for enterprise leaders who need to measure AI ROI in engineering, move beyond vanity metrics like token spend, and identify which cohort their organization is in before the productivity gap widens further.
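One possible encoding of that five-level scale is sketched below; only the endpoints (zero AI use and full agentic orchestration) are named in the talk, so the intermediate labels are placeholder assumptions.

```python
# Hypothetical encoding of the five-level AI practices maturity scale.
from enum import IntEnum


class AIMaturityLevel(IntEnum):
    NO_AI_USE = 0                    # named in the talk
    ASSISTED_AUTOCOMPLETE = 1        # assumed intermediate stage
    CHAT_DRIVEN_DEVELOPMENT = 2      # assumed intermediate stage
    SUPERVISED_AGENTS = 3            # assumed intermediate stage
    FULL_AGENTIC_ORCHESTRATION = 4   # named in the talk
```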
📺 Source: AI Engineer · Published December 11, 2025
🏷️ Format: Benchmark Test
