Descriptions:
Adit Abraham, CEO of Reductto, presents at AI Dev 26 x SF on one of the most underappreciated bottlenecks in enterprise AI deployments: the quality of data fed to agents. Reductto, backed by Andreessen Horowitz and Benchmark with $108 million raised, has processed over 3 billion documents for clients including Fortune 10 companies, major hedge funds, and AI-native firms like Harvey, Scale AI, and Rogo.
Abraham walks through why PDFs remain stubbornly difficult to parse accurately despite decades of work on the problem, and frames the core tension between traditional computer vision OCR (deterministic, bounding-box-preserving, fast) and frontier language models (context-aware but prone to “reasoning” about content mid-extraction, such as computing a total instead of reading the one printed on the page). Reductto’s answer is a technique it calls agentic OCR, which applies speculative decoding, described as a single forward pass that identifies token-level corrections, to improve accuracy while preserving the structural characteristics that downstream agents depend on.
The talk also covers best practices for formatting extracted data: markdown works well for clean tabular data because LLMs reason over it efficiently, but complex layouts with merged cells or nested structures require different representations. Abraham argues that teams routinely stop at extraction and overlook how output format shapes agent performance. He closes with a forward-looking discussion of confidence scoring to triage documents for agent-in-the-loop versus human-in-the-loop review, and the path toward human-level performance on the hardest document understanding tasks.
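The confidence-based triage mentioned above can be illustrated with a minimal Python sketch. It is not Reductto's implementation: the `ExtractedField` type, the `route` function, the two threshold values, and the choice to route on the least confident field are all assumptions made for illustration, since the description gives no specifics.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the talk description does not state any values.
AUTO_ACCEPT = 0.95
AGENT_REVIEW = 0.70


@dataclass
class ExtractedField:
    """One extracted value with an assumed confidence score in [0, 1]."""
    name: str
    value: str
    confidence: float


def route(fields: list[ExtractedField]) -> str:
    """Route a document based on its least confident field (an assumed policy)."""
    if not fields:
        # Nothing was extracted, so send the document to a person.
        return "human_review"
    lowest = min(f.confidence for f in fields)
    if lowest >= AUTO_ACCEPT:
        return "auto_accept"
    if lowest >= AGENT_REVIEW:
        return "agent_review"
    return "human_review"


if __name__ == "__main__":
    doc = [
        ExtractedField("invoice_total", "1,204.50", 0.99),
        ExtractedField("due_date", "2026-06-01", 0.82),
    ]
    print(route(doc))
```

Routing on the minimum rather than the average keeps a single doubtful field from being hidden by many confident ones; a real system might instead weight fields by how much downstream agents depend on them.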
📺 Source: DeepLearningAI · Published May 20, 2026
🏷️ Format: Deep Dive







