Descriptions:
Kobie Crawford, developer advocate at Snorkel AI, presents original research from the company’s frontier AI data lab that quantifies how task quality affects reinforcement learning (RL) outcomes for agentic models. The results make a strong empirical case for rigorous data curation over sheer volume.
Snorkel defines task quality for containerized agentic environments (using frameworks like Harbor and OpenEnv) across four criteria: tasks must be achievable, non-trivial, functionally correct, and run inside a reliably reproducible environment. Tasks passing all four are marked “accepted”; those failing any criterion are “rejected.” To test whether this acceptance filter actually predicts training value, Snorkel ran two parallel RL training runs, one on accepted tasks and one on rejected tasks, using the same base model, the same compute budget, and the same number of tasks in each run.
The performance gap was striking: training on low-quality (rejected) tasks produced roughly 1% improvement on held-out benchmarks, while training on high-quality (accepted) tasks produced approximately 6% improvement. That is roughly a 6x uplift from quality alone, with identical compute. Crawford argues this validates Snorkel’s founding thesis that data quality is the critical variable in model improvement, and that as the industry moves deeper into agentic RL pipelines with terminal-bench-style tasks, having human experts in the loop during task generation is not a luxury but a prerequisite for meaningful capability gains.
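The acceptance filter and the two-bucket comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `TaskReview` record, its field names, and the helper functions are hypothetical and are not Snorkel's schema or tooling. It assumes each task has already been reviewed against the four criteria, and it treats the reported gains as approximate inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskReview:
    """Hypothetical review record; field names are illustrative only."""
    task_id: str
    achievable: bool
    non_trivial: bool
    functionally_correct: bool
    reproducible_env: bool

    @property
    def accepted(self) -> bool:
        # A task is "accepted" only if it passes all four criteria.
        return all((
            self.achievable,
            self.non_trivial,
            self.functionally_correct,
            self.reproducible_env,
        ))


def split_equal_buckets(reviews):
    """Split tasks into accepted/rejected buckets trimmed to the same size."""
    accepted = [r for r in reviews if r.accepted]
    rejected = [r for r in reviews if not r.accepted]
    n = min(len(accepted), len(rejected))
    return accepted[:n], rejected[:n]


def uplift_ratio(accepted_gain: float, rejected_gain: float) -> float:
    """Ratio of benchmark gains between the two training runs."""
    return accepted_gain / rejected_gain


reviews = [
    TaskReview("t1", True, True, True, True),
    TaskReview("t2", True, True, True, True),
    TaskReview("t3", True, False, True, True),   # trivial task: rejected
    TaskReview("t4", True, True, True, False),   # flaky environment: rejected
    TaskReview("t5", True, True, True, True),
]
accepted, rejected = split_equal_buckets(reviews)
ratio = uplift_ratio(6.0, 1.0)  # approximate gains quoted in the talk summary
```

Trimming both buckets to the same size mirrors the controlled setup in the description, where only task quality differs between the two runs.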
📺 Source: AI Engineer · Published June 02, 2026
🏷️ Format: Deep Dive







