Description:
Rhythm Garg and Linden Li, co-founders of Applied Compute and former OpenAI researchers, present a technical deep dive into building a fast, cost-predictable reinforcement learning stack for enterprise deployment. Unlike lab-scale RL runs that span weeks, Applied Compute targets training jobs that complete in days with low variance in delivery time, a business-critical requirement when working with enterprise customers on contracted timelines.
The core problem they diagnose is GPU idle time in synchronous RL: every sample in a training batch must complete before the next training step begins, so the slowest sample dictates step time. Their measurement is concrete: on a batch of 40 arithmetic problems with 32 samples each, using Qwen-30B, 99% of samples finished in roughly 40 seconds, but the final 1% took another 80 seconds, a long tail that leaves GPUs idle and wastes substantial compute. Their solution is asynchronous RL, which decouples sampling from training and allows the GPU budget to be split configurably between the two phases.
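
To make the tail cost concrete, here is a minimal Python sketch of the straggler effect. It is not Applied Compute's code; the latency distribution is an assumption calibrated to the figures quoted in the talk (99% of samples in ~40s, the last 1% taking up to ~120s total):

```python
# Minimal sketch of the synchronous-RL straggler problem. The latency
# distribution below is an illustrative assumption matching the talk's
# quoted figures, not measured data.
import random

def sample_latency() -> float:
    """Draw a per-sample generation latency with a heavy tail."""
    if random.random() < 0.99:
        return random.uniform(20.0, 40.0)   # typical samples finish by ~40s
    return random.uniform(40.0, 120.0)      # the 1% tail runs up to ~120s

def synchronous_step(batch_size: int = 40 * 32) -> tuple[float, float]:
    """One synchronous step: everyone waits for the slowest sample."""
    latencies = [sample_latency() for _ in range(batch_size)]
    step_time = max(latencies)
    # Fraction of sampler GPU-time spent idle, waiting on stragglers.
    idle_frac = 1.0 - sum(latencies) / (step_time * batch_size)
    return step_time, idle_frac

random.seed(0)
step_time, idle_frac = synchronous_step()
print(f"sync step time: {step_time:.0f}s, sampler idle fraction: {idle_frac:.0%}")
```

Under these assumptions the step is gated by the ~120-second straggler while the typical sample finishes in about 30 seconds, so the samplers sit idle for most of the step. Asynchronous RL recovers that time by letting each worker pull a new prompt the moment it finishes.
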
The talk then walks through the systems modeling required to optimize this allocation: modeling sampling throughput as a function of KV cache batch size, fitting latency curves as a function of inference batch size, and reasoning about how staleness tolerance (training on slightly out-of-date samples) trades off against throughput. Applied Compute’s approach is positioned as enabling enterprises to train use-case-specific models that improve over time via a data flywheel, delivering the kind of specialized reasoning capabilities previously available only to organizations with large-scale lab infrastructure.
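
As a hedged illustration of that modeling step, the sketch below fits an affine decode-latency curve, latency(b) ≈ a + c·b, to measured (batch size, latency) points and then picks the largest inference batch size whose latency stays under a staleness budget. The data points, the linear model, and the max_step_latency knob are illustrative assumptions, not numbers from the talk:

```python
# Hedged sketch: fit decode latency vs. inference batch size, then choose
# the largest batch whose per-step latency fits a staleness budget.
import numpy as np

# Hypothetical measured (batch_size, seconds_per_decode_step) pairs.
batch_sizes = np.array([1, 8, 32, 64, 128, 256])
latencies = np.array([0.020, 0.022, 0.028, 0.036, 0.055, 0.095])

# Decode latency is roughly affine in batch size once KV-cache reads
# dominate: latency(b) ~= a + c * b. Fit a and c by least squares.
c, a = np.polyfit(batch_sizes, latencies, deg=1)

def latency(b):
    return a + c * b

def throughput(b):
    """Tokens generated per second across the whole batch."""
    return b / latency(b)

# Staleness knob (assumed): cap per-step latency so generated samples are
# at most max_step_latency seconds behind the current trainer weights.
max_step_latency = 0.06
candidates = np.arange(1, 512)
feasible = candidates[latency(candidates) <= max_step_latency]
best = feasible[np.argmax(throughput(feasible))]
print(f"fit: latency(b) = {a:.4f} + {c:.6f} * b")
print(f"best batch under staleness cap: {best}, throughput ~ {throughput(best):.0f} tok/s")
```

Because throughput b / latency(b) rises monotonically toward 1/c while latency grows without bound, the staleness cap is what pins down the operating point; loosening it buys throughput at the cost of training on staler samples.
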
📺 Source: AI Engineer · Published December 09, 2025
🏷️ Format: Deep Dive
