Description:
Kyle Corbitt, founder of OpenPipe and now head of the serverless training team at CoreWeave following its acquisition of OpenPipe, joins Nathan Labenz on The Cognitive Revolution for a comprehensive practitioner's guide to reinforcement learning fine-tuning. The conversation opens with Labenz recounting his own history of supervised fine-tuning work (including contributions to the emergent misalignment paper) and his hesitation around RL, giving Corbitt a concrete set of premises to probe and update.
The technical core of the interview covers how RL differs from SFT in its weight-update mechanics, why that difference makes RL less susceptible to catastrophic forgetting, and what distinguishes DeepSeek's GRPO algorithm from earlier methods such as PPO. Corbitt walks through the improvements practitioners are layering on top of GRPO in industry settings today, explains why Chinese labs' use of LLM-as-judge in RL post-training pipelines is a more significant development than their SFT distillation strategies, and argues that compute availability, not algorithmic capability, is the primary constraint keeping Chinese frontier labs from closing the gap with American ones.
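To make the GRPO/PPO contrast concrete: PPO trains a separate value network to supply a baseline, while GRPO (as introduced in the DeepSeekMath paper) samples a group of G completions per prompt and normalizes each completion's reward against the group's own statistics. A minimal sketch of the advantage estimate, where r_i is the reward for completion i:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}
$$

Because the group mean serves as the baseline, no learned critic is needed, which accounts for much of GRPO's memory and compute savings over PPO.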
Practical guidance includes how to design and iterate on evaluation rubrics, the tradeoffs of per-task versus multi-task model training, and why reward hacking tends to be more tractable than feared in narrow-domain settings. Corbitt also discusses in detail how CoreWeave uses LoRA adapters to serve customers efficiently. The episode is one of the most technically grounded public treatments of production RL fine-tuning available, and is valuable for anyone weighing RL against continued SFT investment.
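For readers who know LoRA only by name, here is a minimal sketch of a LoRA-adapted linear layer in PyTorch. The class name and hyperparameters are illustrative assumptions, not CoreWeave's implementation; the point is that only the small rank-r matrices are trained and stored per adapter.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (illustrative sketch).

    Effective weight: W + (alpha / r) * B @ A, where A (r x in_features) and
    B (out_features x r) are the only trainable parameters.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the base model stays frozen

        # A starts near zero, B starts at exactly zero, so the adapter is a
        # no-op at initialization and learns a delta during training.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base projection plus the scaled low-rank correction.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```

Because an adapter is just the A/B matrices (a few megabytes at typical ranks), a provider can keep one frozen base model in GPU memory and hot-swap per-customer or per-task adapters, which is the efficiency argument referenced above.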
📺 Source: Cognitive Revolution “How AI Changes Everything” · Published May 01, 2026
🏷️ Format: Interview
