Description:
Kyle Corbitt, founder of OpenPipe and now head of the serverless training team at CoreWeave following its acquisition of OpenPipe, joins Nathan Labenz on The Cognitive Revolution for a comprehensive practitioner's guide to reinforcement learning fine-tuning. The conversation opens with Labenz recounting his own history of supervised fine-tuning work (including contributions to the emergent misalignment paper) and his hesitation around RL, giving Corbitt a concrete set of premises to probe and update.
The technical core of the interview covers how RL differs from SFT in its weight-update mechanics, why that difference makes RL less susceptible to catastrophic forgetting, and what distinguishes DeepSeek's GRPO algorithm from earlier methods such as PPO. Corbitt walks through the improvements practitioners are layering on top of GRPO in industry settings today, explains why Chinese labs' use of LLM-as-judge in RL post-training pipelines is a more significant development than their SFT distillation strategies, and argues that compute availability, not algorithmic capability, is the primary constraint keeping Chinese frontier labs from closing the gap with American ones.
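To make the GRPO/PPO contrast concrete: PPO trains a separate value network to supply a baseline, while GRPO (as introduced in the DeepSeekMath paper) samples a group of G completions per prompt and normalizes each completion's reward against the group's own statistics. A minimal sketch of the advantage estimate, where r_i is the reward for completion i:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}
$$

Because the group mean serves as the baseline, no learned critic is needed, which accounts for much of GRPO's memory and compute savings over PPO.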
Practical guidance includes how to design and iterate on evaluation rubrics, the tradeoffs of per-task versus multi-task model training, and why reward hacking tends to be more tractable than feared in narrow-domain settings. Corbitt also discusses in detail how CoreWeave uses LoRA adapters to serve customers efficiently. The episode is one of the most technically grounded public treatments of production RL fine-tuning available, and is valuable for anyone weighing RL against continued SFT investment.
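For readers who know LoRA only by name, here is a minimal sketch of a LoRA-adapted linear layer in PyTorch. The class name and hyperparameters are illustrative assumptions, not CoreWeave's implementation; the point is that only the small rank-r matrices are trained and stored per adapter.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (illustrative sketch).

    Effective weight: W + (alpha / r) * B @ A, where A (r x in_features) and
    B (out_features x r) are the only trainable parameters.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the base model stays frozen

        # A starts near zero, B starts at exactly zero, so the adapter is a
        # no-op at initialization and learns a delta during training.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base projection plus the scaled low-rank correction.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```

Because an adapter is just the A/B matrices (a few megabytes at typical ranks), a provider can keep one frozen base model in GPU memory and hot-swap per-customer or per-task adapters, which is the efficiency argument referenced above.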
📺 Source: Cognitive Revolution “How AI Changes Everything” · Published May 01, 2026
🏷️ Format: Interview
