Description:
Josh McGrath, a post-training researcher at OpenAI working on thinking models, joins Latent Space for a candid discussion of what has changed between GPT-4.1 and GPT-5.1 and why post-training has become the most active frontier of model improvement. Recorded at the end of 2025, the conversation covers the infrastructure complexity of reinforcement learning at scale: in pre-training the core challenge is token throughput, whereas in RL each task may carry its own grading system, external code dependencies, and failure modes that can only be diagnosed at midnight during a live run.
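To make concrete what "each task carries its own grading system" can look like, here is a minimal Python sketch of a per-task grader interface; the Grader protocol and the MathGrader and CodeGrader classes are illustrative assumptions for this description, not OpenAI's actual post-training infrastructure.

```python
# Hypothetical sketch of per-task graders in an RL post-training stack.
# Names (Grader, MathGrader, CodeGrader) are illustrative, not OpenAI's APIs.
from dataclasses import dataclass
from typing import Protocol
import subprocess


@dataclass
class GradeResult:
    reward: float          # scalar reward fed back to the RL optimizer
    diagnostics: str = ""  # free-form notes for debugging failed rollouts


class Grader(Protocol):
    def grade(self, prompt: str, completion: str) -> GradeResult: ...


class MathGrader:
    """Exact-match check against a known answer: a verifiable reward."""

    def __init__(self, answer: str) -> None:
        self.answer = answer

    def grade(self, prompt: str, completion: str) -> GradeResult:
        correct = completion.strip() == self.answer.strip()
        return GradeResult(reward=1.0 if correct else 0.0)


class CodeGrader:
    """Runs an external test command; external dependencies and timeouts
    are exactly the kind of failure mode that surfaces mid-run."""

    def __init__(self, test_cmd: list[str], timeout_s: float = 30.0) -> None:
        self.test_cmd = test_cmd
        self.timeout_s = timeout_s

    def grade(self, prompt: str, completion: str) -> GradeResult:
        try:
            proc = subprocess.run(
                self.test_cmd, input=completion, capture_output=True,
                text=True, timeout=self.timeout_s,
            )
            return GradeResult(
                reward=1.0 if proc.returncode == 0 else 0.0,
                diagnostics=proc.stderr[-500:],
            )
        except subprocess.TimeoutExpired:
            return GradeResult(reward=0.0, diagnostics="grader timed out")
```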
A central theme is reward signal quality as the underappreciated foundation of RL progress. McGrath argues that GRPO (introduced in the DeepSeek Math paper) and JPO are meaningful optimization advances, but the more fundamental breakthrough is identifying reward signals, such as mathematical correctness verification, that can be reliably checked against ground truth. Unlike human preference labels, verifiable rewards remove the ambiguity that makes RL noisy, and McGrath believes this insight is underrepresented in the published literature relative to its actual impact on model behavior.
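As a rough illustration of why ground-truth-checkable rewards are so clean to optimize against, the sketch below pairs a binary correctness check with GRPO-style group-relative advantages (each sampled completion's reward normalized against its group's mean and standard deviation); the checker, the toy prompt, and the group size are assumptions for the example, not details from the episode.

```python
# Sketch: verifiable reward + GRPO-style group-relative advantages.
# The checker and the sampled completions below are toy assumptions.
import statistics


def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Ground-truth checkable: 1.0 iff the final answer matches exactly.
    No human preference label, so no rater noise or disagreement."""
    return 1.0 if completion.strip() == reference_answer.strip() else 0.0


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style baseline: normalize each completion's reward against the
    mean and std of its own sampled group, instead of a learned critic."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]


# Toy usage: 4 sampled completions to "What is 17 * 3?", reference answer "51".
completions = ["51", "54", "51", "I think it's 50"]
rewards = [verifiable_reward(c, "51") for c in completions]
print(rewards)                             # [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))  # [1.0, -1.0, 1.0, -1.0]
```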
The episode also covers long-horizon task planning (framed in token budgets rather than wall-clock time for cleaner optimization targets), OpenAI’s shopping model experiments following Black Friday, and how Codex has transformed McGrath’s own daily workflow by compressing multi-hour implementation tasks to 15 minutes—creating a new productivity meta he is still adapting to. For anyone tracking how frontier labs approach post-training, this is a rare first-person account from inside one of the teams building these systems.
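The token-budget framing can be illustrated with a small sketch: a rollout loop that charges each step against a token allowance rather than a wall-clock deadline, so the stopping condition depends only on what the model produced. The generate_step callable and the default budget are hypothetical placeholders, not anything described in the episode.

```python
# Sketch: budgeting a long-horizon rollout in tokens rather than wall-clock time.
# generate_step() is a hypothetical stand-in for one model or tool-call step.
from typing import Callable, Tuple


def run_with_token_budget(
    generate_step: Callable[[str], Tuple[str, int, bool]],
    prompt: str,
    budget_tokens: int = 32_000,
) -> str:
    """Run steps until the task reports completion or the token budget is spent.
    Token cost is a property of the trajectory itself, so the same rollout
    costs the same budget regardless of hardware speed, which makes it a
    cleaner optimization target than elapsed seconds."""
    transcript = prompt
    spent = 0
    while spent < budget_tokens:
        output, tokens_used, done = generate_step(transcript)
        transcript += output
        spent += tokens_used
        if done:
            break
    return transcript
```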
📺 Source: Latent Space · Published December 31, 2025
🏷️ Format: Podcast
