Description:
Josh McGrath, a post-training researcher at OpenAI working on thinking models, joins Latent Space for a candid discussion of what has changed between GPT-4.1 and GPT-5.1 and why post-training has become the most active frontier of model improvement. Recorded at the end of 2025, the conversation covers the infrastructure complexity of reinforcement learning at scale: in pre-training the core challenge is token throughput, whereas in RL each task may carry its own grading system, external code dependencies, and failure modes that can only be diagnosed at midnight during a live run.
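To make concrete what "each task carries its own grading system" can look like, here is a minimal Python sketch of a per-task grader interface; the Grader protocol and the MathGrader and CodeGrader classes are illustrative assumptions for this description, not OpenAI's actual post-training infrastructure.

```python
# Hypothetical sketch of per-task graders in an RL post-training stack.
# Names (Grader, MathGrader, CodeGrader) are illustrative, not OpenAI's APIs.
from dataclasses import dataclass
from typing import Protocol
import subprocess


@dataclass
class GradeResult:
    reward: float          # scalar reward fed back to the RL optimizer
    diagnostics: str = ""  # free-form notes for debugging failed rollouts


class Grader(Protocol):
    def grade(self, prompt: str, completion: str) -> GradeResult: ...


class MathGrader:
    """Exact-match check against a known answer: a verifiable reward."""

    def __init__(self, answer: str) -> None:
        self.answer = answer

    def grade(self, prompt: str, completion: str) -> GradeResult:
        correct = completion.strip() == self.answer.strip()
        return GradeResult(reward=1.0 if correct else 0.0)


class CodeGrader:
    """Runs an external test command; external dependencies and timeouts
    are exactly the kind of failure mode that surfaces mid-run."""

    def __init__(self, test_cmd: list[str], timeout_s: float = 30.0) -> None:
        self.test_cmd = test_cmd
        self.timeout_s = timeout_s

    def grade(self, prompt: str, completion: str) -> GradeResult:
        try:
            proc = subprocess.run(
                self.test_cmd, input=completion, capture_output=True,
                text=True, timeout=self.timeout_s,
            )
            return GradeResult(
                reward=1.0 if proc.returncode == 0 else 0.0,
                diagnostics=proc.stderr[-500:],
            )
        except subprocess.TimeoutExpired:
            return GradeResult(reward=0.0, diagnostics="grader timed out")
```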
A central theme is reward signal quality as the underappreciated foundation of RL progress. McGrath argues that GRPO (introduced in the DeepSeek Math paper) and JPO are meaningful optimization advances, but the more fundamental breakthrough is identifying reward signals, such as mathematical correctness verification, that can be reliably checked against ground truth. Unlike human preference labels, verifiable rewards remove the ambiguity that makes RL noisy, and McGrath believes this insight is underrepresented in the published literature relative to its actual impact on model behavior.
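As a rough illustration of why ground-truth-checkable rewards are so clean to optimize against, the sketch below pairs a binary correctness check with GRPO-style group-relative advantages (each sampled completion's reward normalized against its group's mean and standard deviation); the checker, the toy prompt, and the group size are assumptions for the example, not details from the episode.

```python
# Sketch: verifiable reward + GRPO-style group-relative advantages.
# The checker and the sampled completions below are toy assumptions.
import statistics


def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Ground-truth checkable: 1.0 iff the final answer matches exactly.
    No human preference label, so no rater noise or disagreement."""
    return 1.0 if completion.strip() == reference_answer.strip() else 0.0


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style baseline: normalize each completion's reward against the
    mean and std of its own sampled group, instead of a learned critic."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]


# Toy usage: 4 sampled completions to "What is 17 * 3?", reference answer "51".
completions = ["51", "54", "51", "I think it's 50"]
rewards = [verifiable_reward(c, "51") for c in completions]
print(rewards)                             # [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))  # [1.0, -1.0, 1.0, -1.0]
```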
The episode also covers long-horizon task planning (framed in token budgets rather than wall-clock time for cleaner optimization targets), OpenAI’s shopping model experiments following Black Friday, and how Codex has transformed McGrath’s own daily workflow by compressing multi-hour implementation tasks to 15 minutes—creating a new productivity meta he is still adapting to. For anyone tracking how frontier labs approach post-training, this is a rare first-person account from inside one of the teams building these systems.
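The token-budget framing can be illustrated with a small sketch: a rollout loop that charges each step against a token allowance rather than a wall-clock deadline, so the stopping condition depends only on what the model produced. The generate_step callable and the default budget are hypothetical placeholders, not anything described in the episode.

```python
# Sketch: budgeting a long-horizon rollout in tokens rather than wall-clock time.
# generate_step() is a hypothetical stand-in for one model or tool-call step.
from typing import Callable, Tuple


def run_with_token_budget(
    generate_step: Callable[[str], Tuple[str, int, bool]],
    prompt: str,
    budget_tokens: int = 32_000,
) -> str:
    """Run steps until the task reports completion or the token budget is spent.
    Token cost is a property of the trajectory itself, so the same rollout
    costs the same budget regardless of hardware speed, which makes it a
    cleaner optimization target than elapsed seconds."""
    transcript = prompt
    spent = 0
    while spent < budget_tokens:
        output, tokens_used, done = generate_step(transcript)
        transcript += output
        spent += tokens_used
        if done:
            break
    return transcript
```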
📺 Source: Latent Space · Published December 31, 2025
🏷️ Format: Podcast
