Description:
AI Explained’s Philip covers OpenAI’s GPT 5.2 release through nine data points designed to go beyond the headline claims, paying particular attention to what the benchmarks actually measure and where the numbers can mislead. The central throughline is test-time compute: performance on modern AI benchmarks is increasingly a function of how many tokens a model is allowed to spend thinking, which makes direct comparisons between models running at different compute budgets structurally unreliable.
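To see why mismatched thinking budgets can distort a comparison, here is a minimal sketch assuming a toy saturating accuracy curve. Every number in it (ceilings, scales, budgets) is invented for illustration and does not describe GPT 5.2, Gemini 3 Pro, or any real model.

```python
import math

def accuracy(budget_tokens: int, ceiling: float, scale: float) -> float:
    """Toy saturating curve: accuracy approaches `ceiling` as the budget grows."""
    return ceiling * (1 - math.exp(-budget_tokens / scale))

# Invented parameters for two hypothetical models; not measurements of any real model.
MODEL_A = dict(ceiling=0.80, scale=8_000)  # higher ceiling, needs more tokens to reach it
MODEL_B = dict(ceiling=0.75, scale=2_000)  # lower ceiling, cheap early gains

# Mismatched budgets, as in many release-day comparisons: A thinks 4x longer than B.
print(f"A@32k={accuracy(32_000, **MODEL_A):.3f}  B@8k={accuracy(8_000, **MODEL_B):.3f}")

# Matched budgets: the apparent winner depends entirely on where you sample the curve.
for budget in (4_000, 16_000, 64_000):
    a, b = accuracy(budget, **MODEL_A), accuracy(budget, **MODEL_B)
    print(f"@{budget:>6} tokens: A={a:.3f}  B={b:.3f}")
```

With these made-up curves, the model that looks well ahead under a mismatched comparison trails at small matched budgets and only pulls ahead at large ones, which is exactly why the video treats cross-budget comparisons as unreliable.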
On GDPval, OpenAI’s headline benchmark for professional knowledge work across 44 occupations, GPT 5.2 is reported to reach expert-level performance in 71% of comparisons. The video carefully unpacks the limitations: tasks must be predominantly digital, are well specified in advance, and the benchmark explicitly excludes catastrophic errors. On the presenter’s own private SimpleBench, GPT 5.2 scores 57.4% against Gemini 3 Pro’s 76.4% and a human baseline of roughly 84%. On chart reasoning via the CharXiv benchmark, GPT 5.2 leads at 88.7% versus Gemini 3 Pro’s 81%. Humanity’s Last Exam and GPQA Diamond are roughly tied between the two models at around 45-46%.
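For readers unsure what “expert-level performance in 71% of comparisons” means mechanically, here is a minimal sketch of a GDPval-style pairwise tally. The grade labels and the win-or-tie reading of the metric are assumptions for illustration, not OpenAI’s published scoring spec.

```python
from collections import Counter

# Invented pairwise grades for a handful of tasks; real GDPval grading is done
# blind by industry experts, and these labels are purely illustrative.
grades = ["model", "expert", "tie", "model", "model",
          "expert", "tie", "model", "expert", "model"]

counts = Counter(grades)
# "Expert-level in X% of comparisons" reads as wins plus ties over all
# comparisons; this is a simplified reading of the metric, not OpenAI's spec.
expert_parity = (counts["model"] + counts["tie"]) / len(grades)
print(f"expert-parity rate: {expert_parity:.0%}")  # 70% for this toy sample
```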
The video also flags that OpenAI did not compare GPT 5.2 against Claude Opus 4.5 or Gemini 3 Pro in its own release materials, a departure from the past practice the presenter had praised. A lead author of GPQA is quoted acknowledging that 5-10% of that benchmark’s questions may be noise. Practical testing on a football results spreadsheet task shows notable performance differences between the $200 Pro tier (GPT 5.2 Pro) and the standard version, raising the question of which model tier the headline benchmark numbers actually represent.
📺 Source: AI Explained · Published December 12, 2025
🏷️ Format: Benchmark Test
