Description:
Google’s release of Gemini 3.1 Pro is the focus of this video from Matthew Berman, which breaks down the model’s benchmark performance and real-world capabilities in detail. On ARC-AGI 2 (a test of rapid skill acquisition and generalization), Gemini 3.1 Pro scores 77.1%, more than double the score of its predecessor, Gemini 3 Pro, and ahead of Anthropic’s Opus 4.6 at 68.8%. Other headline numbers include 94.3% on GPQA Diamond (scientific knowledge), 80.6% on SWE-bench Verified (coding), 99.3% on T2Bench (agentic tool use), and 51.4% on Humanity’s Last Exam when run with a code environment, putting it squarely in competition with the top frontier models.
Beyond benchmarks, the video showcases dramatically improved SVG generation, with Google DeepMind Chief Scientist Jeff Dean demonstrating applications that include a geographic urban-planning simulator and a prompt-to-CAD-model tool. Berman also notes that Gemini Deep Think, released the prior week, was confirmed to run on Gemini 3.1 Pro under the hood, and that the model is rolling out across Google’s consumer and developer products.
Berman rounds out the analysis with a frank comparison with Anthropic’s Sonnet 4.6, which he calls his current favorite for knowledge work despite its high cost, and reflects on his brief stint using Gemini 3 Pro as his primary model. The overall takeaway is that Gemini 3.1 Pro is a top-tier model for complex reasoning tasks, though real-world usability will depend on hands-on testing beyond benchmarks.
📺 Source: Matthew Berman · Published February 20, 2026
🏷️ Format: Review
