GEMINI 3.1 PRO is the new era…

Description:

Wes Roth reviews Google DeepMind’s newly released Gemini 3.1 Pro, the core reasoning model powering the Gemini ecosystem, with a systematic look at its performance across several new agentic benchmarks that didn’t exist a year ago. The headline number: the ARC-AGI 2 abstract reasoning score jumped from 31.1% on Gemini 3 Pro to 77% on Gemini 3.1 Pro in roughly three months, a leap Roth frames as emblematic of how rapidly labs are improving on tasks designed to resist pattern-matching.

On BrowseComp, an OpenAI benchmark released in April 2025 that tests agents’ ability to find obscure, entangled facts through persistent web navigation, Gemini 3.1 Pro scores 85.9, edging past Claude Opus 4.6 (84) and GPT-5.2 (84) to take the current top position. Humans solve only ~29% of these tasks.

Apex Agents (January 2026) drops models into a simulated office environment where they must work across spreadsheets, emails, and Slack-style messaging to produce client-ready output. Gemini 3.1 Pro reaches 41% on the hardest category, showing rapid improvement but still falling short of the reliability needed for unsupervised deployment.

On Terminal Bench 2.0 (November 2025, developed with Stanford), which evaluates agents operating command-line interfaces in Docker sandboxes, Gemini 3.1 Pro scores 68.5, ahead of Opus 4.6’s 65.4 and GPT-5.2’s 64.7, and a large jump from Gemini 3 Pro’s 56.2. Roth notes that all of these benchmarks are under a year old, reflecting a field-wide pivot toward measuring autonomous task completion over conversational ability.

📺 Source: Wes Roth · Published February 19, 2026
🏷️ Format: Benchmark Test
