GEMINI 3.1 PRO is the new era…

Description:

Wes Roth reviews Google DeepMind’s newly released Gemini 3.1 Pro, the core reasoning model powering the Gemini ecosystem, with a systematic look at its performance across several new agentic benchmarks that didn’t exist a year ago. The headline number: the ARC-AGI 2 abstract reasoning score jumped from 31.1% on Gemini 3 Pro to 77% on Gemini 3.1 Pro in roughly three months, a leap Roth frames as emblematic of how rapidly labs are improving on tasks designed to resist pattern-matching.

On BrowseComp, an OpenAI benchmark released in April 2025 that tests agents’ ability to find obscure, entangled facts through persistent web navigation, Gemini 3.1 Pro scores 85.9, edging past Claude Opus 4.6 (84) and GPT-5.2 (84) to take the current top position. Humans solve only ~29% of these tasks.

Apex Agents (January 2026) drops models into a simulated office environment where they must work across spreadsheets, emails, and Slack-style messaging to produce client-ready output. Gemini 3.1 Pro reaches 41% on the hardest category, showing rapid improvement but still falling short of the reliability needed for unsupervised deployment.

On Terminal Bench 2.0 (November 2025, developed with Stanford), which evaluates agents operating command-line interfaces in Docker sandboxes, Gemini 3.1 Pro scores 68.5, ahead of Opus 4.6’s 65.4 and GPT-5.2’s 64.7, and a large jump from Gemini 3 Pro’s 56.2. Roth notes that all of these benchmarks are under a year old, reflecting a field-wide pivot toward measuring autonomous task completion over conversational ability.

📺 Source: Wes Roth · Published February 19, 2026
🏷️ Format: Benchmark Test
