Real World AI Evaluations

Descriptions:

Artificial Analysis has built an open evaluation harness on top of OpenAI’s GDPval benchmark, which measures AI performance on economically meaningful knowledge-work tasks across 44 occupations. The harness uses an AI grading pipeline, so the benchmark can be run on any LLM at scale. In the initial run, Anthropic’s Claude Opus 4.5 topped the leaderboard at a cost of $68, followed by GPT-5 in second, Claude Sonnet 4.5 in third, GPT-5.1 in fourth (using half the tokens with a slight quality drop), and DeepSeek V3.2 and Gemini 3 Pro tied for fifth. DeepSeek V3.2 stood out for cost efficiency, completing the benchmark for $29, roughly one-twentieth the cost of Opus.

The episode’s biggest story, however, is a report from The Information claiming that DeepSeek built a Blackwell GPU training cluster using chips smuggled into China. According to six sources with knowledge of the matter, Nvidia servers were delivered to data centers in third countries, inspected for export compliance, then dismantled and transported into China as individual components. Nvidia disputed the report but said it would “pursue any tip we receive.” If accurate, this would mark the first confirmed instance of a Chinese lab building a commercial-scale training cluster on export-banned hardware, a significant escalation in the chip war.

The episode closes with two shorter items: Beijing is holding emergency meetings with Alibaba, ByteDance, and Tencent to assess demand for H200 imports, and ChatGPT is approaching 900 million weekly active users.


📺 Source: The AI Daily Brief: Artificial Intelligence News · Published December 15, 2025
🏷️ Format: News Analysis
