Artificial Analysis: The Independent LLM Analysis House — with George Cameron and Micah Hill-Smith

Description:

George Cameron and Micah Hill-Smith, co-founders of Artificial Analysis, join Latent Space to discuss how their independent AI benchmarking company has grown in two years from a free website into a 20-person organization serving enterprises and AI companies with subscription benchmark reports and custom private evaluations. The conversation covers their business model and their firm commitment to keeping public benchmarks independent — no company pays to appear in their rankings.

On the technical side, Cameron and Hill-Smith walk through several of their evaluations, including a newly launched hallucination rate benchmark and a hard physics benchmark where the top-performing model scores just 9%. They share a notable cross-lab finding: Gemini 3 Pro showed a large accuracy improvement over prior Gemini models but essentially no change in hallucination rate, in contrast to Claude model families — a pattern they attribute to different post-training recipes across labs rather than any difference in underlying capability.

The discussion also explores methodological challenges in independent AI evaluation: how to get the industry to converge on shared metrics when every lab publishes self-reported numbers on their own system cards, why public benchmarks now saturate within months of release, and when hallucination is actually desirable — such as physics researchers using high-temperature sampling to explore novel hypotheses. For developers and enterprise teams navigating model selection, the episode offers both a portrait of the most systematic independent LLM evaluation operation and a substantive look at the current state of AI measurement.

📺 Source: Latent Space · Published January 09, 2026
🏷️ Format: Interview
