20 days of compute vs 7 hours: rethinking what state-of-the-art means — Bertrand Charpentier, Pruna


Descriptions:

Bertrand Charpentier, from the AI model optimization company Pruna, challenges the conventional shortcuts practitioners use to identify the best AI model for a given task. His central argument is that both public leaderboards and internal evaluation methods are systematically unreliable when applied naively, and that the gap between ‘state-of-the-art’ on a benchmark and ‘best for your use case’ is almost always larger than teams expect.

On the leaderboard side, Charpentier compares image editing rankings from LM Arena (now Arena), Design Arena, and Artificial Analysis. The top-ranked model differs across all three, the relative order of the same model pairs flips depending on the platform, and Elo score ranges vary so widely between leaderboards that cross-platform comparisons are essentially meaningless. He also demonstrates that ChatGPT Image, often cited as the top image editor, is not consistently in first place on any task-specific sub-leaderboard covering object removal, background replacement, or text editing.
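As a rough illustration of this kind of cross-leaderboard check, the sketch below compares rankings, pairwise order flips, and score ranges across three boards. The board names, model names, and scores are all invented for illustration and are not figures from the talk; `ranking`, `flipped_pairs`, and `score_range` are hypothetical helper names.

```python
from itertools import combinations

# Hypothetical Elo-style scores. Every name and number here is made up
# for illustration; none of it comes from a real leaderboard.
leaderboards = {
    "board_a": {"model_x": 1250, "model_y": 1210, "model_z": 1190},
    "board_b": {"model_x": 1040, "model_y": 1075, "model_z": 1010},
    "board_c": {"model_x": 1410, "model_y": 1320, "model_z": 1500},
}

def ranking(scores):
    """Return model names sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

def flipped_pairs(scores_1, scores_2):
    """List model pairs whose relative order differs between two boards."""
    shared = sorted(set(scores_1) & set(scores_2))
    flips = []
    for a, b in combinations(shared, 2):
        # Opposite signs mean the two boards disagree on which model is ahead.
        if (scores_1[a] - scores_1[b]) * (scores_2[a] - scores_2[b]) < 0:
            flips.append((a, b))
    return flips

def score_range(scores):
    """Spread between the best and worst score on one board."""
    return max(scores.values()) - min(scores.values())

for name, scores in leaderboards.items():
    print(name, ranking(scores), "range:", score_range(scores))

for (n1, s1), (n2, s2) in combinations(leaderboards.items(), 2):
    print(n1, "vs", n2, "flipped pairs:", flipped_pairs(s1, s2))
```

Because each board's scores live on its own scale, only the within-board ordering and gaps are compared here; the raw numbers are never compared across boards.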

For internal evaluation, Charpentier runs a live audience preference experiment using images from Stable Diffusion, Flux 1, and a Pruna-developed model to show that manual inspection is doubly biased: by individual aesthetic preferences and by the specific sample set chosen. He advocates for scaled human evaluation combined with task-specific automated metrics, and argues that teams should evaluate models under their own production conditions rather than relying on aggregated public scores.
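One way to see why scale matters for human preference evaluation is to put a confidence interval on a pairwise win rate. The sketch below uses the standard Wilson score interval; the vote counts are hypothetical, and this is not presented as the method used in the talk.

```python
import math

def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a win rate (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("need at least one vote")
    p = wins / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical vote counts: the same 60% preference for model A over
# model B, observed once with 10 votes and once with 1000 votes.
for wins, n in [(6, 10), (600, 1000)]:
    low, high = wilson_interval(wins, n)
    verdict = "inconclusive" if low <= 0.5 <= high else "clear preference"
    print(f"{wins}/{n}: [{low:.3f}, {high:.3f}] -> {verdict}")
```

With only a handful of votes the interval still contains 0.5, so the apparent winner could be noise; with many votes the same win rate becomes a distinguishable preference.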

📺 Source: AI Engineer · Published June 01, 2026
🏷️ Format: Deep Dive

