20 days of compute vs 7 hours: rethinking what state-of-the-art means — Bertrand Charpentier, Pruna

Descriptions:

Bertrand Charpentier, from the AI model optimization company Pruna, challenges the conventional shortcuts practitioners use to identify the best AI model for a given task. His central argument is that both public leaderboards and internal evaluation methods are systematically unreliable when applied naively, and that the gap between ‘state-of-the-art’ on a benchmark and ‘best for your use case’ is almost always larger than teams expect.

On the leaderboard side, Charpentier compares image editing model rankings from LM Arena (now Arena), Design Arena, and Artificial Analysis. He shows that the top-ranked model differs across all three, that the relative order of the same model pairs flips depending on the platform, and that Elo score ranges vary so widely between leaderboards that comparing raw scores across platforms is essentially meaningless. He also demonstrates that ChatGPT Image, often cited as the top image editor, does not consistently rank first on the task-specific sub-leaderboards covering object removal, background replacement, and text editing.
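The pairwise flips described above can be checked mechanically once scores from two leaderboards are in hand. The following is a minimal sketch, not the method used in the talk: the function names and the scores are hypothetical placeholders, and only the relative order within each leaderboard is used, since raw Elo values are not comparable across platforms.

```python
from itertools import combinations

def ranks(scores):
    """Map each model name to its rank (1 = best) by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}

def flipped_pairs(board_a, board_b):
    """Return model pairs whose relative order differs between two leaderboards."""
    shared = sorted(set(board_a) & set(board_b))
    ra, rb = ranks(board_a), ranks(board_b)
    # A negative product means one board ranks m above n and the other ranks n above m.
    return [(m, n) for m, n in combinations(shared, 2)
            if (ra[m] - ra[n]) * (rb[m] - rb[n]) < 0]

# Placeholder scores for illustration only, not real leaderboard data.
board_a = {"model_x": 1210, "model_y": 1180, "model_z": 1150}
board_b = {"model_x": 1020, "model_y": 1065, "model_z": 1000}
print(flipped_pairs(board_a, board_b))
```

With these placeholder values, only the `model_x` / `model_y` pair changes order, even though the two score scales differ by roughly 150 points.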

For internal evaluation, Charpentier runs a live audience preference experiment using images from Stable Diffusion, Flux 1, and a Pruna-developed model to show that manual inspection is doubly biased: by individual aesthetic preferences and by the specific sample set chosen. He advocates for scaled human evaluation combined with task-specific automated metrics, and argues that teams should evaluate models under their own production conditions rather than relying on aggregated public scores.
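To illustrate why a handful of manually inspected samples is fragile, here is a minimal sketch of a bootstrap confidence interval on a pairwise preference win rate. It is an assumption about how one might quantify the sample-set bias described above, not a procedure taken from the talk; `win_rate_ci` and the vote lists are hypothetical.

```python
import random

def win_rate_ci(votes, n_resamples=2000, alpha=0.05, seed=0):
    """Estimate a win rate and a bootstrap percentile interval.

    votes: list of 1 (model A preferred) or 0 (model B preferred).
    Returns (observed_rate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    n = len(votes)
    # Resample the votes with replacement and record each resampled win rate.
    rates = sorted(sum(rng.choices(votes, k=n)) / n for _ in range(n_resamples))
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(votes) / n, lo, hi

# Placeholder votes: the same 60% preference at two sample sizes.
small = [1] * 6 + [0] * 4
large = [1] * 600 + [0] * 400
print(win_rate_ci(small))
print(win_rate_ci(large))
```

The observed rate is identical in both cases, but the interval for ten votes is far wider than the interval for a thousand, which is the practical argument for scaling human evaluation before declaring a winner.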


📺 Source: AI Engineer · Published June 01, 2026
🏷️ Format: Deep Dive
