Benchmarks - Frontier Models

There are 70 items on this page

08:18

Benchmarks · 3 days ago

Qwopus 35B + MTP: The Coder That Fixes Its Own Bugs at 160 tok/s

Fahd Mirza tests Qwopus Coder, a 35-billion-parameter mixture-of-experts coding model built on the Qwen 3.6 architecture (3B paramete...

25:57

Benchmarks · 4 days ago

I benchmarked the NEW Sonnet 5. The results shocked me.

How I AI introduces the How I AI Bench, a repeatable, multi-dimensional evaluation framework built with Claude Code, and runs Claude...

30:52

Benchmarks · 5 days ago

Frontier results, on device – RL Nabors, Arize

Rachel Lee Nabors — formerly at Mozilla on Firefox DevTools, the W3C, Microsoft Edge, and the React team, now at Arize — presents a p...

13:57

Benchmarks · 5 days ago

Can Krea 2 Turbo Really Make Great Images in 8 Steps? ComfyUI Test

Veteran AI runs a structured eight-category evaluation of Krea 2 Turbo — the eight-step distilled image generation model released by...

14:08

Benchmarks · 7 days ago

Qwythos 9B: When You Train a Small Model on Claude Traces: Run Locally

Fahd Mirza introduces and benchmarks Qwythos 9B, a reasoning-focused open-source model fine-tuned on over 500 million tokens of Claud...

09:36

Benchmarks · 2 weeks ago

Qwen3.6 (REAP 90pct GGUF): The Brain-Damaged Model

Fahd Mirza takes a deep look at an aggressively pruned variant of Qwen 3.6 — a 35-billion-parameter mixture-of-experts model — compre...

18:17

Benchmarks · 2 weeks ago

VibeThinker 3B – Taking on Giant Models

Sam Witteveen digs into VibeThinker 3B, a small language model from Weibo AI Lab, the AI research arm of the Chinese social network...

08:20

Benchmarks · 2 weeks ago

LoopCoder – The 7B Model That Thinks Twice – Does it Beat Others?

LoopCoder V2 is a 7-billion-parameter open-source code model built on an unusual architectural idea: instead of stacking more transfo...

09:40

Benchmarks · 3 weeks ago

DFlash Just Got Faster: 4x Speed with 160 tok/s Locally

Fahd Mirza benchmarks DFlash with SGLang's new SpecV2 overlapping scheduler on an NVIDIA H100 80GB GPU, demonstrating a 4.3x throughp...

31:25

Benchmarks · 3 weeks ago

Claude Fable 5 BANNED: The First Model Agentic Engineers DON’T NEED

IndyDevDan covers two intertwined stories in this video: the sudden federal suspension of Claude Fable 5 and Mythos 5, and a detailed...

09:03

Benchmarks · 3 weeks ago

I tried to prove AI trading is BS and it backfired

The Algovibes channel set out to definitively disprove AI-powered crypto trading — and ended up with results more interesting than ex...

09:13

Benchmarks · 4 weeks ago

I Tested 100,000 Trading Strategies on 1,000 Stocks

Algovibes presents a large-scale systematic backtesting study covering the full Russell 1000 universe — 1,014 stocks, 66 technical tr...

14:35

Benchmarks · 4 weeks ago

Google QAT vs Unsloth Q4_0 – Which Gemma 4 12B Quantization Is Better?

Fahd Mirza runs a controlled comparison between two 4-bit quantized versions of Google's Gemma 4 12B model: Google's own QAT (quantiz...

12:30

Benchmarks · 4 weeks ago

Ideogram 4: World’s Best Text-to-Image Model? Let’s Test Locally

Fahd Mirza installs and tests Ideogram 4 locally, providing a candid assessment of its real-world hardware requirements and architect...

14:55

Benchmarks · 4 weeks ago

Gemma 4 12B on a 16GB Mac Mini Is Surprisingly Capable

Bart Slodyczka puts Google's newly released Gemma 4 12B model through its paces on a 16GB M4 Mac Mini — a practical test of what entr...

16:08

Benchmarks · 1 month ago

Benchmarking semantic code retrieval on Claude Code — Kuba Rogut, Turbopuffer

Kuba Rogut of Turbopuffer presents original benchmark results comparing three code retrieval strategies for Claude Code: the default...