Frontier results, on device – RL Nabors, Arize

Benchmarks · 5 days ago

Description:

Most of us reach for a frontier model by default and pay for it on every call: in latency, in energy, in cash, and in everything that leaves our stack. For most of those calls, a small local model would do the job.

RL Nabors, former Meta/React core team member and AWS alum, covers the vocabulary you need to reason about model performance (capability evals, golden datasets, LLM-as-judge) and walks through real cases: a local agentic harness replacing a frontier call, an in-browser moderation classifier defended with production-trace evals, and a generative summarization feature where the rubric turns out to be harder than the model. You’ll leave with a framework for deciding when to choose a large, off-prem model or a small, local one, and for measuring your way to the answer instead of guessing.

You will learn:

– The vocabulary to reason about model performance (capability evals, golden datasets, LLM-as-judge).
– A framework for deciding when a small or local model can replace a frontier one and when it can’t.
– A repeatable process for building capability evals from your own production traces, not someone else’s benchmark.
– Working examples of using eval results to iterate on prompts and ship with confidence instead of vibes.
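
As a rough illustration of the golden-dataset idea in the list above (not code from the talk), the sketch below scores two stand-in "models" against the same labeled examples and checks whether the small one lands within a tolerance of the frontier one. Everything here is hypothetical: the `GoldenExample` type, the toy keyword classifiers, the four moderation examples, and the `tolerance` parameter are assumptions made for illustration. In practice the examples would come from your own production traces and the callables would wrap real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GoldenExample:
    """One labeled case: an input and the answer we expect (hypothetical type)."""
    prompt: str
    expected: str


Model = Callable[[str], str]  # any function that maps a prompt to an answer


def pass_rate(model: Model, dataset: Sequence[GoldenExample]) -> float:
    """Fraction of golden examples the model answers exactly right."""
    if not dataset:
        raise ValueError("golden dataset is empty")
    passed = sum(
        1 for ex in dataset
        if model(ex.prompt).strip().lower() == ex.expected.lower()
    )
    return passed / len(dataset)


def small_model_is_good_enough(
    small: Model,
    frontier: Model,
    dataset: Sequence[GoldenExample],
    tolerance: float = 0.05,
) -> bool:
    """True if the small model's pass rate is within `tolerance` of the frontier's."""
    return pass_rate(small, dataset) >= pass_rate(frontier, dataset) - tolerance


# Toy golden dataset for a moderation task (made-up examples).
GOLDEN = [
    GoldenExample("You are an idiot", "flag"),
    GoldenExample("Thanks for the help!", "allow"),
    GoldenExample("I will find you", "flag"),
    GoldenExample("See you tomorrow", "allow"),
]


def toy_frontier(prompt: str) -> str:
    """Stand-in for a frontier model: catches both insults and threats."""
    text = prompt.lower()
    return "flag" if any(w in text for w in ("idiot", "find you")) else "allow"


def toy_small(prompt: str) -> str:
    """Stand-in for a small local model: only catches the insult."""
    return "flag" if "idiot" in prompt.lower() else "allow"


if __name__ == "__main__":
    print("frontier:", pass_rate(toy_frontier, GOLDEN))
    print("small:   ", pass_rate(toy_small, GOLDEN))
    print("good enough:", small_model_is_good_enough(toy_small, toy_frontier, GOLDEN))
```

Exact-match scoring only works for tasks with a closed set of answers, such as the classifier above. For open-ended output like summaries, the comparison inside `pass_rate` would be replaced by a rubric-based grader, for example an LLM-as-judge call.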

Speakers:
– RL Nabors (Arize): RL Nabors builds developer tools and the communities that make them stick. Previously React and MDN, currently developer experience at Arize, perpetually building Mima.
X/Twitter: https://x.com/rachelnabors
LinkedIn: https://linkedin.com/in/nearestnabors
GitHub: https://linkedin.com/in/nearestnabors

Channels

AI Engineer

Companies

Arize AI

Tags

Arize AI, Chrome, Claude Opus, Gemma 4 E2B, Google, Meta, Nvidia
