So… My AI App Has Been Lying to Users (And How I Fixed It)

Description:

Chris Raroque walks through the real-world accuracy crisis he faced with Amy, his AI-powered calorie tracking app, where incorrect nutritional data from an AI backend was driving user cancellations. The video is built around production data, not toy examples: a Romanian cereal was returned at 140 calories when the correct value was 409 calories, and similar errors were common for international and niche food items. Rather than ad-hoc prompt tweaking, Raroque introduces a structured eval system using BrainTrust with a mixed dataset of synthetic and user-reported foods, verified ground-truth nutritional values, and a combination of rule-based scorers and LLM-as-a-judge functions.
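
To make the scorer side concrete, here is a minimal sketch of what a rule-based calorie scorer could look like. This is an illustration under assumptions, not code from the video or the BrainTrust SDK: the function name, the ground-truth table, and the 10% tolerance are hypothetical, and only the 409-calorie Romanian cereal figure comes from the episode.

```python
# Hypothetical rule-based scorer for calorie accuracy. Nothing here is from
# the video's actual implementation; names and the tolerance are illustrative.

GROUND_TRUTH_KCAL = {
    # food -> verified calories; the cereal value is the one cited in the video
    "romanian cereal": 409,
}

def score_calories(food: str, predicted_kcal: float, tolerance: float = 0.10) -> float:
    """Return 1.0 when the prediction falls within a fractional tolerance of
    the verified value; otherwise decay linearly toward 0 with relative error."""
    truth = GROUND_TRUTH_KCAL[food]
    rel_error = abs(predicted_kcal - truth) / truth
    if rel_error <= tolerance:
        return 1.0
    # A graded score (rather than pass/fail) lets near-misses show up in evals.
    return max(0.0, 1.0 - rel_error)

# The baseline failure from the video: the model returned 140 kcal vs. 409 correct.
print(score_calories("romanian cereal", 140))  # ~0.34, a clear failure
print(score_calories("romanian cereal", 395))  # within 10%, scores 1.0
```

An LLM-as-a-judge scorer would complement a rule like this by grading fuzzier properties, such as whether the matched food item is actually the one the user logged, which a pure numeric rule cannot check.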

The core of the video is a series of head-to-head experiments. The baseline used Perplexity Sonar (a search-augmented model). Subsequent attempts swapped in Gemini 2.5 Flash as a reasoning layer over Perplexity Search, tested a more expensive multi-step chain-of-thought approach, and ultimately tried Exa as an alternative search provider. The Exa + Gemini Flash combination scored 75% accuracy versus 55% for the same model architecture using Perplexity Search—a 20-percentage-point gain from changing only the search provider—while also cutting latency from 8.6 seconds to 4.5 seconds at roughly the same cost (~1 cent per call).
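
All of these experiments share one pipeline shape: a search step that gathers sources, and a reasoning step that reconciles them into a single answer. The sketch below shows that decoupling in generic Python; `search_provider` and `reasoning_model` are hypothetical stand-ins, not the real Perplexity, Exa, or Gemini SDK calls.

```python
# Generic "search + reasoning layer" pipeline sketch. The provider callables
# are placeholders; real Exa/Perplexity/Gemini clients have different APIs.
from typing import Callable

def lookup_nutrition(
    food_query: str,
    search_provider: Callable[[str], list[str]],  # returns raw result snippets
    reasoning_model: Callable[[str], str],        # returns a reconciled answer
) -> str:
    # Step 1: fetch candidate nutrition sources for the food item.
    snippets = search_provider(f"nutrition facts per 100g: {food_query}")

    # Step 2: let the reasoning layer resolve conflicts across sources.
    prompt = (
        "Extract calories per 100 g for the food below from these sources. "
        "Prefer manufacturer or official label data over aggregators.\n\n"
        f"Food: {food_query}\n\nSources:\n" + "\n---\n".join(snippets)
    )
    return reasoning_model(prompt)

# Stub demo: because the two steps are decoupled, swapping Perplexity for Exa
# is a one-argument change, which is the experiment that moved 55% to 75%.
print(lookup_nutrition(
    "Romanian cereal",
    search_provider=lambda q: ["Label: 409 kcal per 100 g"],
    reasoning_model=lambda p: "409 kcal per 100 g",
))
```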

Raroque emphasizes that search provider quality is often an overlooked variable in RAG-style pipelines, and that providers update their underlying data frequently enough to warrant re-testing every few months. The episode is one of the more methodologically rigorous public examinations of AI accuracy engineering for consumer applications.


📺 Source: Chris Raroque · Published April 07, 2026
🏷️ Format: Workflow Case Study
