So… My AI App Has Been Lying to Users (And How I Fixed It)

Description:

Chris Raroque walks through the real-world accuracy crisis he faced with Amy, his AI-powered calorie tracking app, where incorrect nutritional data from an AI backend was driving user cancellations. The video is built around production data, not toy examples: a Romanian cereal was returned at 140 calories when the correct value was 409 calories, and similar errors were common for international and niche food items. Rather than ad-hoc prompt tweaking, Raroque introduces a structured eval system using BrainTrust with a mixed dataset of synthetic and user-reported foods, verified ground-truth nutritional values, and a combination of rule-based scorers and LLM-as-a-judge functions.
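
To make the scorer side concrete, here is a minimal sketch of what a rule-based calorie scorer could look like. This is an illustration under assumptions, not code from the video or the BrainTrust SDK: the function name, the ground-truth table, and the 10% tolerance are hypothetical, and only the 409-calorie Romanian cereal figure comes from the episode.

```python
# Hypothetical rule-based scorer for calorie accuracy. Nothing here is from
# the video's actual implementation; names and the tolerance are illustrative.

GROUND_TRUTH_KCAL = {
    # food -> verified calories; the cereal value is the one cited in the video
    "romanian cereal": 409,
}

def score_calories(food: str, predicted_kcal: float, tolerance: float = 0.10) -> float:
    """Return 1.0 when the prediction falls within a fractional tolerance of
    the verified value; otherwise decay linearly toward 0 with relative error."""
    truth = GROUND_TRUTH_KCAL[food]
    rel_error = abs(predicted_kcal - truth) / truth
    if rel_error <= tolerance:
        return 1.0
    # A graded score (rather than pass/fail) lets near-misses show up in evals.
    return max(0.0, 1.0 - rel_error)

# The baseline failure from the video: the model returned 140 kcal vs. 409 correct.
print(score_calories("romanian cereal", 140))  # ~0.34, a clear failure
print(score_calories("romanian cereal", 395))  # within 10%, scores 1.0
```

An LLM-as-a-judge scorer would complement a rule like this by grading fuzzier properties, such as whether the matched food item is actually the one the user logged, which a pure numeric rule cannot check.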

The core of the video is a series of head-to-head experiments. The baseline used Perplexity Sonar (a search-augmented model). Subsequent attempts swapped in Gemini 2.5 Flash as a reasoning layer over Perplexity Search, tested a more expensive multi-step chain-of-thought approach, and ultimately tried Exa as an alternative search provider. The Exa + Gemini Flash combination scored 75% accuracy versus 55% for the same model architecture using Perplexity Search—a 20-percentage-point gain from changing only the search provider—while also cutting latency from 8.6 seconds to 4.5 seconds at roughly the same cost (~1 cent per call).
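
All of these experiments share one pipeline shape: a search step that gathers sources, and a reasoning step that reconciles them into a single answer. The sketch below shows that decoupling in generic Python; `search_provider` and `reasoning_model` are hypothetical stand-ins, not the real Perplexity, Exa, or Gemini SDK calls.

```python
# Generic "search + reasoning layer" pipeline sketch. The provider callables
# are placeholders; real Exa/Perplexity/Gemini clients have different APIs.
from typing import Callable

def lookup_nutrition(
    food_query: str,
    search_provider: Callable[[str], list[str]],  # returns raw result snippets
    reasoning_model: Callable[[str], str],        # returns a reconciled answer
) -> str:
    # Step 1: fetch candidate nutrition sources for the food item.
    snippets = search_provider(f"nutrition facts per 100g: {food_query}")

    # Step 2: let the reasoning layer resolve conflicts across sources.
    prompt = (
        "Extract calories per 100 g for the food below from these sources. "
        "Prefer manufacturer or official label data over aggregators.\n\n"
        f"Food: {food_query}\n\nSources:\n" + "\n---\n".join(snippets)
    )
    return reasoning_model(prompt)

# Stub demo: because the two steps are decoupled, swapping Perplexity for Exa
# is a one-argument change, which is the experiment that moved 55% to 75%.
print(lookup_nutrition(
    "Romanian cereal",
    search_provider=lambda q: ["Label: 409 kcal per 100 g"],
    reasoning_model=lambda p: "409 kcal per 100 g",
))
```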

Raroque emphasizes that search provider quality is often an overlooked variable in RAG-style pipelines, and that providers update their underlying data frequently enough to warrant re-testing every few months. The episode is one of the more methodologically rigorous public examinations of AI accuracy engineering for consumer applications.


📺 Source: Chris Raroque · Published April 07, 2026
🏷️ Format: Workflow Case Study
