When All Context Matters: Extended Cache Augmented Generation – Luis Romero-Sevilla, Orbis

Luis Romero Sevilla, VP of AI at Orbis, introduces Extended Cache Augmented Generation (XCAG), a retrieval architecture designed for a specific but common hard case: a large document collection where every document is relevant to the query and the collection is replaced frequently. Standard RAG fails because it cannot retrieve all documents without overwhelming the context; GraphRAG fails because recomputing the knowledge graph on every data refresh is prohibitively expensive.

XCAG starts from Cache Augmented Generation (CAG), which loads documents into a large-context model’s KV cache, and extends it by distributing documents across multiple parallel context buckets rather than one. A supervisor model then interrogates each bucket, progressively building its understanding and issuing targeted follow-up questions to specific buckets when it finds relevant content. Because all caches load simultaneously, the architecture is substantially faster than GraphRAG while returning more accurate answers than single-pass RAG for globally relevant collections.
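The bucket-and-supervisor flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation presented in the talk: `Bucket`, `fill_buckets`, `keyword_ask`, and `supervise` are hypothetical names, plain in-memory lists stand in for KV caches, a keyword match stands in for a cached-context model call, and a thread pool stands in for querying all buckets at once.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Bucket:
    """Hypothetical stand-in for one model context with its docs preloaded in a KV cache."""
    docs: List[str] = field(default_factory=list)
    size: int = 0  # rough size, counted in whitespace-separated words


def fill_buckets(docs: List[str], budget: int) -> List[Bucket]:
    """Greedy first-fit: put each doc in the first bucket with room, else open a new one.

    A doc larger than the budget still gets a bucket of its own.
    """
    buckets: List[Bucket] = []
    for doc in docs:
        n = len(doc.split())
        target = next((b for b in buckets if b.size + n <= budget), None)
        if target is None:
            target = Bucket()
            buckets.append(target)
        target.docs.append(doc)
        target.size += n
    return buckets


def keyword_ask(bucket: Bucket, question: str) -> str:
    """Toy replacement for a model call: return docs that share a word with the question."""
    words = set(question.lower().split())
    hits = [d for d in bucket.docs if words & set(d.lower().split())]
    return " | ".join(hits)


def supervise(
    buckets: List[Bucket],
    question: str,
    ask: Callable[[Bucket, str], str] = keyword_ask,
    make_follow_up: Optional[Callable[[str, str], Optional[str]]] = None,
) -> List[Tuple[int, str]]:
    """Ask every bucket in parallel, then send follow-ups only to buckets that answered."""
    with ThreadPoolExecutor(max_workers=max(1, len(buckets))) as pool:
        first_pass = list(pool.map(lambda b: ask(b, question), buckets))

    notes: List[Tuple[int, str]] = []  # (bucket index, finding), in bucket order
    for i, answer in enumerate(first_pass):
        if not answer:
            continue  # nothing relevant here, so no follow-up is spent on this bucket
        notes.append((i, answer))
        follow_up = make_follow_up(question, answer) if make_follow_up else None
        if follow_up:
            extra = ask(buckets[i], follow_up)
            if extra:
                notes.append((i, extra))
    return notes


docs = [
    "alpha report on revenue",
    "beta memo on hiring",
    "gamma revenue forecast update",
]
buckets = fill_buckets(docs, budget=8)
notes = supervise(buckets, "revenue", make_follow_up=lambda q, a: "hiring")
```

In a real deployment the `ask` callable would presumably be a request to a model whose cache already holds that bucket's documents, and `make_follow_up` would be generated by the supervisor model from its accumulated notes rather than supplied as a fixed function.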

Romero Sevilla addresses cost concerns directly: KV cache is expensive, but optimizing cache lifetime can reduce the bill, and the tradeoff is favorable compared to GraphRAG’s repeated LLM-driven graph construction. The talk positions XCAG within a broader landscape of retrieval strategies, each with its own compute, cost, and speed tradeoffs, and argues that no single approach fits all scenarios. For teams dealing with dense, rapidly updated document sets, XCAG offers a practical middle path between the extremes of full-graph construction and simple vector similarity search.

📺 Source: AI Engineer · Published June 28, 2026
🏷️ Format: Deep Dive
