Description:
Suman Debnath, Principal Machine Learning Advocate at AWS, leads a hands-on workshop at the AI Engineer conference exploring vision-based document retrieval paired with a voice-driven query interface. The session centers on ColPali, a model that generates token-level visual embeddings for entire document pages, preserving the charts, diagrams, and layouts that traditional OCR-then-chunk pipelines discard. Retrieval is scored with a late interaction mechanism called MaxSim: each query token is compared against every stored page token, the highest similarity for each query token is kept, and those per-token maxima are summed to give each page its relevance score.
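To make the scoring concrete, here is a minimal NumPy sketch of MaxSim late interaction; the function names, array shapes, and the use of plain dot-product similarity are illustrative assumptions, not code from the session.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """MaxSim: for each query token, take its best match among the page's token
    embeddings, then sum those maxima to score the page.
    query_emb: (num_query_tokens, dim), page_emb: (num_page_tokens, dim)."""
    sim = query_emb @ page_emb.T           # pairwise token similarities, shape (Q, P)
    return float(sim.max(axis=1).sum())    # best page token per query token, summed

def rank_pages(query_emb: np.ndarray, pages: list[tuple[str, np.ndarray]]) -> list[tuple[str, float]]:
    """Rank (page_id, token_embeddings) pairs by MaxSim score, highest first."""
    scores = [(page_id, maxsim_score(query_emb, emb)) for page_id, emb in pages]
    return sorted(scores, key=lambda s: s[1], reverse=True)
```

Because every page keeps one embedding per token rather than a single pooled vector, fine-grained details such as a label inside a chart can still drive the match.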
Debnath demonstrates this end-to-end using a science textbook as the document corpus, Qdrant running locally in Docker as the vector database, and a voice interface layered on top through Strands Agent — an open-source AWS agentic framework released approximately two weeks before the talk. The agent handles tool-calling and orchestration, connecting spoken queries to retrieved visual page segments. Attendees follow along using a public GitHub repository that includes both a clean notebook and a pre-filled version with expected outputs.
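As a rough illustration of the agent layer, the sketch below wires a retrieval function into the Strands Agents SDK as a tool; the tool name, its body, and the example prompt are assumptions for illustration rather than the workshop's code (the SDK defaults to an Amazon Bedrock model, so AWS credentials would be needed to actually run it).

```python
from strands import Agent, tool

@tool
def search_textbook(question: str) -> str:
    """Return references to the textbook page images most relevant to the question."""
    # In the workshop flow this step would embed the query with ColPali and run a
    # MaxSim search against Qdrant; a static placeholder result is returned here.
    return "page_042.png, page_043.png"

# The agent decides when to call the tool and turns the retrieved pages into an answer;
# in the talk, the question arrives through the voice interface instead of a string.
agent = Agent(tools=[search_textbook])
agent("Which diagram explains photosynthesis?")
```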
The workshop is targeted at engineers already familiar with standard RAG pipelines who want to extend retrieval to visually rich documents — financial reports, scientific papers, compliance documentation — where text extraction alone loses critical information. Qdrant is highlighted as one of the few vector databases natively supporting the MaxSim late interaction pattern required for ColPali-style retrieval.
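A minimal sketch of what that native support looks like in the Qdrant Python client is shown below; the collection name, vector size (ColPali token embeddings are commonly 128-dimensional), and placeholder query vectors are assumptions, not taken from the workshop repository.

```python
from qdrant_client import QdrantClient, models

# Assumes the local Docker container from the workshop is listening on the default port.
client = QdrantClient(url="http://localhost:6333")

# Store one multivector (many token embeddings) per page and score it with MaxSim.
client.create_collection(
    collection_name="textbook_pages",  # assumed name
    vectors_config=models.VectorParams(
        size=128,
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM
        ),
    ),
)

# At query time the query's token embeddings are passed as a list of vectors;
# Qdrant applies MaxSim between them and each stored page's multivector.
hits = client.query_points(
    collection_name="textbook_pages",
    query=[[0.1] * 128, [0.2] * 128],  # placeholder query token embeddings
    limit=5,
)
```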
📺 Source: AI Engineer · Published December 06, 2025
🏷️ Format: Hands On Build







