Description:
Suman Debnath, Principal Machine Learning Advocate at AWS, leads a hands-on workshop at the AI Engineer conference exploring vision-based document retrieval paired with a voice-driven query interface. The session centers on ColPali, a model that generates token-level visual embeddings for entire document pages, preserving the charts, diagrams, and layouts that traditional OCR-then-chunk pipelines discard. Retrieval is scored with a late interaction mechanism called MaxSim: each query token is compared against every stored page token, the highest similarity for each query token is kept, and those per-token maxima are summed to give each page its relevance score.
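To make the scoring concrete, here is a minimal NumPy sketch of MaxSim late interaction; the function names, array shapes, and the use of plain dot-product similarity are illustrative assumptions, not code from the session.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """MaxSim: for each query token, take its best match among the page's token
    embeddings, then sum those maxima to score the page.
    query_emb: (num_query_tokens, dim), page_emb: (num_page_tokens, dim)."""
    sim = query_emb @ page_emb.T           # pairwise token similarities, shape (Q, P)
    return float(sim.max(axis=1).sum())    # best page token per query token, summed

def rank_pages(query_emb: np.ndarray, pages: list[tuple[str, np.ndarray]]) -> list[tuple[str, float]]:
    """Rank (page_id, token_embeddings) pairs by MaxSim score, highest first."""
    scores = [(page_id, maxsim_score(query_emb, emb)) for page_id, emb in pages]
    return sorted(scores, key=lambda s: s[1], reverse=True)
```

Because every page keeps one embedding per token rather than a single pooled vector, fine-grained details such as a label inside a chart can still drive the match.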
Debnath demonstrates this end-to-end using a science textbook as the document corpus, Qdrant running locally in Docker as the vector database, and a voice interface layered on top through Strands Agent — an open-source AWS agentic framework released approximately two weeks before the talk. The agent handles tool-calling and orchestration, connecting spoken queries to retrieved visual page segments. Attendees follow along using a public GitHub repository that includes both a clean notebook and a pre-filled version with expected outputs.
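As a rough illustration of the agent layer, the sketch below wires a retrieval function into the Strands Agents SDK as a tool; the tool name, its body, and the example prompt are assumptions for illustration rather than the workshop's code (the SDK defaults to an Amazon Bedrock model, so AWS credentials would be needed to actually run it).

```python
from strands import Agent, tool

@tool
def search_textbook(question: str) -> str:
    """Return references to the textbook page images most relevant to the question."""
    # In the workshop flow this step would embed the query with ColPali and run a
    # MaxSim search against Qdrant; a static placeholder result is returned here.
    return "page_042.png, page_043.png"

# The agent decides when to call the tool and turns the retrieved pages into an answer;
# in the talk, the question arrives through the voice interface instead of a string.
agent = Agent(tools=[search_textbook])
agent("Which diagram explains photosynthesis?")
```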
The workshop is targeted at engineers already familiar with standard RAG pipelines who want to extend retrieval to visually rich documents — financial reports, scientific papers, compliance documentation — where text extraction alone loses critical information. Qdrant is highlighted as one of the few vector databases natively supporting the MaxSim late interaction pattern required for ColPali-style retrieval.
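A minimal sketch of what that native support looks like in the Qdrant Python client is shown below; the collection name, vector size (ColPali token embeddings are commonly 128-dimensional), and placeholder query vectors are assumptions, not taken from the workshop repository.

```python
from qdrant_client import QdrantClient, models

# Assumes the local Docker container from the workshop is listening on the default port.
client = QdrantClient(url="http://localhost:6333")

# Store one multivector (many token embeddings) per page and score it with MaxSim.
client.create_collection(
    collection_name="textbook_pages",  # assumed name
    vectors_config=models.VectorParams(
        size=128,
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM
        ),
    ),
)

# At query time the query's token embeddings are passed as a list of vectors;
# Qdrant applies MaxSim between them and each stored page's multivector.
hits = client.query_points(
    collection_name="textbook_pages",
    query=[[0.1] * 128, [0.2] * 128],  # placeholder query token embeddings
    limit=5,
)
```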
📺 Source: AI Engineer · Published December 06, 2025
🏷️ Format: Hands On Build







