Description:
Fahd Mirza demonstrates a hands-on installation of Nvidia’s Nemotron ColEmbed V2, a multimodal embedding model capable of searching through images, PDFs, screenshots, charts, and infographics using plain natural-language queries. The model ranks third on a leading visual document retrieval benchmark and is designed as a foundation for multimodal RAG pipelines in enterprise environments where document collections mix rich visual content with text.
ColEmbed V2 is built on Qwen3-VL (4.8 billion parameters) and uses a three-part architecture: a SigLIP 2 vision encoder, an MLP vision-language merger, and an LLM backbone. Its key technical differentiator is a ColBERT-style late-interaction approach: rather than collapsing an entire image into a single embedding vector, it generates multiple vectors across distinct image regions, enabling fine-grained matching between specific query terms and specific visual patches within a document. Version 2 adds advanced model merging (combining multiple fine-tuned checkpoints for ensemble-like accuracy without a loss of inference speed) and enhanced multilingual synthetic training data.
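The late-interaction scoring described above is typically implemented as "MaxSim": each query token vector is matched against its most similar document (image-patch) vector, and those maxima are summed. A minimal NumPy sketch of that idea (the function name and toy dimensions are illustrative, not from the model's API):

```python
import numpy as np

def late_interaction_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style MaxSim: for each query token vector, take its maximum
    cosine similarity over all document (image-patch) vectors, then sum."""
    # L2-normalize so dot products are cosine similarities
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T                        # (num_query_tokens, num_patches)
    return float(sim.max(axis=1).sum())  # best patch per query token, summed
```

Because every query token only needs its single best-matching patch, a document scores highly when each part of the query finds strong support somewhere in the image, which is what enables the fine-grained term-to-region matching.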
Mirza runs the installation on Ubuntu with an RTX 6000 GPU (48GB VRAM), deploying the 4-billion-parameter variant, which occupies roughly 10GB on disk. The live demo uses three AI-generated medical images — dermatology, histopathology, and ophthalmology — paired with six text-based diagnostic queries. The model computes similarity scores over 2,560-dimensional embeddings using ColBERT-style late interaction and correctly matches each query to the appropriate image, illustrating the practical use case of searching large medical or enterprise image repositories with natural language alone.
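The demo's query-to-image matching reduces to picking, for each text query, the image whose patch vectors yield the highest late-interaction score. A self-contained toy sketch of that retrieval loop (2-D vectors stand in for the model's 2,560-dimensional embeddings; the image names and values are made up for illustration):

```python
import numpy as np

def maxsim(query_vecs: np.ndarray, patch_vecs: np.ndarray) -> float:
    """Late-interaction score: sum of each query token's best cosine match."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = patch_vecs / np.linalg.norm(patch_vecs, axis=1, keepdims=True)
    return float((q @ d.T).max(axis=1).sum())

# Hypothetical multi-vector image embeddings (several patch vectors per image)
image_bank = {
    "dermatology":    np.array([[1.0, 0.0], [0.9, 0.1]]),
    "histopathology": np.array([[0.0, 1.0], [0.1, 0.9]]),
}

# Token vector(s) for a skin-lesion query, deliberately close to dermatology
query = np.array([[0.95, 0.05]])

# Each query is matched to the image with the highest score
best = max(image_bank, key=lambda name: maxsim(query, image_bank[name]))
print(best)  # → dermatology
```

In a real pipeline the vectors would come from the model's query and image encoders, and the same argmax-over-scores step would run against the full image repository.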
📺 Source: Fahd Mirza · Published February 28, 2026
🏷️ Format: Tutorial Demo
