Multimodal RAG: Searching Images with Text

AI Research Engineer
Featured Guide 22 min read

"Find the screenshot of the billing error."

Standard RAG (Retrieval-Augmented Generation) only sees text. If your knowledge base is full of screenshots, PDFs with charts, or architectural diagrams, the retriever is blind to all of it.

Multimodal RAG solves this with models like CLIP, which embed images and text into a shared vector space where an image of a cat and the word "feline" land close together.

02. Joint Embedding Space

We use a multimodal embedding model. Images go through a Vision Encoder and text goes through a Text Encoder, and both encoders project their outputs into the same n-dimensional space, so an image vector and a text vector can be compared directly.
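
To make "compared directly" concrete, here is a minimal sketch of cosine similarity, the usual closeness measure in a joint embedding space. The three-dimensional vectors are made-up stand-ins for real encoder outputs, which have hundreds of dimensions.

```typescript
// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same dimensionality");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical toy vectors, NOT real model outputs.
const catImageVec = [0.9, 0.1, 0.0];    // pretend output of the vision encoder
const felineTextVec = [0.8, 0.2, 0.1];  // pretend output of the text encoder
const invoiceTextVec = [0.0, 0.1, 0.9]; // unrelated text

const simCatFeline = cosineSimilarity(catImageVec, felineTextVec);
const simCatInvoice = cosineSimilarity(catImageVec, invoiceTextVec);
```

With these toy values the cat image scores far higher against "feline" than against the unrelated text, which is the behavior a joint space is trained to produce.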

Deep Dive: The Modal Gap

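You cannot use OpenAI's text-embedding-3-small for text and a ResNet for images. The two models were trained independently, so their vectors live in unrelated coordinate systems (and usually have different dimensions). Distances between them are meaningless.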

You must use a model trained with Contrastive Learning (like CLIP or SigLIP), which forces an image of a dog and the text "A photo of a dog" to be close in vector space.
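
As a rough illustration of what "forces them to be close" means during training, the sketch below computes a symmetric softmax-based contrastive loss of the kind popularized by CLIP (SigLIP uses a sigmoid variant instead). The similarity matrix and the temperature value are illustrative assumptions, not values taken from any particular model.

```typescript
// sims[i][j] = similarity between image i and text j in a training batch.
// Matching image/text pairs sit on the diagonal.
function contrastiveLoss(sims: number[][], temperature = 0.07): number {
  const n = sims.length;

  // Cross-entropy of one row of similarities against the correct index.
  const rowLoss = (row: number[], target: number): number => {
    const logits = row.map((s) => s / temperature);
    const maxLogit = Math.max(...logits); // subtract max for numerical stability
    const sumExp = logits.reduce((acc, l) => acc + Math.exp(l - maxLogit), 0);
    return maxLogit + Math.log(sumExp) - logits[target];
  };

  let imageToText = 0;
  let textToImage = 0;
  for (let i = 0; i < n; i++) {
    imageToText += rowLoss(sims[i], i);                 // image i should pick text i
    textToImage += rowLoss(sims.map((r) => r[i]), i);   // text i should pick image i
  }
  return (imageToText + textToImage) / (2 * n);
}

// Toy batches: one where matching pairs are most similar, one where they are not.
const alignedLoss = contrastiveLoss([[1, 0], [0, 1]]);
const misalignedLoss = contrastiveLoss([[0, 1], [1, 0]]);
```

The loss is near zero when each image is most similar to its own caption and large otherwise, so minimizing it pulls matching pairs together and pushes mismatched pairs apart.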

// ingestion.ts
const imgVector = await clip.embedImage('./chart.png');
await vectorDB.upsert({
  id: 'chart-1',
  values: imgVector,
  metadata: { type: 'image' }
});

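// search.ts
const queryVector = await clip.embedText("revenue growth graph");
// `vectorDB.query` is a placeholder; use your vector store's nearest-neighbor call
const results = await vectorDB.query({ vector: queryVector, topK: 3 });
// Returns the chart image because its vector is close to the query vector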
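
What the vector database does at query time can be sketched as a brute-force nearest-neighbor search in memory. The `StoredVector` shape, the `searchTopK` helper, and the two-dimensional vectors are hypothetical and chosen for illustration only. Production stores use approximate indexes rather than a full scan.

```typescript
interface StoredVector {
  id: string;
  values: number[];
  metadata: { type: "image" | "text" };
}

interface SearchHit {
  id: string;
  score: number;
}

// Cosine similarity; returns 0 if either vector has zero length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every stored vector against the query and keep the k best.
function searchTopK(store: StoredVector[], queryVector: number[], k: number): SearchHit[] {
  return store
    .map((item) => ({ id: item.id, score: cosine(item.values, queryVector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Toy data standing in for real embeddings.
const toyStore: StoredVector[] = [
  { id: "chart-1", values: [0.1, 0.9], metadata: { type: "image" } },
  { id: "cat-1", values: [0.9, 0.1], metadata: { type: "image" } },
  { id: "cat-guide", values: [0.85, 0.15], metadata: { type: "text" } },
];
const toyQuery = [0.2, 0.8]; // pretend embedding of "revenue growth graph"
const hits = searchTopK(toyStore, toyQuery, 2);
```

Because images and text share one space, the same search returns both kinds of item. Filter on `metadata.type` if you only want images.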

04. The Senior Engineer's Take

Cost vs Value

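Multimodal embeddings are not free. Every image needs a forward pass through a vision encoder, and every vector (512 or 768 floats for common CLIP variants) adds to storage and index size.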
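
A back-of-the-envelope estimate helps when budgeting. The sketch below assumes uncompressed 32-bit floats and ignores index overhead and metadata, both of which add to the real figure. The corpus size and dimensionality are example numbers.

```typescript
// Raw storage for `count` vectors of `dims` dimensions.
// Assumes 4 bytes per value (float32); index structures are not included.
function estimateVectorStorageBytes(count: number, dims: number, bytesPerValue = 4): number {
  return count * dims * bytesPerValue;
}

// Example: one million 512-dimensional vectors.
const rawBytes = estimateVectorStorageBytes(1000000, 512);
const rawGigabytes = rawBytes / 1e9; // decimal gigabytes
```

One million 512-dimensional float32 vectors come to about 2 GB before any index overhead, so the count of vectors you choose to store matters more than any single vector's size.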

Pro Tip: Don't embed every frame of a video. Sample frames at a fixed interval (for example, one every 5 s) or extract keyframes at scene changes to capture the semantic content without blowing up your vector storage bill.
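
The saving is easy to quantify. This sketch only computes which timestamps to sample. Decoding the frames themselves is left to whatever video tooling you use, and the 30 fps figure is an assumed example.

```typescript
// Timestamps (in seconds) at which to grab a frame for embedding.
function sampleTimestamps(durationSeconds: number, intervalSeconds = 5): number[] {
  if (intervalSeconds <= 0) {
    throw new Error("intervalSeconds must be positive");
  }
  const timestamps: number[] = [];
  for (let t = 0; t < durationSeconds; t += intervalSeconds) {
    timestamps.push(t);
  }
  return timestamps;
}

// A 60-second clip sampled every 5 seconds.
const sampled = sampleTimestamps(60, 5);

// For comparison: embedding every frame at an assumed 30 frames per second.
const everyFrameCount = 60 * 30;
const reductionFactor = everyFrameCount / sampled.length;
```

For this clip, interval sampling yields 12 vectors instead of 1,800, a 150x reduction in embedding calls and stored vectors.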

ColBERT & Late Interaction

If standard cosine similarity isn't precise enough, look into ColBERT (Late Interaction). It keeps all token vectors rather than compressing them into one, allowing for much finer-grained matching at the cost of higher storage and compute.
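
The late-interaction score itself is simple. For each query token vector, take its best match among the document's token vectors, then sum those maxima. The sketch below assumes the token vectors are already L2-normalized, so a dot product equals cosine similarity. The toy vectors are invented for illustration.

```typescript
function dotProduct(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Late-interaction (MaxSim) score: sum over query tokens of the
// maximum similarity to any document token.
function maxSimScore(queryTokens: number[][], docTokens: number[][]): number {
  if (docTokens.length === 0) return 0;
  let total = 0;
  for (const q of queryTokens) {
    let best = -Infinity;
    for (const d of docTokens) {
      best = Math.max(best, dotProduct(q, d));
    }
    total += best;
  }
  return total;
}

// Toy example: a two-token query against two documents.
const queryTokenVecs = [[1, 0], [0, 1]];
const docCoversBoth = [[1, 0], [0, 1]]; // has a good match for each query token
const docCoversOne = [[1, 0]];          // matches only the first query token

const scoreBoth = maxSimScore(queryTokenVecs, docCoversBoth);
const scoreOne = maxSimScore(queryTokenVecs, docCoversOne);
```

The document that matches both query tokens scores 2 and the partial match scores 1. A single pooled vector would blur that distinction, which is why late interaction costs more storage: one vector per token instead of one per document.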

Interactive Playground

import React, { useState } from 'react';

// 👁️ Multimodal Search Visualizer

export default function MultimodalDemo() {
    const [query, setQuery] = useState("");
    const [results, setResults] = useState([]);
    const [isSearching, setIsSearching] = useState(false);

    // Mock Database of "Vectors"
    const database = [
        { id: 1, type: 'image', content: 'cat_photo.jpg', vector: [0.9, 0.1], label: 'Photo of a sleeping Tabby cat' },
        { id: 2, type: 'image', content: 'dog_photo.jpg', vector: [0.1, 0.9], label: 'Golden Retriever running in park' },
        { id: 3, type: 'text', content: 'cat_care_guide.txt', vector: [0.85, 0.15], label: 'Guide: How to feed cats' },
        { id: 4, type: 'image', content: 'chart_sales.png', vector: [-0.5, -0.5], label: 'Q3 Revenue Bar Chart' },
    ];

    const performSearch = (text) => {
        setQuery(text);
        if (!text) {
            setResults([]);
            return;
        }
        
        setIsSearching(true);
        setTimeout(() => {
            // Simulated similarity: a keyword match on the label stands in for real vector math
            const terms = text.toLowerCase().split(/\s+/).filter(Boolean);
            const matched = database
                .map(item => {
                    const label = item.label.toLowerCase();
                    // Strong match if any query term appears in the label
                    let score = terms.some(term => label.includes(term)) ? 0.9 : 0;
                    // Add a little noise so the scores look less uniform
                    score += Math.random() * 0.1;
                    return { ...item, score };
                })
                .filter(i => i.score > 0.3)
                .sort((a, b) => b.score - a.score);

            setResults(matched);
            setIsSearching(false);
        }, 800);
    };

    return (
        <div className="bg-slate-50 dark:bg-slate-950 p-8 rounded-3xl border border-slate-200 dark:border-slate-800 shadow-xl h-[600px] flex flex-col">
             <div className="flex justify-between items-center mb-6">
                <h3 className="text-2xl font-black text-gray-900 dark:text-white flex items-center gap-3">
                    <span className="text-orange-500">👁️</span> Vector Search
                </h3>
            </div>

            {/* Input */}
            <div className="relative mb-8 z-20">
                <div className="absolute inset-y-0 left-0 pl-4 flex items-center pointer-events-none">
                    <span className="text-gray-400">🔍</span>
                </div>
                <input 
                    type="text"
                    placeholder="Search for 'chart', 'dog', or 'cat'..."
                    className="w-full pl-12 pr-4 py-4 rounded-xl bg-white dark:bg-slate-900 border border-slate-200 dark:border-slate-800 focus:ring-2 focus:ring-orange-500 outline-none shadow-lg text-lg"
                    value={query}
                    onChange={(e) => performSearch(e.target.value)}
                />
            </div>

            {/* Visualization Space */}
            <div className="flex-1 bg-white dark:bg-black rounded-2xl border border-slate-200 dark:border-slate-800 p-8 relative overflow-hidden">
                <div className="absolute top-4 right-4 text-xs font-mono text-gray-400 z-10">
                    2D Projection of Latent Space
                </div>
                
                {/* The "Space" */}
                <div className="w-full h-full relative flex items-center justify-center">
                    
                    {/* Grid Lines */}
                    <div className="absolute inset-0 opacity-10" style={{ backgroundImage: 'linear-gradient(#ccc 1px, transparent 1px), linear-gradient(90deg, #ccc 1px, transparent 1px)', backgroundSize: '40px 40px' }}></div>

                    {/* Database Items (Distributed in Space) */}
                    {database.map((item) => {
                        // Hardcoded positions for visual demo
                        const positions = {
                            1: { top: '30%', left: '70%' }, // Cat
                            2: { top: '30%', left: '30%' }, // Dog
                            3: { top: '40%', left: '75%' }, // Cat Text (Close to cat img)
                            4: { top: '80%', left: '50%' }, // Chart
                        };
                        const pos = positions[item.id];
                        const isMatch = results.find(r => r.id === item.id);

                        return (
                            <div 
                                key={item.id}
                                className={`absolute -translate-x-1/2 -translate-y-1/2 p-3 rounded-xl border-2 transition-all duration-500 flex flex-col items-center gap-2 cursor-pointer ${
                                    isMatch 
                                    ? 'border-orange-500 bg-orange-50 dark:bg-orange-900/30 scale-110 z-20 shadow-xl shadow-orange-500/20' 
                                    : 'border-slate-200 dark:border-slate-800 bg-slate-50 dark:bg-slate-900 scale-100 opacity-60 hover:opacity-100 grayscale hover:grayscale-0'
                                }`}
                                style={{ top: pos.top, left: pos.left }}
                            >
                                <div className={`w-12 h-12 rounded-lg flex items-center justify-center ${isMatch ? 'bg-orange-100 dark:bg-orange-800' : 'bg-gray-200 dark:bg-slate-800'}`}>
                                    {item.type === 'image' ? <span className="text-2xl">🖼️</span> : <span className="text-2xl">📄</span>}
                                </div>
                                <div className="text-[10px] font-bold max-w-[80px] text-center truncate px-1 rounded bg-white/50 dark:bg-black/50">
                                    {item.label}
                                </div>
                                {isMatch && (
                                    <div className="absolute -top-3 -right-3 bg-orange-600 text-white text-[10px] px-2 py-0.5 rounded-full font-bold shadow-sm">
                                        {(isMatch.score * 100).toFixed(0)}%
                                    </div>
                                )}
                            </div>
                        );
                    })}

                    {/* Search Vector Center */}
                    {query && (
                         <div className="absolute top-[35%] left-[50%] -translate-x-1/2 -translate-y-1/2 animate-in fade-in zoom-in duration-300">
                             <div className="w-24 h-24 border-4 border-dashed border-orange-500 rounded-full animate-spin-slow opacity-20 absolute"></div>
                             <div className="w-4 h-4 bg-orange-500 rounded-full shadow-[0_0_20px_rgba(249,115,22,1)] relative z-30"></div>
                             <div className="absolute top-6 left-1/2 -translate-x-1/2 whitespace-nowrap bg-black text-white text-[10px] px-2 py-1 rounded">Avg Query Vector</div>
                         </div>
                    )}

                </div>
            </div>
        </div>
    );
}