Level 4: Advanced (~30 min)

Multi-modal RAG

RAG over images, tables, and text together. Unify retrieval across all document types.

The Multi-modal Challenge

Traditional RAG systems only handle text. But real documents contain images, charts, tables, and diagrams that carry critical information. A financial report's chart might answer "What was Q3 revenue?" better than any text paragraph.

Multi-modal RAG retrieves across all content types, returning the most relevant piece regardless of modality.

// Text-only RAG (limited)

User: "Show me the system architecture"

RAG: "The system uses a microservices architecture..." (misses diagram)

// Multi-modal RAG

User: "Show me the system architecture"

RAG: [Returns architecture diagram from page 12] + description

Two Approaches to Multi-modal RAG

1. Separate Pipelines

Extract text, images, and tables separately. Use different embedding models for each. Merge results at query time.

  • + Uses specialized models per modality
  • + More control over each pipeline
  • - Complex orchestration
  • - Score normalization challenges
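
One way to handle the score-normalization challenge is to min-max normalize each pipeline's scores before merging them with modality weights. A minimal sketch, assuming hypothetical search_text_index and search_image_index helpers that each return (doc_id, score) pairs:

# Merge step for separate pipelines (hypothetical helpers:
# search_text_index and search_image_index each return a list of (doc_id, score))
def normalize(results):
    # Min-max normalize scores so the two pipelines are comparable
    scores = [s for _, s in results]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    return {doc_id: (s - lo) / span for doc_id, s in results}

def merge_results(query, text_weight=0.5, image_weight=0.5, k=5):
    text_scores = normalize(search_text_index(query))
    image_scores = normalize(search_image_index(query))

    # Weighted sum of normalized scores, summed per document
    combined = {}
    for doc_id, score in text_scores.items():
        combined[doc_id] = combined.get(doc_id, 0.0) + text_weight * score
    for doc_id, score in image_scores.items():
        combined[doc_id] = combined.get(doc_id, 0.0) + image_weight * score

    return sorted(combined.items(), key=lambda x: x[1], reverse=True)[:k]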

2. Vision-Language Models

Treat entire document pages as images. Embed with VLMs like ColPali. Query retrieves the most relevant page.

  • + Single unified pipeline
  • + Captures layout and visual context
  • + No extraction errors
  • - Higher compute cost

ColPali: Vision-Language Retrieval

ColPali (ColBERT-style late interaction + PaliGemma) is a vision-language model designed specifically for document retrieval. It embeds document pages as images and matches them against text queries.

Unlike CLIP, which produces a single vector per image, ColPali uses late interaction: it produces multiple patch embeddings per page and computes fine-grained similarity between query tokens and page patches.

ColPali Retrieval Pipeline

# Multi-modal retrieval with a ColPali-style approach
# (simplified sketch; production setups typically use the colpali-engine package)
from transformers import AutoProcessor, AutoModel
import torch
import torch.nn.functional as F

# Load the vision-language model and its processor
processor = AutoProcessor.from_pretrained("vidore/colpali-v1.2")
model = AutoModel.from_pretrained("vidore/colpali-v1.2")

# Embed a document page as a sequence of patch embeddings (late interaction)
def embed_page(image):
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        patch_embeddings = model(**inputs).last_hidden_state[0]  # [n_patches, dim]
    return F.normalize(patch_embeddings, dim=-1)

# Score pages against a text query with MaxSim late interaction
def query_pages(query: str, page_embeddings: list):
    query_inputs = processor(text=query, return_tensors="pt")
    with torch.no_grad():
        query_emb = model(**query_inputs).last_hidden_state[0]  # [n_tokens, dim]
    query_emb = F.normalize(query_emb, dim=-1)

    # For each query token, keep its best-matching patch, then sum over tokens
    scores = torch.stack([
        (query_emb @ page_emb.T).max(dim=1).values.sum()
        for page_emb in page_embeddings
    ])
    return scores.argsort(descending=True)
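
To tie the pieces together, here is a hypothetical usage sketch. It assumes the pdf2image package (and its poppler dependency) is installed and that report.pdf is a local file; each page is rendered to an image, indexed with embed_page, and ranked with query_pages.

# Usage sketch (assumes pdf2image + poppler are installed and report.pdf exists)
from pdf2image import convert_from_path

# Render every PDF page as a PIL image
pages = convert_from_path("report.pdf", dpi=150)

# Index: one multi-vector embedding per page
page_embeddings = [embed_page(page) for page in pages]

# Retrieve: pages ranked by late-interaction similarity to the query
ranking = query_pages("What was Q3 revenue?", page_embeddings)
print(f"Most relevant page: {ranking[0].item() + 1}")
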
Key insight:

ColPali "sees" the entire page - text, images, tables, layout, headers, footnotes. No information is lost in extraction. The model learns that a pie chart answers percentage questions and a table answers comparison questions.

Embedding Strategies for Mixed Content

1. Page-as-Image (ColPali Style)

Convert each document page to an image. Embed with a vision-language model. Works for any document type. Best when layout matters.

2. Caption-Then-Embed

Use a VLM (GPT-4V, Claude) to caption images and describe tables. Embed the captions with a text embedding model. Simple but lossy.

3. Unified Embedding Space

Use CLIP or SigLIP to embed both images and text into the same vector space. Query with text, retrieve images directly. Limited by CLIP's training distribution.
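
A minimal sketch of this strategy using the CLIP classes from transformers; the model name and image paths are illustrative:

# Unified embedding space with CLIP (model and image paths are illustrative)
from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import torch

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    images = [Image.open(p) for p in paths]
    inputs = clip_processor(images=images, return_tensors="pt")
    with torch.no_grad():
        features = clip_model.get_image_features(**inputs)
    return features / features.norm(dim=-1, keepdim=True)

def search_images(query: str, image_embeddings):
    inputs = clip_processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        query_features = clip_model.get_text_features(**inputs)
    query_features = query_features / query_features.norm(dim=-1, keepdim=True)
    # Cosine similarity: text query vs. every image, highest first
    scores = (query_features @ image_embeddings.T).squeeze(0)
    return scores.argsort(descending=True)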

4. Hybrid: Text + Image Indexes

Maintain separate indexes for text chunks and images. Query both, combine results with learned weights or reciprocal rank fusion.
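
Reciprocal rank fusion is simple enough to sketch directly. Assuming each index returns a ranked list of document IDs (best first):

# Reciprocal rank fusion over separate text and image indexes
# (assumes each index returns a ranked list of document IDs, best first)
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Standard RRF: 1 / (k + rank), with k damping the head of each list
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fuse the top results from a text index and an image index
fused = reciprocal_rank_fusion([
    ["chunk_12", "chunk_7", "img_3"],   # text index ranking
    ["img_3", "img_9", "chunk_12"],     # image index ranking
])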

When to Use VLMs vs Separate Pipelines

Scenario                              | Recommendation
PDF reports with charts and tables    | VLM (ColPali) - layout matters
Photo library search                  | CLIP/SigLIP - natural images
Mixed text docs + product images      | Hybrid - separate indexes
Scanned documents (OCR needed)        | VLM - handles visual text
High volume, latency-critical         | Separate pipelines - can optimize each

Rule of Thumb

If your documents have significant visual structure (tables, diagrams, forms, charts), use VLM-based approaches. If you're primarily dealing with natural images + text, separate pipelines with CLIP may be simpler and faster.

Full Pipeline: Caption-Then-Embed

Here's a practical approach using a vision-capable model (GPT-4o in the code below) to caption images, then embedding the captions alongside text chunks:

from openai import OpenAI
from sentence_transformers import SentenceTransformer
import base64

client = OpenAI()
embedder = SentenceTransformer('BAAI/bge-large-en-v1.5')

def caption_image(image_path: str) -> str:
    """Use GPT-4V to generate a detailed caption"""
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail for search indexing. Include all text, numbers, and visual elements."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}}
            ]
        }],
        max_tokens=500
    )
    return response.choices[0].message.content

def build_multimodal_index(documents: list):
    """Build unified index from text chunks and image captions"""
    all_content = []

    for doc in documents:
        if doc['type'] == 'text':
            all_content.append({
                'content': doc['text'],
                'source': doc['source'],
                'type': 'text'
            })
        elif doc['type'] == 'image':
            caption = caption_image(doc['path'])
            all_content.append({
                'content': caption,
                'source': doc['path'],
                'type': 'image',
                'original_path': doc['path']
            })

    # Embed all content uniformly
    texts = [item['content'] for item in all_content]
    embeddings = embedder.encode(texts, normalize_embeddings=True)

    return all_content, embeddings
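
A query step for this index might look like the sketch below: since the embeddings are normalized, cosine similarity reduces to a dot product, and image hits come back with their caption plus the original file path.

# Query the unified index built above (sketch; cosine similarity reduces
# to a dot product because the embeddings are normalized)
import numpy as np

def search(query: str, all_content: list, embeddings, top_k: int = 3):
    query_emb = embedder.encode([query], normalize_embeddings=True)[0]
    scores = embeddings @ query_emb
    top_indices = np.argsort(scores)[::-1][:top_k]

    results = []
    for idx in top_indices:
        item = all_content[idx]
        results.append({
            'score': float(scores[idx]),
            'type': item['type'],
            'content': item['content'],                       # text chunk or image caption
            'source': item.get('original_path', item['source'])
        })
    return results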

Key Takeaways

  1. Real documents are multi-modal - Text-only RAG misses charts, diagrams, and tables that often contain key information.

  2. ColPali treats pages as images - No extraction needed. The model "sees" layout, text, and visuals together.

  3. Caption-then-embed is practical - Use GPT-4V to describe images, then embed captions with text embedders.

  4. Match approach to content - VLMs for structured documents, CLIP for natural images, hybrid for mixed collections.