Multi-modal RAG
RAG over images, tables, and text together. Unify retrieval across all document types.
The Multi-modal Challenge
Traditional RAG systems only handle text. But real documents contain images, charts, tables, and diagrams that carry critical information. A financial report's chart might answer "What was Q3 revenue?" better than any text paragraph.
Multi-modal RAG retrieves across all content types, returning the most relevant piece regardless of modality.
// Text-only RAG (limited)
User: "Show me the system architecture"
RAG: "The system uses a microservices architecture..." (misses diagram)
// Multi-modal RAG
User: "Show me the system architecture"
RAG: [Returns architecture diagram from page 12] + description
Two Approaches to Multi-modal RAG
1. Separate Pipelines
Extract text, images, and tables separately. Use different embedding models for each. Merge results at query time.
- + Uses specialized models per modality
- + More control over each pipeline
- - Complex orchestration
- - Score normalization challenges
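To make the merge step concrete, here is a minimal sketch (not a library API) that min-max normalizes each retriever's scores before combining them; text_hits, image_hits, and table_hits are assumed to be lists of (item, raw_score) pairs coming from your per-modality retrievers:
# Hypothetical merge step for separate text / image / table retrievers
def normalize(results):
    """Min-max normalize one retriever's scores into [0, 1] so modalities are comparable."""
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    return {item: (score - lo) / span for item, score in results}

def merge_modalities(text_hits, image_hits, table_hits, top_k: int = 5):
    """Combine per-modality results, keeping each item's best normalized score."""
    merged = {}
    for hits in (text_hits, image_hits, table_hits):
        for item, score in normalize(hits).items():
            merged[item] = max(merged.get(item, 0.0), score)
    return sorted(merged, key=merged.get, reverse=True)[:top_k]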
2. Vision-Language Models
Treat entire document pages as images. Embed with VLMs like ColPali. Query retrieves the most relevant page.
- + Single unified pipeline
- + Captures layout and visual context
- + No extraction errors
- - Higher compute cost
ColPali: Vision-Language Retrieval
ColPali (ColBERT + PaliGemma) is a late-interaction vision-language model designed specifically for document retrieval. It embeds document pages as images and matches them against text queries.
Unlike CLIP, which compresses an entire input into a single vector, ColPali uses late interaction: it produces one embedding per image patch (and per query token) and scores relevance with fine-grained token-to-patch similarity (MaxSim).
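To make "late interaction" concrete, here is a minimal MaxSim sketch in PyTorch; the tensor shapes are illustrative, and both sets of embeddings are assumed to be L2-normalized:
import torch

def maxsim_score(query_tokens: torch.Tensor, page_patches: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance of one page for one query.
    query_tokens: (num_query_tokens, dim), page_patches: (num_patches, dim)."""
    sim = query_tokens @ page_patches.T        # similarity of every query token to every patch
    return sim.max(dim=1).values.sum()         # each query token keeps its best patch, then sum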
ColPali Retrieval Pipeline
# Multi-modal retrieval with ColPali (requires the colpali-engine package)
import torch
from colpali_engine.models import ColPali, ColPaliProcessor

# Load the vision-language retriever and its processor
model = ColPali.from_pretrained("vidore/colpali-v1.2", torch_dtype=torch.bfloat16).eval()
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2")

# Embed document pages as images: one multi-vector embedding per page
def embed_pages(page_images: list) -> list:
    batch = processor.process_images(page_images).to(model.device)
    with torch.no_grad():
        embeddings = model(**batch)            # (num_pages, num_patches, dim)
    return list(torch.unbind(embeddings))      # list of per-page multi-vector embeddings

# Query with text: late-interaction (MaxSim) scoring against every page
def query_pages(query: str, page_embeddings: list):
    batch = processor.process_queries([query]).to(model.device)
    with torch.no_grad():
        query_emb = model(**batch)             # (1, num_query_tokens, dim)
    scores = processor.score_multi_vector(list(torch.unbind(query_emb)), page_embeddings)
    return scores[0].argsort(descending=True)  # page indices, best match first

ColPali "sees" the entire page - text, images, tables, layout, headers, footnotes. No information is lost in extraction. The model learns that a pie chart answers percentage questions and a table answers comparison questions.
Embedding Strategies for Mixed Content
Page-as-Image (ColPali Style)
Convert each document page to an image. Embed with a vision-language model. Works for any document type. Best when layout matters.
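A minimal sketch of that conversion step, assuming the pdf2image package (which needs poppler installed) and reusing the embed_pages helper from the ColPali example above; report.pdf is a placeholder path:
# Render PDF pages to images before embedding them
from pdf2image import convert_from_path

pages = convert_from_path("report.pdf", dpi=150)   # one PIL image per PDF page
page_embeddings = embed_pages(pages)               # multi-vector embedding per page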
Caption-Then-Embed
Use a VLM (GPT-4V, Claude) to caption images and describe tables. Embed the captions with a text embedding model. Simple but lossy.
Unified Embedding Space
Use CLIP or SigLIP to embed both images and text into the same vector space. Query with text, retrieve images directly. Limited by CLIP's training distribution.
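A minimal sketch using the CLIP checkpoint shipped with sentence-transformers; the image file names are placeholders:
# Text and images in one CLIP embedding space
from sentence_transformers import SentenceTransformer, util
from PIL import Image

clip = SentenceTransformer("clip-ViT-B-32")

image_embs = clip.encode([Image.open("chart.png"), Image.open("team_photo.jpg")])
query_emb = clip.encode("a bar chart of quarterly revenue")

scores = util.cos_sim(query_emb, image_embs)   # text query scored directly against images
best = scores.argmax().item()                  # index of the best-matching image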
Hybrid: Text + Image Indexes
Maintain separate indexes for text chunks and images. Query both, combine results with learned weights or reciprocal rank fusion.
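Reciprocal rank fusion is easy to sketch; this assumes each index returns a ranked list of document IDs, and 60 is the conventional RRF constant:
def reciprocal_rank_fusion(text_hits: list, image_hits: list, k: int = 60):
    """Fuse two ranked lists of document IDs; items ranked high in either list win."""
    fused = {}
    for hits in (text_hits, image_hits):
        for rank, doc_id in enumerate(hits):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)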
When to Use VLMs vs Separate Pipelines
| Scenario | Recommendation |
|---|---|
| PDF reports with charts and tables | VLM (ColPali) - layout matters |
| Photo library search | CLIP/SigLIP - natural images |
| Mixed text docs + product images | Hybrid - separate indexes |
| Scanned documents (OCR needed) | VLM - handles visual text |
| High volume, latency-critical | Separate - can optimize each |
Rule of Thumb
If your documents have significant visual structure (tables, diagrams, forms, charts), use VLM-based approaches. If you're primarily dealing with natural images + text, separate pipelines with CLIP may be simpler and faster.
Full Pipeline: Caption-Then-Embed
Here's a practical approach using GPT-4o's vision input to caption images, then embedding the captions alongside text:
from openai import OpenAI
from sentence_transformers import SentenceTransformer
import base64

client = OpenAI()
embedder = SentenceTransformer('BAAI/bge-large-en-v1.5')

def caption_image(image_path: str) -> str:
    """Use GPT-4o's vision input to generate a detailed, search-friendly caption"""
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail for search indexing. Include all text, numbers, and visual elements."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}}
            ]
        }],
        max_tokens=500
    )
    return response.choices[0].message.content

def build_multimodal_index(documents: list):
    """Build a unified index from text chunks and image captions"""
    all_content = []
    for doc in documents:
        if doc['type'] == 'text':
            all_content.append({
                'content': doc['text'],
                'source': doc['source'],
                'type': 'text'
            })
        elif doc['type'] == 'image':
            caption = caption_image(doc['path'])
            all_content.append({
                'content': caption,
                'source': doc['path'],
                'type': 'image',
                'original_path': doc['path']
            })
    # Embed text chunks and image captions uniformly with the same text embedder
    texts = [item['content'] for item in all_content]
    embeddings = embedder.encode(texts, normalize_embeddings=True)
    return all_content, embeddings
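To close the loop, here is a minimal sketch of querying that index; query_multimodal_index is a hypothetical helper that reuses the embedder defined above and returns the top items, whether they came from text chunks or image captions:
import numpy as np

def query_multimodal_index(query: str, all_content, embeddings, top_k: int = 3):
    """Embed the query with the same text embedder and return the best matches."""
    query_emb = embedder.encode([query], normalize_embeddings=True)[0]
    scores = embeddings @ query_emb                    # cosine similarity (vectors are normalized)
    top_ids = np.argsort(scores)[::-1][:top_k]
    return [{**all_content[i], "score": float(scores[i])} for i in top_ids]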
Key Takeaways
1. Real documents are multi-modal - Text-only RAG misses charts, diagrams, and tables that often contain key information.
2. ColPali treats pages as images - No extraction needed. The model "sees" layout, text, and visuals together.
3. Caption-then-embed is practical - Use a VLM like GPT-4o to describe images, then embed captions with text embedders.
4. Match approach to content - VLMs for structured documents, CLIP for natural images, hybrid for mixed collections.