Multi-modal RAG
From CLIP to ColPali: the history, architecture, and practice of retrieval across images, tables, and text.
The Problem Text-Only RAG Cannot Solve
Traditional RAG systems treat documents as flat text. They split PDFs into chunks, embed the chunks, and retrieve the closest matches to a query. This works beautifully when documents are pure prose. But real-world documents are not pure prose.
A financial report's revenue breakdown lives in a bar chart, not in a sentence. An engineering spec's system design is in a diagram, not in paragraph three. A medical paper's key result is in Table 2, not in the abstract. Text-only RAG cannot retrieve what it cannot see.
Multi-modal RAG is the field's answer: retrieval systems that operate over images, tables, charts, diagrams, and text simultaneously, returning the most relevant piece regardless of modality. The journey to get here spans a decade of converging work in vision-language modeling, document understanding, and retrieval architecture.
// Text-only RAG (blind to visual content)
User: "What was Q3 revenue by segment?"
RAG: "Revenue grew 12% year-over-year..." (misses the breakdown chart on page 7)
// Multi-modal RAG (sees everything)
User: "What was Q3 revenue by segment?"
RAG: [Returns stacked bar chart from page 7] + "Enterprise: $4.2B, Consumer: $1.8B..."
A Decade of Multimodal Retrieval: 2013 to 2025
Understanding the history explains why certain architectures won and others didn't, and why ColPali's approach was genuinely surprising to the community.
DeViSE: Deep Visual-Semantic Embeddings
Andrea Frome et al. at Google trained a system to map images into the same vector space as Word2Vec word embeddings. A photo of a cat would land near the word "cat" in embedding space. The architecture was simple: a CNN image encoder trained with a hinge rank loss to push image vectors toward their label's word vector and away from random negatives.
DeViSE demonstrated that cross-modal alignment was possible — you could query with text and retrieve images — but it was limited to single-word labels. It could not handle "a cat sitting on a blue mat" because Word2Vec had no sentence representations.
— Frome, A. et al. (2013). DeViSE: A Deep Visual-Semantic Embedding Model. NeurIPS.
VSE++: Visual Semantic Embeddings with Hard Negatives
Faghri, Fleet, Kiros, and Fidler improved on DeViSE by aligning images with full sentences (captions) rather than single words. The key contribution was using hard negative mining — training on the most confusing negative examples rather than random ones. This dramatically improved retrieval precision and became standard practice in all subsequent contrastive vision-language models.
— Faghri, F. et al. (2018). VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. BMVC.
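The max-violating objective at the heart of VSE++ is compact enough to sketch directly (a numpy illustration over a toy similarity matrix, not the paper's implementation):

```python
import numpy as np

def vse_hard_negative_loss(sim, margin=0.2):
    """Max-violating triplet loss from VSE++ (sketch).

    sim[i, j] = similarity between image i and caption j;
    the diagonal holds the matching (positive) pairs.
    """
    pos = np.diag(sim)                    # positive-pair similarities
    mask = np.eye(len(sim), dtype=bool)
    neg = sim.copy()
    neg[mask] = -np.inf                   # exclude positives from the search
    hardest_cap = neg.max(axis=1)         # hardest caption for each image
    hardest_img = neg.max(axis=0)         # hardest image for each caption
    loss_i2t = np.maximum(0, margin + hardest_cap - pos)
    loss_t2i = np.maximum(0, margin + hardest_img - pos)
    return (loss_i2t + loss_t2i).mean()

sim = np.array([[0.9, 0.2, 0.1],
                [0.3, 0.8, 0.4],
                [0.1, 0.2, 0.7]])
print(vse_hard_negative_loss(sim))  # 0.0 -- every positive beats its hardest negative by the margin
```

Training on the hardest negative in the batch, rather than averaging over all negatives, is the entire contribution; the rest of the pipeline is a standard dual encoder.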
CLIP: The Scaling Breakthrough
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh et al. at OpenAI scaled the dual-encoder approach to 400 million image-text pairs scraped from the internet (WebImageText). CLIP trained a Vision Transformer (ViT) and a text transformer jointly with a contrastive loss: matching pairs should be close, non-matching pairs should be far apart.
The results stunned the field. CLIP achieved competitive zero-shot classification on ImageNet without ever seeing an ImageNet training example. It could match arbitrary text descriptions to images with remarkable accuracy. The secret was scale: 400M diverse pairs taught the model generalizable visual-semantic alignment that no curated dataset could match.
# CLIP architecture (simplified)
image_encoder = ViT_L14()        # Vision Transformer
text_encoder = TransformerGPT()  # 63M-param text transformer

# Contrastive training on 400M pairs
for images, texts in dataloader:
    img_emb = image_encoder(images)  # (batch, 768)
    txt_emb = text_encoder(texts)    # (batch, 768)
    # Symmetric cross-entropy loss on cosine similarity matrix
    logits = img_emb @ txt_emb.T * temperature
    loss = (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2
— Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML. 15,000+ citations.
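The zero-shot mechanics follow directly from the shared embedding space: encode one image and several candidate prompts, then softmax over the scaled cosine similarities. A runnable numpy sketch with toy embeddings standing in for real CLIP outputs:

```python
import numpy as np

def zero_shot_classify(img_emb, text_embs, temperature=100.0):
    """CLIP-style zero-shot classification (toy sketch).

    img_emb:   (d,) image embedding
    text_embs: (n_classes, d) embeddings of prompts like
               "a photo of a cat", "a photo of a dog", ...
    """
    img = img_emb / np.linalg.norm(img_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)     # scaled cosine similarities
    probs = np.exp(logits - logits.max())  # stable softmax
    return probs / probs.sum()

# Toy 4-dim embeddings standing in for real CLIP outputs
img = np.array([0.9, 0.1, 0.0, 0.1])
prompts = np.array([[1.0, 0.0, 0.0, 0.0],   # "a photo of a cat"
                    [0.0, 1.0, 0.0, 0.0],   # "a photo of a dog"
                    [0.0, 0.0, 1.0, 0.0]])  # "a photo of a car"
probs = zero_shot_classify(img, prompts)
print(probs.argmax())  # 0 -- the "cat" prompt wins
```

No classifier head is trained; the class set is whatever prompts you write, which is why CLIP could classify ImageNet categories it had never been explicitly trained on.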
CLIP's limitation for document RAG
CLIP was trained on natural images with alt-text captions — photos, illustrations, memes. It was not trained on document pages, tables, charts, or forms. Asking CLIP to retrieve a financial table or an architecture diagram gives poor results because the model has no concept of structured document layout. This gap is precisely what motivated the document-understanding models that followed.
SigLIP: Sigmoid Loss for Efficient Scaling
Zhai, Mustafa, Kolesnikov, and Beyer at Google replaced CLIP's softmax contrastive loss with a per-pair sigmoid loss. This eliminated the need for global normalization across the batch, enabling training with larger batch sizes and across multiple TPU pods without synchronization. SigLIP matched CLIP quality at lower computational cost and became the backbone for Google's PaliGemma vision-language models — which in turn became the foundation for ColPali.
— Zhai, X. et al. (2023). Sigmoid Loss for Language Image Pre-Training. ICCV.
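The difference from CLIP's loss is easy to see in code: each (image, text) pair becomes an independent binary classification, so no softmax normalization across the batch is required. A numpy sketch (the learned scale `t` and bias `b` are set to representative values, not trained):

```python
import numpy as np

def siglip_loss(img_embs, txt_embs, t=10.0, b=-10.0):
    """Per-pair sigmoid loss from SigLIP (sketch).

    Label is +1 on the diagonal (matching pairs), -1 elsewhere.
    Because each pair is scored independently, the loss can be
    computed chunk-by-chunk across devices with no global sync.
    """
    n = len(img_embs)
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = t * (img @ txt.T) + b   # (n, n)
    labels = 2 * np.eye(n) - 1       # +1 diagonal, -1 off-diagonal
    # mean of -log sigmoid(label * logit) over all n^2 pairs
    return np.mean(np.log1p(np.exp(-labels * logits)))

img = np.eye(2, 4)  # two toy, already-aligned unit embeddings
txt = np.eye(2, 4)
loss = siglip_loss(img, txt)
```

With perfectly aligned toy embeddings the loss is already small; the negative bias `b` reflects the paper's observation that most pairs in a large batch are negatives.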
LayoutLM: Position-Aware Document Embeddings
Yiheng Xu et al. at Microsoft extended BERT with 2D positional embeddings — encoding each token's (x, y) bounding box on the page alongside its text content. For the first time, a model could distinguish between a heading at the top of a page and a footnote at the bottom, or understand that two text blocks in adjacent columns were separate content streams.
LayoutLMv2 (2021) added a visual backbone to process the page image directly, and LayoutLMv3 (2022) unified text, layout, and vision pre-training into a single model. These models excelled at form understanding, receipt parsing, and document classification — but they were designed for extraction tasks, not retrieval. Each document needed OCR and layout analysis as preprocessing.
— Xu, Y. et al. (2020). LayoutLM: Pre-training of Text and Layout. KDD.
— Huang, Y. et al. (2022). LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. ACM MM.
Donut & Nougat: OCR-Free Document Understanding
Kim et al. at NAVER (Donut, 2022) and Blecher et al. at Meta (Nougat, 2023) demonstrated that you could skip OCR entirely. Feed a document page image into a vision encoder, decode it with a transformer, and directly produce structured text (Markdown, LaTeX). Nougat could convert academic papers — equations, tables, figures, and all — directly from page images to formatted Markdown.
This was a conceptual breakthrough: if a model could "read" a page as an image, maybe it could also retrieve pages as images. The extraction step that LayoutLM required was now optional.
— Kim, G. et al. (2022). OCR-free Document Understanding Transformer. ECCV.
— Blecher, L. et al. (2023). Nougat: Neural Optical Understanding for Academic Documents. arXiv.
ColPali: The Page-as-Image Revolution
Manuel Faysse, Hugues Sibille, Tony Wu et al. at ILLUIN Technology and Sorbonne asked a radical question: what if we skip the entire document processing pipeline — no OCR, no layout detection, no chunking, no table extraction — and just embed document pages as images?
ColPali combined Google's PaliGemma vision-language model (built on SigLIP + Gemma) with the ColBERT late interaction mechanism. Each document page produces a grid of patch embeddings (one per image patch), and each query produces token embeddings. Retrieval uses MaxSim: for each query token, find its maximum similarity across all page patches, then sum these scores.
"We show that by simply treating pages as images and leveraging VLMs, we can bypass complex and brittle layout detection and OCR pipelines, while achieving state-of-the-art performance on document retrieval benchmarks."
— Faysse, M. et al. (2024). ColPali: Efficient Document Retrieval with Vision Language Models. arXiv.
ColPali beat all existing document retrieval systems on the ViDoRe benchmark — including systems with elaborate OCR + chunking + reranking pipelines. The result was almost paradoxical: the simplest possible approach (just look at the page) produced the best results.
The ColPali Family Expands
ColPali sparked a wave of follow-up models, each swapping the VLM backbone while keeping the late-interaction retrieval mechanism:
ColQwen2
Replaces PaliGemma with Qwen2-VL. Better multilingual support and dynamic resolution handling.
ColSmol
Uses SmolVLM (2B params). 80% of ColPali quality at a fraction of the compute cost.
ColPali v1.3
Refined training recipe with BiPali synthetic data generation. New SOTA on ViDoRe.
DSE (Document Screenshot Embedding)
Ma et al. at the University of Waterloo. Single-vector approach (no late interaction) for cheaper storage.
The throughline: 2013 to 2025
Each generation solved a specific limitation of the last: DeViSE aligned images with single words; VSE++ extended the alignment to full sentences with hard negatives; CLIP made the alignment generalize through web-scale data; SigLIP made that scale cheaper to train; LayoutLM taught models to read document layout; Donut and Nougat removed the OCR dependency; ColPali combined OCR-free page understanding with late-interaction retrieval.
Two Fundamental Approaches
Every multimodal RAG system is a variant of one of two architectures. Understanding the trade-offs between them is the single most important design decision you will make.
1. Extract-Then-Embed
Parse documents into modality-specific components (text chunks, extracted tables, captioned images). Embed each component separately. Query each index and merge results.
- + Uses specialized models per modality
- + Text embeddings are cheap and fast
- + Established tooling (Unstructured, LlamaParse)
- - OCR errors propagate through the pipeline
- - Layout information is destroyed during extraction
- - Score normalization across indexes is fragile
2. Page-as-Image (Vision-Language)
Render each document page as an image. Embed with a vision-language model. Query with text. Retrieve the most relevant page directly.
- + Zero preprocessing — any document format works
- + Layout, typography, and visual context preserved
- + No OCR errors — the model "reads" the image
- + Single unified pipeline, less code to maintain
- - Higher GPU compute at indexing time
- - Larger embedding storage (multi-vector per page)
ColPali Deep Dive: How It Works
ColPali is not a single model — it is an architecture pattern that combines two independent ideas: (1) a vision-language model that can "see" document pages, and (2) a late interaction retrieval mechanism that enables fine-grained matching. Understanding each component separately is essential.
Component 1: PaliGemma (The Vision-Language Backbone)
PaliGemma is Google's lightweight VLM combining SigLIP (a vision encoder) with Gemma (a language model). The SigLIP encoder splits each page image into a grid of patches (14x14 pixels each in PaliGemma) and encodes each patch into a vector. These patch vectors are then projected into Gemma's token space and processed alongside any text tokens.
# PaliGemma processing a document page
page_image                                         # resized to 448x448 pixels
patches = split_into_patches(page_image, size=14)  # 32x32 = 1024 patches
patch_vectors = siglip_encoder(patches)            # (1024, 1152)
projected = linear_projection(patch_vectors)       # (1024, 2048) → Gemma space
# Each patch "knows" about its 14x14 pixel region of the page
The critical insight: each patch embedding carries information about a specific spatial region of the document. A patch covering a table cell, a chart bar, or a heading word retains that positional context. This is why ColPali can answer "what's in the table in the top right?"
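To make the spatial claim concrete, a small illustrative helper (not part of any library) that maps a flat patch index back to its pixel region, assuming PaliGemma's 448x448 input with 14-pixel patches and row-major patch ordering:

```python
def patch_bbox(patch_idx, image_size=448, patch_size=14):
    """Map a flat ViT patch index back to its pixel region on the page.

    Assumes row-major patch ordering; numbers follow PaliGemma's
    448x448 input with 14-pixel patches (a 32x32 grid).
    """
    grid = image_size // patch_size            # 32 patches per side
    row, col = divmod(patch_idx, grid)
    x0, y0 = col * patch_size, row * patch_size
    return (x0, y0, x0 + patch_size, y0 + patch_size)

print(patch_bbox(0))   # (0, 0, 14, 14): top-left corner of the page
print(patch_bbox(33))  # (14, 14, 28, 28): second row, second column
```

This inverse mapping is what makes patch-level similarity scores spatially interpretable: a high-scoring patch index points at a specific region of the rendered page.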
Component 2: ColBERT Late Interaction
Standard dual encoders (CLIP, single-vector models) compress an entire document into one vector. This creates an information bottleneck — a 1024-page document and a 10-word query both become a single 768-dim vector. ColBERT's late interaction avoids this by keeping all token/patch vectors and computing fine-grained token-to-patch scores.
# Late interaction scoring (MaxSim)
def maxsim_score(query_tokens, page_patches):
    """
    For each query token, find its maximum similarity
    across ALL page patches. Then sum the maximums.
    """
    score = 0
    for q_token in query_tokens:          # e.g., 12 tokens
        max_sim = -inf
        for p_patch in page_patches:      # e.g., 1030 patches
            sim = cosine_similarity(q_token, p_patch)
            max_sim = max(max_sim, sim)
        score += max_sim
    return score

# Query: "Q3 revenue by segment" (5 tokens)
# "revenue" token matches strongly with the bar chart patches
# "segment" token matches with the legend patches
# "Q3" token matches with the axis label patches
# Total score is high → this page is relevant
Late interaction is more expensive than a single dot product, but far cheaper than running a cross-encoder over every page. The patch embeddings can be precomputed and stored, so retrieval is still fast — just a matrix multiplication per page rather than a scalar per page.
Component 3: Contrastive Training
ColPali is fine-tuned from PaliGemma on query-page pairs from document retrieval datasets. The training objective is straightforward: maximize the MaxSim score between a query and its relevant page, minimize it for irrelevant pages. Hard negatives are mined from the same document collection.
# ColPali training loop (simplified)
for queries, positive_pages, negative_pages in dataloader:
    q_embs = model.encode_query(queries)          # list of (n_tokens, 128)
    pos_embs = model.encode_page(positive_pages)  # list of (n_patches, 128)
    neg_embs = model.encode_page(negative_pages)
    pos_scores = maxsim(q_embs, pos_embs)  # should be HIGH
    neg_scores = maxsim(q_embs, neg_embs)  # should be LOW
    loss = softmax_cross_entropy(pos_scores, neg_scores)
    loss.backward()  # Gradients flow through VLM backbone
Why does page-as-image beat text extraction?
Three compounding reasons, each individually significant:
- No extraction errors. OCR misreads, table parsers misalign columns, figure extractors miss captions. Each error is a retrieval failure waiting to happen. ColPali has no extraction step to fail.
- Layout is information. A bold heading, a footnote in small type, a highlighted cell — these visual signals carry meaning that plain text discards. The VLM encodes layout implicitly through its patch structure.
- Cross-modal reasoning. When a chart's legend says "Revenue" and the bars show numerical values, a VLM can associate the query "revenue" with both the text label and the visual bars. A text-only system would need the chart parsed into a description first.
The Table Problem: Why Tables Break Text-Only RAG
Tables deserve special attention because they are the single most common failure mode in production RAG systems. A 2024 survey of enterprise RAG deployments found that table-related queries accounted for 40%+ of user complaints despite tables being a small fraction of total content.
The reason is fundamental: tables encode meaning through spatial relationships — the value "$4.2B" only means "Q3 Enterprise Revenue" because of its position at the intersection of the "Q3" column and the "Enterprise" row. Flatten the table to text and this spatial relationship is destroyed or degraded.
How text extraction mangles tables
# Original table (spatial meaning is clear):
# ┌─────────────┬────────┬────────┬────────┐
# │ Segment │ Q1 │ Q2 │ Q3 │
# ├─────────────┼────────┼────────┼────────┤
# │ Enterprise │ $3.8B │ $4.0B │ $4.2B │
# │ Consumer │ $1.5B │ $1.6B │ $1.8B │
# └─────────────┴────────┴────────┴────────┘
# After naive text extraction:
"Segment Q1 Q2 Q3 Enterprise $3.8B $4.0B $4.2B Consumer $1.5B $1.6B $1.8B"
# Which $4.2B belongs to which segment? Which quarter?
# The spatial meaning is gone.
# After "smart" table extraction (still lossy):
"Enterprise: Q1=$3.8B, Q2=$4.0B, Q3=$4.2B\nConsumer: Q1=$1.5B..."
# Better, but now the embedding model sees a flat string
# that doesn't match how users ask questions about tables
A VLM like ColPali processes the table as an image. The patch covering the "$4.2B" cell inherits spatial context from neighboring patches — the column header "Q3" above it and the row header "Enterprise" to its left. No extraction needed.
Strategy 1: Table-to-Markdown
Extract tables to Markdown format using LlamaParse or Unstructured. Embed the Markdown. Works for simple tables. Fails on merged cells, nested headers, and spanning rows.
Strategy 2: Table-to-NL
Use an LLM to convert each table into natural language sentences. "Enterprise revenue was $4.2B in Q3." Better for embedding but expensive and lossy for complex tables.
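For well-structured tables, Strategy 2 does not strictly need an LLM: a deterministic template over extracted rows yields similar sentences at zero cost. A sketch using the revenue table from above (the "revenue" wording is hard-coded for illustration and would come from table context in practice):

```python
def table_to_sentences(headers: list[str], rows: list[list[str]]) -> list[str]:
    """Flatten a table into one sentence per cell, preserving the
    row/column context that naive text extraction destroys."""
    sentences = []
    for row in rows:
        subject, values = row[0], row[1:]
        for period, value in zip(headers[1:], values):
            # "revenue" is assumed from surrounding context, for illustration
            sentences.append(f"{subject} revenue was {value} in {period}.")
    return sentences

headers = ["Segment", "Q1", "Q2", "Q3"]
rows = [["Enterprise", "$3.8B", "$4.0B", "$4.2B"],
        ["Consumer",   "$1.5B", "$1.6B", "$1.8B"]]
for s in table_to_sentences(headers, rows):
    print(s)  # e.g. "Enterprise revenue was $4.2B in Q3."
```

Each sentence now matches the way users phrase questions, at the cost of exploding a 2x3 table into six chunks; LLM-based conversion earns its cost only on tables too irregular for templates.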
Strategy 3: Page-as-Image
Don't extract the table at all. Let ColPali see it as part of the page image. The spatial relationships are preserved implicitly. Currently the most reliable approach.
Four Embedding Strategies for Mixed Content
Page-as-Image with Late Interaction (ColPali)
Convert each document page to an image (ColPali's processor resizes to 448x448). Embed with ColPali to produce ~1030 patch vectors per page. Query uses MaxSim for fine-grained matching.
Best for: PDF reports, scanned documents, any document where layout carries meaning.
Storage: ~512KB per page (1030 x 128-dim float32 vectors).
Caption-Then-Embed
Use a VLM (GPT-4o, Claude, Gemini) to generate detailed text descriptions of images, charts, and tables. Embed the captions with a text embedding model alongside text chunks.
Best for: When you need text-based vector search and can afford LLM calls at index time.
Storage: ~3KB per chunk (768-dim float32 vector).
Unified Embedding Space (CLIP/SigLIP)
Embed images and text into the same vector space using CLIP or SigLIP. Query with text, retrieve images directly via cosine similarity.
Best for: Photo search, product catalogs, natural image collections.
Limitation: Poor on documents, tables, charts — trained on natural images.
Hybrid: Parallel Text + Vision Indexes
Maintain separate indexes: text chunks in one, page images in another. Query both, combine results with reciprocal rank fusion (RRF) or learned weights.
Best for: Maximum recall when some documents are text-heavy and others are visual.
Cost: 2x indexing compute, more complex orchestration.
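The fusion step itself is a few lines. A sketch of reciprocal rank fusion over two ranked lists (doc ids are illustrative; `k=60` is the constant from the original RRF formulation):

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked result lists with RRF: score(d) = sum 1/(k + rank).

    result_lists: each an ordered list of doc ids (best first, rank 1).
    Documents appearing high in multiple lists accumulate the most score.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

text_hits  = ["page_7", "page_2", "page_9"]   # from the text index
image_hits = ["page_7", "page_9", "page_4"]   # from the ColPali index
fused = reciprocal_rank_fusion([text_hits, image_hits])
print(fused[0][0])  # page_7 -- ranked first in both lists
```

RRF's appeal is that it needs no score normalization: only ranks matter, so the fragile cross-index calibration problem from the extract-then-embed pattern disappears.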
Code: Caption-Then-Embed Pipeline
The most practical approach for teams without GPU infrastructure. Uses an LLM to convert visual content to text, then embeds everything with a standard text model.
import base64
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")

def caption_image(image_path: str, context: str = "") -> str:
    """Use GPT-4o to generate a retrieval-optimized caption."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    prompt = (
        "You are generating a detailed description of this image for "
        "a search index. Include ALL visible text, numbers, labels, "
        "legends, axis titles, and visual patterns. If it's a table, "
        "reproduce the full table structure. If it's a chart, describe "
        "trends, values, and comparisons."
    )
    if context:
        prompt += f"\nSurrounding document context: {context}"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{image_b64}"
                }}
            ]
        }],
        max_tokens=1000
    )
    return response.choices[0].message.content

def build_multimodal_index(documents: list):
    """Build a unified text index from mixed content."""
    entries = []
    for doc in documents:
        if doc["type"] == "text":
            entries.append({
                "content": doc["text"],
                "source": doc["source"],
                "modality": "text",
            })
        elif doc["type"] == "image":
            caption = caption_image(
                doc["path"],
                context=doc.get("surrounding_text", "")
            )
            entries.append({
                "content": caption,
                "source": doc["path"],
                "modality": "image",
                "caption": caption,  # store for display
            })
        elif doc["type"] == "table":
            # Tables get BOTH Markdown and NL description
            caption = caption_image(doc["screenshot_path"])
            entries.append({
                "content": caption,
                "source": doc["source"],
                "modality": "table",
                "markdown": doc.get("markdown", ""),
            })
    texts = [e["content"] for e in entries]
    embeddings = embedder.encode(texts, normalize_embeddings=True)
    return entries, embeddings  # Store in your vector DB
Cost consideration
Captioning with GPT-4o costs ~$0.01–0.03 per image (depending on resolution). For a 500-page PDF with 100 images and 50 tables, expect ~$1.50–4.50 in captioning costs at index time. This is a one-time cost per document version.
Code: ColPali Retrieval Pipeline
The page-as-image approach. Requires a GPU for indexing but eliminates all preprocessing.
from colpali_engine.models import ColPali, ColPaliProcessor
from pdf2image import convert_from_path
import torch

# Load ColPali (requires ~6GB GPU memory)
model = ColPali.from_pretrained(
    "vidore/colpali-v1.2",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2")

def index_pdf(pdf_path: str) -> list[torch.Tensor]:
    """Convert each page to an image and embed it."""
    pages = convert_from_path(pdf_path, dpi=144)
    page_embeddings = []
    for page_image in pages:
        inputs = processor.process_images([page_image])
        inputs = {k: v.to("cuda") for k, v in inputs.items()}
        with torch.no_grad():
            embs = model(**inputs)  # (1, n_patches, 128)
        page_embeddings.append(embs.squeeze(0).cpu())
    return page_embeddings  # List of (n_patches, 128) tensors

def search(query: str, page_embeddings: list, top_k: int = 5):
    """Retrieve most relevant pages using MaxSim scoring."""
    inputs = processor.process_queries([query])
    inputs = {k: v.to("cuda") for k, v in inputs.items()}
    with torch.no_grad():
        q_embs = model(**inputs).squeeze(0)  # (n_tokens, 128)
    scores = []
    for page_idx, p_embs in enumerate(page_embeddings):
        # MaxSim: for each query token, max similarity across patches
        sim_matrix = q_embs @ p_embs.to("cuda").T  # (n_tokens, n_patches)
        max_sims = sim_matrix.max(dim=1).values    # (n_tokens,)
        score = max_sims.sum().item()
        scores.append((page_idx, score))
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_k]

# Usage
embeddings = index_pdf("quarterly_report.pdf")
results = search("Q3 revenue breakdown by segment", embeddings)
for page_idx, score in results:
    print(f"Page {page_idx + 1}: score={score:.2f}")
On an A100 GPU, ColPali indexes ~2–3 pages/second. A 100-page PDF indexes in ~40 seconds. Query latency is ~50ms for 1000 pages (MaxSim over precomputed embeddings). For production, store patch embeddings in a vector database with multi-vector support (Vespa, Qdrant).
When to Use What: Decision Matrix
| Scenario | Approach | Why |
|---|---|---|
| PDF reports with charts, tables | ColPali | Layout matters, extraction errors are costly |
| Scanned contracts / historical docs | ColPali | No OCR needed, handles low-quality scans |
| E-commerce product search | CLIP/SigLIP | Natural images, trained on similar distribution |
| Mostly-text docs with some figures | Caption-then-embed | Simple, no GPU needed, good enough for sparse visuals |
| Mixed corpus (text + visual docs) | Hybrid indexes | Maximum recall across diverse content types |
| High-volume, latency-critical | Caption-then-embed | Single-vector search is 10x faster than multi-vector |
| Academic papers with equations | ColPali or Nougat+embed | LaTeX extraction is brittle, vision handles equations natively |
Rule of Thumb (2025)
Start with ColPali if you have GPU access and your documents are visually rich. Start with caption-then-embed if you don't have GPUs or your documents are mostly text. Add a hybrid layer when neither alone achieves the recall you need. The field is moving decisively toward vision-first approaches — ColPali-style architectures will likely be the default within 12 months.
Production Architecture Patterns
Building a multimodal RAG system in production requires more than just an embedding model. Here are three architecture patterns that work at scale.
Pattern 1: Retrieve-Then-Read
Query → ColPali retrieves top-5 page images
→ VLM (GPT-4o / Claude) reads retrieved pages
→ Generates answer with page citations
# Pros: High accuracy, visual grounding
# Cons: VLM API cost per query (~$0.02-0.10)
# Best for: Internal knowledge bases, compliance queries

Pattern 2: Retrieve-Summarize-Read
Query → ColPali retrieves top-20 page images
→ Lightweight VLM extracts relevant snippets from each page
→ Text snippets concatenated as context
→ Text LLM generates final answer
# Pros: Cheaper than Pattern 1 (text LLM is cheaper than VLM)
# Cons: Lossy summarization step, more complex pipeline
# Best for: High-volume customer-facing systems

Pattern 3: Dual-Index Fusion
Query → Text embedder searches text chunk index → top-10 text results
→ ColPali searches page image index → top-10 page results
→ Reciprocal Rank Fusion merges both lists
→ Top-5 unique results sent to LLM for answer
# Pros: Maximum recall, handles all document types
# Cons: 2x indexing cost, fusion tuning required
# Best for: Enterprise search across heterogeneous document collections

Open Problems and Active Research
Multimodal RAG is a rapidly evolving field. Several hard problems remain unsolved as of early 2025:
Multi-page reasoning
Current systems retrieve individual pages. But many questions require synthesizing information across multiple pages — "How did the risk factors change between this year's and last year's 10-K?" requires comparing two different pages from two different documents. No current retrieval model handles this natively; it requires multi-hop orchestration at the application layer.
Storage efficiency for late interaction
ColPali stores ~1030 vectors per page vs 1 vector for single-vector models. For a million-page corpus, this means ~500GB of embedding storage vs ~3GB. Quantization (int8, binary) helps, but the gap remains significant. Research into embedding compression (Product Quantization, Matryoshka-style dimensionality reduction for multi-vector) is active.
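Binary quantization is the most aggressive of these options: keep only the sign bit of each dimension. A numpy sketch of the storage math (production systems typically rerank binary candidates against full-precision vectors to recover accuracy):

```python
import numpy as np

def binarize(embs):
    """Binary quantization: keep only the sign of each dimension.

    128 float16 dims (256 bytes) become 128 bits (16 bytes) per
    vector, a 16x reduction; Hamming distance on the packed bits
    then approximates cosine similarity on the originals.
    """
    return np.packbits(embs > 0, axis=-1)  # (n, 128) -> (n, 16) uint8

page = np.random.randn(1030, 128).astype(np.float16)  # one ColPali page
packed = binarize(page)
print(page.nbytes)    # 263680 bytes (~258 KB) at float16
print(packed.nbytes)  # 16480 bytes (~16 KB) binarized
```

At this compression a million-page corpus drops from hundreds of gigabytes to roughly 16GB of patch embeddings, which is why binary-then-rerank pipelines are an active research direction for late-interaction retrieval.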
Sub-page retrieval
ColPali retrieves whole pages, but often only a small region is relevant. Can we use the patch-level similarity scores to highlight or crop the relevant region? Early work on "attention heatmaps" from ColPali shows promise — the MaxSim scores naturally form a spatial attention map over the page.
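The heatmap idea falls out of the MaxSim computation almost for free. A sketch (assuming a 32x32 patch grid; `patch_heatmap` is illustrative, not a library function):

```python
import numpy as np

def patch_heatmap(q_embs, p_embs, grid=32):
    """Turn patch-level similarities into a spatial relevance map.

    q_embs: (n_tokens, d) query token embeddings
    p_embs: (grid*grid, d) page patch embeddings
    Each patch's score is its best match over the query tokens;
    reshaping to the patch grid yields a page-shaped heatmap that
    can drive highlighting or cropping of the relevant region.
    """
    sim = q_embs @ p_embs.T        # (n_tokens, n_patches)
    per_patch = sim.max(axis=0)    # best query token per patch
    return per_patch.reshape(grid, grid)

q = np.random.randn(5, 128)       # toy query token embeddings
p = np.random.randn(1024, 128)    # toy page patch embeddings
heat = patch_heatmap(q, p)
hot_row, hot_col = np.unravel_index(heat.argmax(), heat.shape)
```

Mapping `(hot_row, hot_col)` back to pixel coordinates via the patch size recovers the hottest region of the page, which is the basis of the early attention-heatmap work mentioned above.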
Evaluation benchmarks
ViDoRe (the main benchmark ColPali uses) covers academic and financial documents but lacks coverage for many real-world domains: medical records, engineering drawings, legal contracts with handwritten annotations, multilingual documents. The community needs broader, more realistic evaluation sets.
Key Takeaways
- 1
Text-only RAG is fundamentally blind -- charts, tables, diagrams, and figures carry critical information that text extraction degrades or destroys.
- 2
ColPali inverted the paradigm -- instead of extracting content from pages, it embeds pages as images with late interaction. Simpler pipeline, better results.
- 3
Tables are the hardest modality -- spatial relationships between cells are destroyed by text extraction. Vision-based approaches handle them most reliably.
- 4
Caption-then-embed is the practical fallback -- when you lack GPU infrastructure, using an LLM to caption visual content and embedding the captions gives 80% of the benefit at minimal infrastructure cost.
- 5
The field is moving fast -- ColPali was published in June 2024 and has already spawned ColQwen2, ColSmol, and DSE. Vision-first retrieval is becoming the default for document understanding.
Further Reading
- ColPali paper -- Faysse et al. (2024). The foundational paper for vision-language document retrieval.
- CLIP paper -- Radford et al. (2021). The scaling breakthrough that made cross-modal embedding practical.
- ColBERT paper -- Khattab & Zaharia (2020). The late interaction mechanism that ColPali adapts for vision.
- PaliGemma paper -- Beyer et al. (2024). The VLM backbone used in ColPali.
- ColPali blog post -- Manuel Faysse's walkthrough with code examples and benchmarks.