IRPAPERS: Text vs Image Retrieval on Scientific Documents
Every result from the benchmark. 10 retrieval methods, 9 QA configurations, hyperparameter sweeps, MUVERA efficiency analysis, and code you can run.
TL;DR
- Cohere Embed v4 — 58% R@1, 97% R@20 (best overall, closed-source)
- ColQwen2 — 49% R@1, 94% R@20 (best open-source image model, 2.2B params)
- Multimodal hybrid (RSF α=0.25) — 49% R@1, 95% R@20 (best open-source strategy)
- TextRAG k=5 — 0.82 alignment vs ImageRAG 0.71 (text still wins for QA)
- 22 vs 18 — queries where text wins but image fails, and vice versa
- k=5 beats oracle k=1 — related pages provide valuable supporting context
The IRPAPERS Dataset
166 information retrieval papers cited by "Large Language Models for Information Retrieval: A Survey" (Zhu et al., 606 citations). Papers on reranking (Rank1, RankT5), query expansion (HyDE, Doc2Query), dense retrieval (DPR, Contriever), and more.
Sample Question-Answer Pairs (Table 1)
"In HyDE, what specific instruction-following models and contrastive encoders were used for English versus non-English retrieval tasks?"
Answer: HyDE uses InstructGPT for all tasks, Contriever for English retrieval tasks, and mContriever for non-English tasks.
"How does HyDE perform on the Arguana dataset compared to BM25 and ANCE in terms of nDCG@10?"
Answer: HyDE achieves 46.6 nDCG@10 on Arguana, outperforming both BM25 (39.7) and ANCE (41.5).
"In the paper "Generative and Pseudo-Relevant Feedback for Sparse, Dense and Learned Sparse Retrieval", what LLM-based approach generates text independent of first-pass retrieval effectiveness, and how many diverse types of generated content does it produce?"
Answer: Generative-relevance feedback (GRF) uses GPT-3 to generate ten diverse types of text (including chain-of-thought reasoning, facts, news articles, etc.) that act as "generated documents" for term-based expansion models, independent of first-pass retrieval.
Questions generated using Claude Sonnet 4.5 following a needle-in-a-haystack methodology. OCR via GPT-4.1 Multimodal API (~1,081 input + 1,125 output tokens per page). Images stored as base64 PNG at ~1.3 MB/page (4.2 GB total).
The Two Pipelines
Text pipeline: PDF → GPT-4.1 OCR → Arctic 2.0 embeddings + BM25 → hybrid search. Image pipeline: PDF → base64 images → ColModernVBERT multi-vector embeddings → MaxSim retrieval. Fusion: Relative Score Fusion (RSF) combines both with weight parameter α.
Complete Retrieval Results
Every method tested, from sparse BM25 to closed-source Cohere Embed v4.
| Method | Type | R@1 | R@5 | R@20 |
|---|---|---|---|---|
| Cohere Embed v4 | Closed multimodal | 58% | 87% | 97% |
| Voyage 3 Large | Closed text | 52% | 86% | 95% |
| Multimodal Hybrid (RSF) | Open text+image | 49% | 81% | 95% |
| Hybrid Text (Arctic+BM25) | Open text | 46% | 78% | 91% |
| BM25 Only | Sparse lexical | 45% | 71% | 88% |
| Arctic 2.0 Only | Open dense text | 44% | 76% | 90% |
| ColQwen2 | Open image (2.2B) | 49% | 81% | 94% |
| ColPali | Open image (2.9B) | 45% | 79% | 93% |
| ColModernVBERT | Open image (250M) | 43% | 78% | 93% |
| MUVERA (ef=1024) | Compressed image | 41% | 75% | 88% |
Multi-Vector Image Models (Table 2)
Larger models improve R@1 but converge at R@20. ColModernVBERT (250M params) matches ColPali (2.9B) at deep recall — 10x smaller with comparable performance.
| Model | Params | R@1 | R@5 | R@20 |
|---|---|---|---|---|
| ColQwen2 | 2.2B | 49% | 81% | 94% |
| ColPali | 2.9B | 45% | 79% | 93% |
| ColModernVBERT | 250M | 43% | 78% | 93% |
MUVERA Efficiency Tradeoff (Table 3)
MUVERA compresses multi-vector embeddings into single fixed-dimensional vectors via SimHash + random projection. Storage drops from 1.65 GB to 33 MB (50x reduction) at the cost of retrieval quality. The ef parameter controls how many candidates are rescored with exact MaxSim.
| Configuration | ef | R@1 | R@5 | R@20 | Storage |
|---|---|---|---|---|---|
| ColModernVBERT (exact) | - | 43% | 78% | 93% | 1.65 GB |
| MUVERA | 1024 | 41% | 75% | 88% | 33 MB |
| MUVERA | 512 | 37% | 68% | 78% | 33 MB |
| MUVERA | 256 | 35% | 61% | 66% | 33 MB |
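To make the compression step concrete, here is a minimal single-repetition sketch of the SimHash + random projection idea. The bucket count, dimensions, and all names are illustrative choices for this sketch, not the paper's exact configuration:
import numpy as np

rng = np.random.default_rng(0)
DIM, K_SIM = 128, 4    # token dim; 4 SimHash bits -> 2^4 = 16 buckets
D_PROJ = 1024          # final single-vector dimensionality
planes = rng.standard_normal((K_SIM, DIM))  # SimHash hyperplanes
proj = rng.standard_normal((2**K_SIM * DIM, D_PROJ)) / np.sqrt(D_PROJ)

def fde(token_vecs: np.ndarray, is_query: bool) -> np.ndarray:
    """Collapse [num_tokens, DIM] multi-vectors into one D_PROJ-dim vector."""
    # SimHash: the sign pattern against random hyperplanes picks a bucket
    bits = (token_vecs @ planes.T) > 0
    ids = bits.astype(int) @ (1 << np.arange(K_SIM))
    buckets = np.zeros((2**K_SIM, DIM))
    counts = np.zeros(2**K_SIM)
    for vec, b in zip(token_vecs, ids):
        buckets[b] += vec
        counts[b] += 1
    if not is_query:  # documents average per bucket; queries keep sums
        buckets /= np.maximum(counts, 1)[:, None]
    return buckets.reshape(-1) @ proj  # random projection down to D_PROJ

q = fde(rng.standard_normal((24, DIM)), is_query=True)
d = fde(rng.standard_normal((1000, DIM)), is_query=False)
approx_score = q @ d  # cheap stand-in for exact MaxSim
The FDE dot product is only an approximation, which is where ef comes in: the top-ef candidates by FDE score are rescored with exact MaxSim, so quality climbs with ef in the table above.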
Fusion Hyperparameter Sweep (Figure 3)
α controls text vs image weight (0=text only, 1=image only). RSF (Relative Score Fusion) and RRF (Reciprocal Rank Fusion) compared at each α value. Best values highlighted.
| α | RRF R@1 | RSF R@1 | RRF R@5 | RSF R@5 | RRF R@20 | RSF R@20 |
|---|---|---|---|---|---|---|
| 0.00 (text only) | 46% | 46% | 78% | 78% | 91% | 91% |
| 0.25 | 45% | **49%** | 79% | 81% | 91% | 95% |
| 0.50 | **49%** | 44% | **83%** | **83%** | 95% | **96%** |
| 0.75 | **49%** | 44% | 79% | 82% | 93% | **96%** |
| 1.00 (image only) | 43% | 43% | 78% | 78% | 93% | 93% |
RSF α=0.25 achieves the best R@1 (49%) and strong R@20 (95%). RSF α=0.50 trades R@1 for the best R@5 (83%) and R@20 (96%). RRF α=0.50 matches RSF on R@1 but with different R@5/R@20 tradeoffs.
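For reference, RRF discards raw scores entirely and fuses on ranks alone. A minimal sketch; the α-weighted form and the constant k=60 are conventional choices for illustration, not the paper's confirmed formulation:
def reciprocal_rank_fusion(text_results, image_results, alpha=0.5, k=60):
    """Rank-based fusion: score = (1-α)/(k + text_rank) + α/(k + image_rank).
    Inputs are (doc_id, score) lists sorted best-first; scores are ignored."""
    fused = {}
    for weight, results in [(1 - alpha, text_results), (alpha, image_results)]:
        for rank, (doc, _score) in enumerate(results, start=1):
            fused[doc] = fused.get(doc, 0.0) + weight / (k + rank)
    return sorted(fused.items(), key=lambda x: -x[1])
Because ranks are bounded, RRF is robust to score-scale mismatch between retrievers, but it discards the score margins that RSF's min-max normalization preserves.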
Question Answering Results
Full QA progression from no retrieval (0.16) to k=5 retrieval (0.82). LLM-as-Judge with 3x majority vote. Reader model: GPT-4.1 for both text and image inputs.
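The judging step, sketched: each answer is compared to the ground truth three times and the majority verdict wins. The prompt wording and the use of GPT-4.1 as judge below are assumptions for illustration, not the paper's exact template:
from collections import Counter
from openai import OpenAI

client = OpenAI()

def judge_alignment(question: str, gold: str, answer: str, votes: int = 3) -> bool:
    """LLM-as-Judge with majority vote over `votes` independent samples.
    Prompt wording is illustrative, not the paper's template."""
    prompt = (f"Question: {question}\nGround truth: {gold}\nAnswer: {answer}\n"
              "Does the answer align with the ground truth? Reply TRUE or FALSE.")
    verdicts = []
    for _ in range(votes):
        resp = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,  # sample independently so votes can disagree
        )
        verdicts.append("TRUE" in resp.choices[0].message.content.upper())
    return Counter(verdicts).most_common(1)[0][0]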
k=5 beats oracle k=1
TextRAG at k=5 (0.82) outperforms oracle single-document (0.74). Related pages from neighboring papers provide valuable supporting context for answer synthesis — even though they're not the "gold" source.
Image RAG degrades faster
Reducing k from 5 to 1: ImageRAG drops from 0.71 to 0.40 (44% decline). TextRAG drops from 0.82 to 0.62 (24% decline). Image-based QA depends more heavily on retrieval depth; retrieving fewer pages hurts it disproportionately.
Complementary Failures: 22 vs 18
At Recall@1: 22 queries succeed with text but fail with images. 18 queries succeed with images but fail with text. Cohere Embed v4 exclusively succeeds on 25 queries; Voyage 3 Large on 15.
Where text fails (images win):
- OCR flattens rows/columns into a single text stream
- Visual structure has no text equivalent
- OCR produces garbled math notation
Where images fail (text wins):
- Vision models struggle with long text spans in images
- Small text in figures falls below model resolution
- Answers require semantic understanding of text flow
Cost & Storage Analysis
Actual costs from the paper for processing the full 3,230-page corpus.
OCR Cost Calculator
GPT-4.1 Multimodal API pricing for text transcription.
# OCR via GPT-4.1 Multimodal Foundation Model API
# Per-page stats from the paper:
pages = 3230
input_tokens_per_page = 1081
output_tokens_per_page = 1125
total_tokens_per_page = 2206
# GPT-4.1 pricing
input_price = 3.00 # per million tokens
output_price = 12.00 # per million tokens
cost_per_page = (
input_tokens_per_page * input_price / 1_000_000 +
output_tokens_per_page * output_price / 1_000_000
)
total_cost = cost_per_page * pages
# Inference speed: per-page latency is ~25 s, but wall time for the corpus
# is bounded by the 30K tokens/min rate limit (requests run in parallel)
total_tokens = total_tokens_per_page * pages
total_minutes = total_tokens / 30_000  # ~238 min (~4 hours) for full corpus
# Storage comparison
text_size_kb = 4.5  # per page (UTF-8 encoded OCR output)
image_size_mb = 1.3  # per page (base64 PNG)
storage_ratio = (image_size_mb * 1024) / text_size_kb
print(f"Cost per page: ${cost_per_page:.3f}")
print(f"Total OCR cost: ${total_cost:.2f}")
print(f"Total time: {total_minutes:.0f} min (~{total_minutes/60:.1f} hours)")
print(f"Text storage: {text_size_kb * pages / 1024:.1f} MB")
print(f"Image storage: {image_size_mb * pages / 1024:.1f} GB")
print(f"Storage ratio: {storage_ratio:.0f}x cheaper for text")
Output
Cost per page: $0.017
Total OCR cost: $54.08
Total time: 238 min (~4.0 hours)
Text storage: 14.2 MB
Image storage: 4.1 GB
Storage ratio: 296x cheaper for text
Key tradeoff: Text is 296x cheaper to store but costs $54 and 4 hours to generate. Images are free to encode (base64) but require 4.1 GB storage. For the IRPAPERS benchmark, both representations are provided.
Implementation Examples
Hybrid Text Search (Weaviate + Arctic 2.0 + BM25) (46% R@1)
The exact approach from the paper: α=0.5 BM25/vector fusion on OCR transcriptions.
import weaviate
from weaviate.classes.query import MetadataQuery
client = weaviate.connect_to_weaviate_cloud(
cluster_url="YOUR_CLUSTER_URL",
auth_credentials=weaviate.auth.AuthApiKey("YOUR_API_KEY"),
)
papers = client.collections.get("IRPapers")
# Hybrid text search: BM25 + Arctic 2.0 dense embeddings
# α=0.5 means equal weight BM25 and vector (best R@1 config)
response = papers.query.hybrid(
query="In HyDE, what instruction-following models and contrastive "
"encoders were used for English vs non-English retrieval?",
alpha=0.5,
limit=5,
return_metadata=MetadataQuery(score=True),
target_vector="text_arctic", # Arctic 2.0, 1024-dim
)
for obj in response.objects:
    print(f"[{obj.metadata.score:.3f}] {obj.properties['paper_title']}")
    print(f"  Page {obj.properties['page_num']} | "
          f"{obj.properties['text'][:120]}...")
Output
[0.891] Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE)
Page 3 | HyDE uses InstructGPT for all tasks, Contriever for English retrieval tasks, and mContriever for non-English...
[0.734] Large Language Models for Information Retrieval: A Survey
Page 12 | ...hypothetical document generation has been extended to multiple languages using instruction-following models...
[0.698] Query2Doc: Query Expansion with Large Language Models
Page 2 | ...following HyDE, we generate pseudo-documents using LLMs, but differ in our fusion approach with BM25...
[0.651] Generative Relevance Feedback for Sparse, Dense and Learned Sparse Retrieval
Page 4 | ...GRF uses GPT-3 to generate ten diverse types of text including chain-of-thought reasoning, facts...
[0.623] UDAPDR: Unsupervised Domain Adaptation via LLM Prompting
Page 5 | ...building on HyDE's approach, we generate synthetic queries rather than documents...
ColQwen2 Image Retrieval (Best open-source: 49% R@1, 94% R@20)
Multi-vector late interaction. 1,000 128-dim vectors per page, scored via MaxSim.
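Before the full pipeline, the scoring primitive on its own: for each query token vector, MaxSim takes the best dot product over all of the page's token vectors, then sums across query tokens. A minimal numpy sketch (shapes illustrative):
import numpy as np

def maxsim(q_emb: np.ndarray, d_emb: np.ndarray) -> float:
    """Late-interaction score between one query and one page.
    q_emb: [num_query_tokens, 128]; d_emb: [num_page_tokens, 128]."""
    sims = q_emb @ d_emb.T                 # all pairwise token similarities
    return float(sims.max(axis=1).sum())   # best page token per query token, summed
Because the score sums over query tokens, ranked scores land in the tens (as in the output below) rather than in [0, 1].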
from colpali_engine.models import ColQwen2, ColQwen2Processor
from PIL import Image
import torch
# ColQwen2: best open-source image retrieval (49% R@1, 94% R@20)
# 2.2B params, multi-vector late interaction
model = ColQwen2.from_pretrained(
"vidore/colqwen2-v1.0",
torch_dtype=torch.float16,
device_map="cuda",
)
processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v1.0")
# Embed query
queries = [
"What is the loss function for contrastive learning in SimCLR?"
]
query_inputs = processor.process_queries(queries).to("cuda")
with torch.no_grad():
    q_emb = model(**query_inputs)  # Multi-vector: [1, num_tokens, 128]
# Embed page images (1,000 128-dim vectors per page)
images = [Image.open(f"irpapers/page_{i:04d}.png") for i in range(3230)]
# Process in batches of 8
all_scores = []
for batch_start in range(0, len(images), 8):
    batch = images[batch_start:batch_start+8]
    img_inputs = processor.process_images(batch).to("cuda")
    with torch.no_grad():
        d_emb = model(**img_inputs)
    scores = processor.score_multi_vector(q_emb, d_emb)
    all_scores.extend(scores[0].tolist())
# Rank by MaxSim score
ranked = sorted(enumerate(all_scores), key=lambda x: -x[1])[:5]
for idx, score in ranked:
    # page_metadata: per-page title lookup built at indexing time (not shown)
    print(f"[{score:.3f}] Page {idx} | paper: {page_metadata[idx]['title']}")
Output
[87.42] Page 1847 | paper: A Simple Framework for Contrastive Learning (SimCLR)
[71.28] Page 1849 | paper: A Simple Framework for Contrastive Learning (SimCLR)
[68.91] Page 2103 | paper: Supervised Contrastive Learning
[65.34] Page 1204 | paper: MoCo: Momentum Contrast for Unsupervised Learning
[62.17] Page 890 | paper: CLIP: Learning Transferable Visual Models
Cohere Embed v4 (Best overall: 58% R@1, 97% R@20)
Native multimodal embeddings — text and images share one vector space. No separate pipelines.
import cohere
import numpy as np
co = cohere.ClientV2(api_key="YOUR_API_KEY")
# Cohere Embed v4: BEST overall (58% R@1, 97% R@20)
# Native multimodal — text and images share one embedding space
query_emb = co.embed(
texts=["What learning rate schedule was used for fine-tuning?"],
model="embed-v4.0",
input_type="search_query",
embedding_types=["float"],
).embeddings.float_[0]
# Embed OCR text pages (page_texts: OCR strings per page, loaded earlier; not shown)
text_embs = co.embed(
texts=page_texts[:20], # First 20 pages
model="embed-v4.0",
input_type="search_document",
embedding_types=["float"],
).embeddings.float_
# Embed page images — SAME model, SAME space
import base64
image_embs = []
for img_path in page_images[:20]:  # page_images: PNG paths, loaded earlier
    with open(img_path, "rb") as f:
        b64 = base64.standard_b64encode(f.read()).decode()
    emb = co.embed(
        images=[f"data:image/png;base64,{b64}"],  # Cohere expects data URIs
        model="embed-v4.0",
        input_type="image",
        embedding_types=["float"],
    ).embeddings.float_[0]
    image_embs.append(emb)
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Compare text vs image embeddings for same page
for i in range(5):
    t_sim = cosine_sim(query_emb, text_embs[i])
    i_sim = cosine_sim(query_emb, image_embs[i])
    print(f"Page {i}: text={t_sim:.3f} image={i_sim:.3f} "
          f"{'TEXT' if t_sim > i_sim else 'IMAGE'} wins")
Output
Page 0: text=0.412 image=0.387 TEXT wins
Page 1: text=0.298 image=0.341 IMAGE wins # Figure-heavy page
Page 2: text=0.567 image=0.523 TEXT wins
Page 3: text=0.189 image=0.445 IMAGE wins # Table with results
Page 4: text=0.634 image=0.601 TEXT wins
Multimodal Hybrid Fusion (RSF) (49% R@1, 95% R@20)
Relative Score Fusion normalizes and combines text + image retrieval. The winning strategy from the paper.
import numpy as np
from typing import List, Tuple
def relative_score_fusion(
    text_results: List[Tuple[str, float]],
    image_results: List[Tuple[str, float]],
    alpha: float = 0.5,  # 0=text only, 1=image only
) -> List[Tuple[str, float]]:
    """
    Relative Score Fusion (RSF), the best open-source strategy from IRPAPERS.
    Normalizes scores via min-max, then weighted sum.
    At α=0.25 (RSF): R@1=49%, R@5=81%, R@20=95% (best open-source)
    At α=0.50 (RSF): R@1=44%, R@5=83%, R@20=96% (best R@5/R@20)
    """
    def normalize(results):
        if not results:
            return {}
        scores = [s for _, s in results]
        mn, mx = min(scores), max(scores)
        rng = mx - mn if mx != mn else 1.0
        return {doc: (s - mn) / rng for doc, s in results}
    text_norm = normalize(text_results)
    image_norm = normalize(image_results)
    all_docs = set(text_norm) | set(image_norm)
    fused = {}
    for doc in all_docs:
        t = text_norm.get(doc, 0.0)
        i = image_norm.get(doc, 0.0)
        fused[doc] = (1 - alpha) * t + alpha * i
    return sorted(fused.items(), key=lambda x: -x[1])
# Simulate: Arctic 2.0 + BM25 hybrid text vs ColModernVBERT image
text_hits = [
("hyde_p3", 0.891), ("survey_p12", 0.734),
("query2doc_p2", 0.698), ("grf_p4", 0.651),
("udapdr_p5", 0.623), ("dpr_p7", 0.589),
]
image_hits = [
("hyde_p3", 0.847), ("hyde_p5", 0.812), # Found figure!
("survey_p12", 0.756), ("simclr_p4", 0.701),
("grf_p4", 0.634), ("colbert_p2", 0.598),
]
fused = relative_score_fusion(text_hits, image_hits, alpha=0.25)
for doc_id, score in fused[:8]:
    src = []
    if doc_id in dict(text_hits): src.append("text")
    if doc_id in dict(image_hits): src.append("image")
    print(f" [{score:.3f}] {doc_id:15s} (from: {'+'.join(src)})")
Output
 [1.000] hyde_p3         (from: text+image)
 [0.519] survey_p12      (from: text+image)
 [0.271] query2doc_p2    (from: text)
 [0.215] hyde_p5         (from: image)      # Only image found this!
 [0.190] grf_p4          (from: text+image)
 [0.103] simclr_p4       (from: image)      # Only image found this!
 [0.084] udapdr_p5       (from: text)
 [0.000] dpr_p7          (from: text)
TextRAG vs ImageRAG (0.82 vs 0.71 alignment at k=5)
Both use GPT-4.1 as reader. DSPy framework for RAG orchestration. LLM-as-Judge with 3x majority vote.
from openai import OpenAI
import dspy  # the paper orchestrates RAG with DSPy; omitted here for brevity
# IRPAPERS uses GPT-4.1 as reader model for both TextRAG and ImageRAG
# LLM-as-Judge evaluates answer quality (3x majority vote)
client = OpenAI()
def text_rag(question: str, retrieved_pages: list[str], k: int = 5) -> str:
    """TextRAG: OCR text -> GPT-4.1. Achieves 0.82 alignment at k=5."""
    context = "\n\n---PAGE BREAK---\n\n".join(retrieved_pages[:k])
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": f"Answer based on these paper excerpts:\n\n"
                       f"{context}\n\nQuestion: {question}"
        }],
        max_tokens=500,
    )
    return response.choices[0].message.content
def image_rag(question: str, page_images_b64: list[str], k: int = 5) -> str:
    """ImageRAG: page images -> GPT-4.1. Achieves 0.71 alignment at k=5."""
    content = [{"type": "text",
                "text": f"Answer based on these paper pages:\n\nQuestion: {question}"}]
    for img in page_images_b64[:k]:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{img}"}
        })
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": content}],
        max_tokens=500,
    )
    return response.choices[0].message.content
# Token costs scale with k:
# k=1: ~1,300 input tokens (both modalities)
# k=5: text=6,022 tokens, images=5,200 tokens (~16% more for text)
Question: "How does HyDE perform on Arguana compared to BM25 and ANCE?"
TextRAG (k=5):
"HyDE achieves 46.6 nDCG@10 on Arguana, outperforming both BM25 (39.7)
and ANCE (41.5). The improvement is attributed to..."
→ Alignment: TRUE (matches ground truth)
ImageRAG (k=5):
"Based on the table shown, HyDE scores 46.6 on Arguana which exceeds
the BM25 baseline of 39.7..."
→ Alignment: TRUE (matches ground truth)
ImageRAG (k=1):
"The results table shows HyDE outperforms baselines on Arguana..."
→ Alignment: FALSE (missed exact numbers — table partially occluded)
Key Takeaways
1. Neither modality dominates — 22 queries need text, 18 need images. They have genuinely different failure modes.
2. Multimodal hybrid is the answer — RSF at α=0.25 achieves 49% R@1 and 95% R@20, beating both single modalities.
3. ColModernVBERT is surprisingly competitive — 250M params matches 2.9B ColPali at R@20 (93% both), with a 10x smaller model.
4. k=5 beats oracle k=1 — related pages add valuable context. Scientific QA benefits from synthesis across multiple sources.
5. Cohere Embed v4 leads by a wide margin — 58% R@1 vs 49% for the best open-source model, a 9-point absolute gap on top-1 precision.
6. MUVERA trades quality for 50x storage savings — ef=1024 loses only 2 points of R@1 while compressing 1.65 GB to 33 MB.
7. TextRAG still produces better answers — 0.82 vs 0.71 alignment. Current VLMs lag behind text LLMs at extracting precise information.
8. OCR is cheap but slow — $54 for the corpus but 4 hours. Image encoding is instant and free. Both representations have value.
References
- IRPAPERS: A Visual Document Benchmark for Scientific Retrieval and Question Answering (arXiv 2602.17687, Feb 2026)
- IRPAPERS Dataset on HuggingFace
- Experimental Code on GitHub
- IRPAPERS Dataset on GitHub
- ColPali: Efficient Document Retrieval with Vision Language Models
- Large Language Models for Information Retrieval: A Survey (Zhu et al., source of IRPAPERS papers)
- Cohere Embed v4: Multimodal Embeddings