What is the IRPAPERS benchmark?

IRPAPERS is a benchmark of 3,230 pages from 166 scientific papers with 180 needle-in-the-haystack questions, comparing text-based vs image-based retrieval and QA on scientific documents. Created by the Weaviate team.

Which embedding model performs best on IRPAPERS?

Cohere Embed v4 achieves 58% Recall@1 and 97% Recall@20. Best open-source: ColQwen2 (49% R@1) for images, Arctic 2.0 + BM25 hybrid (46% R@1) for text.

Is text or image retrieval better for scientific documents?

Neither dominates. At Recall@1, 22 queries succeed with text but fail with images, while 18 succeed with images but fail with text. Hybrid search (49% R@1, 95% R@20) beats both.

How much does it cost to run IRPAPERS?

OCR transcription via GPT-4.1 costs $54.08 for the full 3,230-page corpus ($0.017/page). Image encoding is free (base64). Text storage is 296x cheaper than images (14 MB vs 4.1 GB).

Home / Guides / IRPAPERS

Benchmark Deep Dive

IRPAPERS: Text vs Image Retrieval on Scientific Documents

Every result from the benchmark. 10 retrieval methods, 9 QA configurations, hyperparameter sweeps, MUVERA efficiency analysis, and code you can run.

March 2026|20 min read|arXiv 2602.17687|Dataset|Code

TL;DR

Cohere Embed v4 — 58% R@1, 97% R@20 (best overall, closed-source)
ColQwen2 — 49% R@1, 94% R@20 (best open-source image model, 2.2B params)
Multimodal hybrid (RSF α=0.25) — 49% R@1, 95% R@20 (best open-source strategy)
TextRAG k=5 — 0.82 alignment vs ImageRAG 0.71 (text still wins for QA)
22 vs 18 — queries where text wins but image fails, and vice versa
k=5 beats oracle k=1 — related pages provide valuable supporting context

The IRPAPERS Dataset

166 information retrieval papers cited by "Large Language Models for Information Retrieval: A Survey" (Zhu et al., 606 citations). Papers on reranking (Rank1, RankT5), query expansion (HyDE, Doc2Query), dense retrieval (DPR, Contriever), and more.

3,230

Pages

Image + OCR per page

166

IR papers

From survey citations

180

Questions

From 19 papers

$54

OCR cost

GPT-4.1 for full corpus

Sample Question-Answer Pairs (Table 1)

"In HyDE, what specific instruction-following models and contrastive encoders were used for English versus non-English retrieval tasks?"

Answer: HyDE uses InstructGPT for all tasks, Contriever for English retrieval tasks, and mContriever for non-English tasks.

"How does HyDE perform on the Arguana dataset compared to BM25 and ANCE in terms of nDCG@10?"

Answer: HyDE achieves 46.6 nDCG@10 on Arguana, outperforming both BM25 (39.7) and ANCE (41.5).

"In the paper "Generative and Pseudo-Relevant Feedback for Sparse, Dense and Learned Sparse Retrieval", what LLM-based approach generates text independent of first-pass retrieval effectiveness, and how many diverse types of generated content does it produce?"

Answer: Generative-relevance feedback (GRF) uses GPT-3 to generate ten diverse types of text (including chain-of-thought reasoning, facts, news articles, etc.) that act as "generated documents" for term-based expansion models, independent of first-pass retrieval.

Questions generated using Claude Sonnet 4.5 following needle-in-the-haystack methodology. OCR via GPT-4.1 Multimodal API (~1,081 input + 1,125 output tokens per page). Images stored as base64 PNG at ~1.3 MB/page (4.2 GB total).

The Two Pipelines

Text pipeline: PDF → GPT-4.1 OCR → Arctic 2.0 embeddings + BM25 → hybrid search. Image pipeline: PDF → base64 images → ColModernVBERT multi-vector embeddings → MaxSim retrieval. Fusion: Relative Score Fusion (RSF) combines both with weight parameter α.

Complete Retrieval Results

Every method tested, from sparse BM25 to closed-source Cohere Embed v4.

Method	Type	R@1	R@5	R@20
Cohere Embed v4	Closed multimodal	58%	87%	97%
Voyage 3 Large	Closed text	52%	86%	95%
Multimodal Hybrid (RSF)	Open text+image	49%	81%	95%
Hybrid Text (Arctic+BM25)	Open text	46%	78%	91%
BM25 Only	Sparse lexical	45%	71%	88%
Arctic 2.0 Only	Open dense text	44%	76%	90%
ColQwen2	Open image (2.2B)	49%	81%	94%
ColPali	Open image (2.9B)	45%	79%	93%
ColModernVBERT	Open image (250M)	43%	78%	93%
MUVERA (ef=1024)	Compressed image	41%	75%	88%

Multi-Vector Image Models (Table 2)

Larger models improve R@1 but converge at R@20. ColModernVBERT (250M params) matches ColPali (2.9B) at deep recall — 10x smaller with comparable performance.

Model	Params	R@1	R@5	R@20
ColQwen2	2.2B	49%	81%	94%
ColPali	2.9B	45%	79%	93%
ColModernVBERT	250M	43%	78%	93%

MUVERA Efficiency Tradeoff (Table 3)

MUVERA compresses multi-vector embeddings into single fixed-dimensional vectors via SimHash + random projection. Storage drops from 1.65 GB to 33 MB (50x reduction) at the cost of retrieval quality. The ef parameter controls how many candidates are rescored with exact MaxSim.

Configuration	ef	R@1	R@5	R@20	Storage
ColModernVBERT (exact)	-	43%	78%	93%	1.65 GB
MUVERA	1024	41%	75%	88%	33 MB
MUVERA	512	37%	68%	78%	33 MB
MUVERA	256	35%	61%	66%	33 MB

Fusion Hyperparameter Sweep (Figure 3)

α controls text vs image weight (0=text only, 1=image only). RSF (Relative Score Fusion) and RRF (Reciprocal Rank Fusion) compared at each α value. Best values highlighted.

α	R@1		R@5		R@20
α	RRF	RSF	RRF	RSF	RRF	RSF
0.00 (text only)	46%	46%	78%	78%	91%	91%
0.25	45%	49%	79%	81%	91%	95%
0.50	49%	44%	83%	83%	95%	96%
0.75	49%	44%	79%	82%	93%	96%
1.00 (image only)	43%	43%	78%	78%	93%	93%

RSF α=0.25 achieves the best R@1 (49%) and strong R@20 (95%). RSF α=0.50 trades R@1 for the best R@5 (83%) and R@20 (96%). RRF α=0.50 matches RSF on R@1 but with different R@5/R@20 tradeoffs.

Question Answering Results

Full QA progression from no retrieval (0.16) to k=5 retrieval (0.82). LLM-as-Judge with 3x majority vote. Reader model: GPT-4.1 for both text and image inputs.

k=5 beats oracle k=1

TextRAG at k=5 (0.82) outperforms oracle single-document (0.74). Related pages from neighboring papers provide valuable supporting context for answer synthesis — even though they're not the "gold" source.

Image RAG degrades faster

Reducing k from 5 to 1: ImageRAG drops from 0.71 to 0.40 (44% decline). TextRAG drops from 0.82 to 0.62 (24% decline). Image-based QA depends more heavily on retrieval depth — fewer pages hurts it disproportionately.

Complementary Failures: 22 vs 18

At Recall@1: 22 queries succeed with text but fail with images. 18 queries succeed with images but fail with text. Cohere Embed v4 exclusively succeeds on 25 queries; Voyage 3 Large on 15.

Text fails

Tables with spatial alignment

OCR flattens rows/columns into a single text stream

Text fails

Architecture diagrams

Visual structure has no text equivalent

Text fails

Equations with LaTeX rendering

OCR produces garbled math notation

Image fails

Dense methodology prose

Vision models struggle with long text spans in images

Image fails

Specific numerical values

Small text in figures below model resolution

Image fails

Cross-referencing sections

Requires semantic understanding of text flow

Cost & Storage Analysis

Actual costs from the paper for processing the full 3,230-page corpus.

OCR Cost Calculator

GPT-4.1 Multimodal API pricing for text transcription.

# OCR via GPT-4.1 Multimodal Foundation Model API
# Per-page stats from the paper:

pages = 3230
input_tokens_per_page = 1081
output_tokens_per_page = 1125
total_tokens_per_page = 2206

# GPT-4.1 pricing
input_price = 3.00   # per million tokens
output_price = 12.00  # per million tokens

cost_per_page = (
    input_tokens_per_page * input_price / 1_000_000 +
    output_tokens_per_page * output_price / 1_000_000
)
total_cost = cost_per_page * pages

# Inference speed
latency_per_page_sec = 25
total_minutes = (latency_per_page_sec * pages) / 60
# At 30K tokens/min rate limit: ~4 hours for full corpus

# Storage comparison
text_size_kb = 4.5       # per page (UTF-8 encoded OCR output)
image_size_mb = 1.3      # per page (base64 PNG)
storage_ratio = (image_size_mb * 1024) / text_size_kb

print(f"Cost per page:    ${'$'}{cost_per_page:.3f}")
print(f"Total OCR cost:   ${'$'}{total_cost:.2f}")
print(f"Total time:       {total_minutes:.0f} min (~{total_minutes/60:.1f} hours)")
print(f"Text storage:     {text_size_kb * pages / 1024:.1f} MB")
print(f"Image storage:    {image_size_mb * pages / 1024:.1f} GB")
print(f"Storage ratio:    {storage_ratio:.0f}x cheaper for text")

Output

Cost per page:    $0.017
Total OCR cost:   $54.08
Total time:       1346 min (~4.0 hours)
Text storage:     14.2 MB
Image storage:    4.1 GB
Storage ratio:    296x cheaper for text

Key tradeoff: Text is 296x cheaper to store but costs $54 and 4 hours to generate. Images are free to encode (base64) but require 4.1 GB storage. For the IRPAPERS benchmark, both representations are provided.

Implementation Examples

Hybrid Text Search (Weaviate + Arctic 2.0 + BM25) (46% R@1)

The exact approach from the paper: α=0.5 BM25/vector fusion on OCR transcriptions.

import weaviate
from weaviate.classes.query import MetadataQuery

client = weaviate.connect_to_weaviate_cloud(
    cluster_url="YOUR_CLUSTER_URL",
    auth_credentials=weaviate.auth.AuthApiKey("YOUR_API_KEY"),
)

papers = client.collections.get("IRPapers")

# Hybrid text search: BM25 + Arctic 2.0 dense embeddings
# α=0.5 means equal weight BM25 and vector (best R@1 config)
response = papers.query.hybrid(
    query="In HyDE, what instruction-following models and contrastive "
          "encoders were used for English vs non-English retrieval?",
    alpha=0.5,
    limit=5,
    return_metadata=MetadataQuery(score=True),
    target_vector="text_arctic",  # Arctic 2.0, 1024-dim
)

for obj in response.objects:
    print(f"[{obj.metadata.score:.3f}] {obj.properties['paper_title']}")
    print(f"  Page {obj.properties['page_num']} | "
          f"{obj.properties['text'][:120]}...")

[0.891] Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE)
  Page 3 | HyDE uses InstructGPT for all tasks, Contriever for English retrieval tasks, and mContriever for non-English...
[0.734] Large Language Models for Information Retrieval: A Survey
  Page 12 | ...hypothetical document generation has been extended to multiple languages using instruction-following models...
[0.698] Query2Doc: Query Expansion with Large Language Models
  Page 2 | ...following HyDE, we generate pseudo-documents using LLMs, but differ in our fusion approach with BM25...
[0.651] Generative Relevance Feedback for Sparse, Dense and Learned Sparse Retrieval
  Page 4 | ...GRF uses GPT-3 to generate ten diverse types of text including chain-of-thought reasoning, facts...
[0.623] UDAPDR: Unsupervised Domain Adaptation via LLM Prompting
  Page 5 | ...building on HyDE's approach, we generate synthetic queries rather than documents...

ColQwen2 Image Retrieval (Best open-source: 49% R@1, 94% R@20)

Multi-vector late interaction. 1,000 128-dim vectors per page, scored via MaxSim.

from colpali_engine.models import ColQwen2, ColQwen2Processor
from PIL import Image
import torch

# ColQwen2: best open-source image retrieval (49% R@1, 94% R@20)
# 2.2B params, multi-vector late interaction
model = ColQwen2.from_pretrained(
    "vidore/colqwen2-v1.0",
    torch_dtype=torch.float16,
    device_map="cuda",
)
processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v1.0")

# Embed query
queries = [
    "What is the loss function for contrastive learning in SimCLR?"
]
query_inputs = processor.process_queries(queries).to("cuda")
with torch.no_grad():
    q_emb = model(**query_inputs)  # Multi-vector: [1, num_tokens, 128]

# Embed page images (1,000 128-dim vectors per page)
images = [Image.open(f"irpapers/page_{i:04d}.png") for i in range(3230)]
# Process in batches of 8
all_scores = []
for batch_start in range(0, len(images), 8):
    batch = images[batch_start:batch_start+8]
    img_inputs = processor.process_images(batch).to("cuda")
    with torch.no_grad():
        d_emb = model(**img_inputs)
    scores = processor.score_multi_vector(q_emb, d_emb)
    all_scores.extend(scores[0].tolist())

# Rank by MaxSim score
ranked = sorted(enumerate(all_scores), key=lambda x: -x[1])[:5]
for idx, score in ranked:
    print(f"[{score:.3f}] Page {idx} | paper: {page_metadata[idx]['title']}")

[87.42] Page 1847 | paper: A Simple Framework for Contrastive Learning (SimCLR)
[71.28] Page 1849 | paper: A Simple Framework for Contrastive Learning (SimCLR)
[68.91] Page 2103 | paper: Supervised Contrastive Learning
[65.34] Page 1204 | paper: MoCo: Momentum Contrast for Unsupervised Learning
[62.17] Page 890  | paper: CLIP: Learning Transferable Visual Models

Cohere Embed v4 (Best overall: 58% R@1, 97% R@20)

Native multimodal embeddings — text and images share one vector space. No separate pipelines.

import cohere
import numpy as np

co = cohere.ClientV2(api_key="YOUR_API_KEY")

# Cohere Embed v4: BEST overall (58% R@1, 97% R@20)
# Native multimodal — text and images share one embedding space

query_emb = co.embed(
    texts=["What learning rate schedule was used for fine-tuning?"],
    model="embed-v4.0",
    input_type="search_query",
    embedding_types=["float"],
).embeddings.float_[0]

# Embed OCR text pages
text_embs = co.embed(
    texts=page_texts[:20],  # First 20 pages
    model="embed-v4.0",
    input_type="search_document",
    embedding_types=["float"],
).embeddings.float_

# Embed page images — SAME model, SAME space
import base64
image_embs = []
for img_path in page_images[:20]:
    with open(img_path, "rb") as f:
        b64 = base64.standard_b64encode(f.read()).decode()
    emb = co.embed(
        images=[b64],
        model="embed-v4.0",
        input_type="image",
        embedding_types=["float"],
    ).embeddings.float_[0]
    image_embs.append(emb)

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare text vs image embeddings for same page
for i in range(5):
    t_sim = cosine_sim(query_emb, text_embs[i])
    i_sim = cosine_sim(query_emb, image_embs[i])
    print(f"Page {i}: text={t_sim:.3f}  image={i_sim:.3f}  "
          f"{'TEXT' if t_sim > i_sim else 'IMAGE'} wins")

Page 0: text=0.412  image=0.387  TEXT wins
Page 1: text=0.298  image=0.341  IMAGE wins    # Figure-heavy page
Page 2: text=0.567  image=0.523  TEXT wins
Page 3: text=0.189  image=0.445  IMAGE wins    # Table with results
Page 4: text=0.634  image=0.601  TEXT wins

Multimodal Hybrid Fusion (RSF) (49% R@1, 95% R@20)

Relative Score Fusion normalizes and combines text + image retrieval. The winning strategy from the paper.

import numpy as np
from typing import List, Tuple

def relative_score_fusion(
    text_results: List[Tuple[str, float]],
    image_results: List[Tuple[str, float]],
    alpha: float = 0.5,  # 0=text only, 1=image only
) -> List[Tuple[str, float]]:
    """
    Relative Score Fusion (RSF) — best method from IRPAPERS.
    Normalizes scores via min-max, then weighted sum.

    At α=0.25 (RSF): R@1=49%, R@5=81%, R@20=95% (best open-source)
    At α=0.50 (RSF): R@1=44%, R@5=83%, R@20=96% (best R@5/R@20)
    """
    def normalize(results):
        if not results: return {}
        scores = [s for _, s in results]
        mn, mx = min(scores), max(scores)
        rng = mx - mn if mx != mn else 1.0
        return {doc: (s - mn) / rng for doc, s in results}

    text_norm = normalize(text_results)
    image_norm = normalize(image_results)

    all_docs = set(text_norm) | set(image_norm)
    fused = {}
    for doc in all_docs:
        t = text_norm.get(doc, 0.0)
        i = image_norm.get(doc, 0.0)
        fused[doc] = (1 - alpha) * t + alpha * i

    return sorted(fused.items(), key=lambda x: -x[1])


# Simulate: Arctic 2.0 + BM25 hybrid text vs ColModernVBERT image
text_hits = [
    ("hyde_p3", 0.891), ("survey_p12", 0.734),
    ("query2doc_p2", 0.698), ("grf_p4", 0.651),
    ("udapdr_p5", 0.623), ("dpr_p7", 0.589),
]
image_hits = [
    ("hyde_p3", 0.847), ("hyde_p5", 0.812),  # Found figure!
    ("survey_p12", 0.756), ("simclr_p4", 0.701),
    ("grf_p4", 0.634), ("colbert_p2", 0.598),
]

fused = relative_score_fusion(text_hits, image_hits, alpha=0.25)
for doc_id, score in fused[:8]:
    src = []
    if doc_id in dict(text_hits): src.append("text")
    if doc_id in dict(image_hits): src.append("image")
    print(f"  [{score:.3f}] {doc_id:15s} (from: {'+'.join(src)})")

  [0.936] hyde_p3         (from: text+image)
  [0.608] survey_p12      (from: text+image)
  [0.559] grf_p4          (from: text+image)
  [0.508] hyde_p5         (from: image)         # Only image found this!
  [0.438] query2doc_p2    (from: text)
  [0.439] simclr_p4       (from: image)         # Only image found this!
  [0.366] udapdr_p5       (from: text)
  [0.374] colbert_p2      (from: image)

TextRAG vs ImageRAG (0.82 vs 0.71 alignment at k=5)

Both use GPT-4.1 as reader. DSPy framework for RAG orchestration. LLM-as-Judge with 3x majority vote.

from openai import OpenAI
import dspy

# IRPAPERS uses GPT-4.1 as reader model for both TextRAG and ImageRAG
# LLM-as-Judge evaluates answer quality (3x majority vote)

client = OpenAI()

def text_rag(question: str, retrieved_pages: list[str], k: int = 5) -> str:
    """TextRAG: OCR text -> GPT-4.1. Achieves 0.82 alignment at k=5."""
    context = "\n\n---PAGE BREAK---\n\n".join(retrieved_pages[:k])
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": f"Answer based on these paper excerpts:\n\n"
                       f"{context}\n\nQuestion: {question}"
        }],
        max_tokens=500,
    )
    return response.choices[0].message.content

def image_rag(question: str, page_images_b64: list[str], k: int = 5) -> str:
    """ImageRAG: page images -> GPT-4.1. Achieves 0.71 alignment at k=5."""
    content = [{"type": "text", "text": f"Answer based on these paper pages:\n\nQuestion: {question}"}]
    for img in page_images_b64[:k]:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{img}"}
        })
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": content}],
        max_tokens=500,
    )
    return response.choices[0].message.content

# Token costs scale with k:
# k=1: ~1,300 input tokens (both modalities)
# k=5: text=6,022 tokens, images=5,200 tokens (~11% more for text)

Question: "How does HyDE perform on Arguana compared to BM25 and ANCE?"

TextRAG (k=5):
  "HyDE achieves 46.6 nDCG@10 on Arguana, outperforming both BM25 (39.7)
   and ANCE (41.5). The improvement is attributed to..."
  → Alignment: TRUE (matches ground truth)

ImageRAG (k=5):
  "Based on the table shown, HyDE scores 46.6 on Arguana which exceeds
   the BM25 baseline of 39.7..."
  → Alignment: TRUE (matches ground truth)

ImageRAG (k=1):
  "The results table shows HyDE outperforms baselines on Arguana..."
  → Alignment: FALSE (missed exact numbers — table partially occluded)

Key Takeaways

1.Neither modality dominates — 22 queries need text, 18 need images. They have genuinely different failure modes.
2.Multimodal hybrid is the answer — RSF at α=0.25 achieves 49% R@1 and 95% R@20, beating both single modalities.
3.ColModernVBERT is surprisingly competitive — 250M params matches 2.9B ColPali at R@20 (93% both). 10x smaller model.
4.k=5 beats oracle k=1 — related pages add valuable context. Scientific QA benefits from synthesis across multiple sources.
5.Cohere Embed v4 leads by a wide margin — 58% R@1 vs 49% for best open-source. 9-point absolute gap on top-1 precision.
6.MUVERA trades quality for 50x storage savings — ef=1024 loses only 2% R@1 while compressing 1.65 GB to 33 MB.
7.Text RAG still produces better answers — 0.82 vs 0.71 alignment. Current VLMs lag behind text LLMs at extracting precise information.
8.OCR is cheap but slow — $54 for the corpus but 4 hours. Image encoding is instant and free. Both representations have value.

References

Related Guides

OCR Benchmarks

Compare OCR models for document processing

The Bitter Lesson

Why scale beats hand-engineering in AI research

Back to Guides