Level 3: Production · ~35 min

Reranking

Two-stage retrieval: fast recall first, then precise reranking. The single highest-leverage improvement you can make to any RAG pipeline.

25 Years of Learning to Rank

Reranking is not a new idea. It grew out of the "learning to rank" (LTR) field that emerged when web search engines realized that hand-tuned scoring functions could not keep up with the complexity of user queries. Each generation solved a fundamental limitation of the last, moving from engineered features to cross-attention to generative reasoning.

Understanding this progression is essential because every approach is still in production somewhere today. The choice between them is one of the most consequential architectural decisions in a retrieval system.

Era I: Feature-Based Learning to Rank
2000–2005

RankNet & Pairwise Learning

At Microsoft Research, Chris Burges and colleagues framed ranking as a machine learning problem. Instead of scoring documents independently with a pointwise classification loss, RankNet trained a neural network on pairs of documents: given query Q, should document A rank above document B? The model learned a scoring function from hand-crafted features (BM25 score, PageRank, URL depth, click-through rate, etc.) and was trained with a cross-entropy loss on pairwise preferences.

"We show that our cost function is related to a cross entropy cost and that it can be optimized with gradient descent."

Burges, C. et al. (2005). Learning to Rank using Gradient Descent. ICML.

RankNet established the paradigm: first-stage retrieval (BM25) produces candidates, then a learned ranker reorders them. This two-stage architecture is still the standard 20 years later. The limitation was that optimizing pairwise accuracy didn't directly optimize the metric search engines actually cared about — NDCG.
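The pairwise objective is simple enough to state in code. Here is a minimal sketch of the RankNet loss in plain NumPy (illustrative, not the original implementation):

```python
import numpy as np

def ranknet_pair_loss(s_i: float, s_j: float, p_target: float = 1.0) -> float:
    """Cross-entropy on the pairwise preference P(doc_i ranks above doc_j).

    s_i, s_j are the model's scores for the two documents; p_target = 1.0
    means doc_i is the preferred document in this training pair."""
    p_model = 1.0 / (1.0 + np.exp(-(s_i - s_j)))  # sigmoid of the score gap
    return float(-(p_target * np.log(p_model)
                   + (1 - p_target) * np.log(1 - p_model)))

# A correctly ordered pair costs little; a mis-ordered pair costs a lot
assert ranknet_pair_loss(2.0, -2.0) < ranknet_pair_loss(-2.0, 2.0)
```

Gradient descent on this loss pushes the score gap in the direction of the labeled preference, regardless of what position either document ends up at, which is exactly the limitation the next generation addressed.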

2007–2010

LambdaMART: The Industry Workhorse

Burges solved the NDCG-optimization problem with a brilliant trick: define "lambda gradients" that approximate the gradient of NDCG — a non-differentiable, position-dependent metric — and plug them into gradient-boosted decision trees (MART). The result, LambdaMART, won the Yahoo! Learning to Rank Challenge in 2010 and became the backbone of production search at Microsoft Bing, Yahoo, and Yandex.

# LambdaMART: gradient-boosted trees with lambda gradients
# Input: hand-crafted features per (query, document) pair
features = [
    bm25_score,        # Lexical match
    pagerank,          # Authority signal
    url_depth,         # Structural signal
    click_through_rate, # Behavioral signal
    query_doc_overlap, # Term match
    ...                # 500+ features in production
]
# Lambda gradient: how much would swapping doc_i and doc_j change NDCG,
# scaled by how badly the pair is currently mis-ordered
# (for a pair where doc_i should rank above doc_j)
lambda_ij = |delta_NDCG(i,j)| * sigmoid(score_j - score_i)
# Fit gradient-boosted trees to these gradients

Burges, C. (2010). From RankNet to LambdaRank to LambdaMART: An Overview. MSR Technical Report.

LambdaMART is still deployed at massive scale today. Its limitation is that it requires extensive feature engineering — hundreds of hand-crafted signals per query-document pair. When neural models learned to derive these features automatically from raw text, the next era began.
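The lambda trick depends on computing how much a single swap would move NDCG. A self-contained sketch (the helper names here are illustrative, not from the paper):

```python
import numpy as np

def dcg(relevances: np.ndarray) -> float:
    """Discounted cumulative gain for a ranked list of graded relevances."""
    positions = np.arange(1, len(relevances) + 1)
    return float(np.sum((2.0 ** relevances - 1) / np.log2(positions + 1)))

def delta_ndcg(relevances: list[float], i: int, j: int) -> float:
    """|NDCG change| from swapping the documents at ranks i and j."""
    rel = np.array(relevances, dtype=float)
    ideal = dcg(np.sort(rel)[::-1])   # DCG of the perfect ordering
    before = dcg(rel)
    rel[i], rel[j] = rel[j], rel[i]
    return abs(dcg(rel) - before) / ideal

# Swapping a highly relevant doc from the bottom to the top moves NDCG
# far more than an equally "wrong" swap deep in the list -- that position
# sensitivity is what the lambda gradients inject into the pairwise loss
assert delta_ndcg([0, 3, 2, 1, 3], 0, 4) > delta_ndcg([0, 3, 2, 1, 3], 2, 3)
```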

Era II: Neural Cross-Encoders
2019

BERT as a Cross-Encoder for Passage Reranking

Rodrigo Nogueira and Kyunghyun Cho at NYU demonstrated that fine-tuning BERT as a cross-encoder — feeding [CLS] query [SEP] passage [SEP] and training a binary classifier on top — dramatically outperformed all previous reranking methods on MS MARCO passage ranking. No feature engineering. No hand-crafted signals. Just raw text in, relevance score out.

"We show that BERT can be used as a neural ranker for passage re-ranking and obtain large improvements over the previous state-of-the-art."

Nogueira, R. & Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085.

This paper established the modern reranking paradigm: cross-attention between query and document is the key. A bi-encoder compresses each text into a single vector before comparison; a cross-encoder lets every query token attend to every document token through 12+ transformer layers. The result is vastly more precise relevance scoring — at the cost of O(n) inference per query rather than O(1) lookup.

2019

Sentence-BERT Crystallizes the Bi/Cross Tradeoff

Nils Reimers and Iryna Gurevych published Sentence-BERT, which made explicit the architectural tradeoff that defines all of modern retrieval. They measured: finding the most similar sentence pair in 10,000 sentences took 65 hours with a BERT cross-encoder (comparing all 50M pairs), but just 5 seconds with a Sentence-BERT bi-encoder (encode once, dot product).

This 47,000x speed difference is why two-stage retrieval exists. You cannot run a cross-encoder over your entire corpus. You must use a fast first stage (bi-encoder or BM25) to narrow candidates, then rerank the top-k with a cross-encoder.

Reimers, N. & Gurevych, I. (2019). Sentence-BERT. EMNLP.
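The arithmetic behind that speedup figure is worth doing once:

```python
n = 10_000
pairs = n * (n - 1) // 2          # every sentence pair must be scored: ~50M
cross_encoder_secs = 65 * 3600    # 65 hours of pairwise BERT inference
bi_encoder_secs = 5               # encode once, then 50M dot products
speedup = cross_encoder_secs / bi_encoder_secs
print(f"{pairs:,} pairs, {speedup:,.0f}x speedup")
# 49,995,000 pairs, 46,800x speedup
```

The ratio rounds to the ~47,000x quoted above, and it scales quadratically: doubling the corpus quadruples the cross-encoder cost while the bi-encoder cost grows linearly.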

2020–2023

ColBERT: Late Interaction as a Middle Ground

Omar Khattab and Matei Zaharia at Stanford proposed a third architecture that sits between bi-encoder and cross-encoder: late interaction. ColBERT encodes query and document independently (like a bi-encoder) but retains per-token embeddings instead of compressing to a single vector. At scoring time, it computes a "MaxSim" operation — for each query token, find the maximum cosine similarity with any document token, then sum.

# ColBERT late interaction scoring
# Query tokens: Q = [q_1, q_2, ..., q_m]  — each is a vector
# Doc tokens:   D = [d_1, d_2, ..., d_n]  — each is a vector
# Both encoded INDEPENDENTLY (like bi-encoder)

score = 0
for q_i in query_token_embeddings:
    max_sim = max(cosine_sim(q_i, d_j) for d_j in doc_token_embeddings)
    score += max_sim

# This "MaxSim" operation is much richer than single-vector dot product
# but much cheaper than full cross-attention

Khattab, O. & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction. SIGIR.

ColBERTv2 (2022) added residual compression, shrinking each 128-dimensional token vector from full-precision floats down to roughly 20–36 bytes, making it practical for large corpora. The Stanford DSP (now DSPy) framework builds heavily on ColBERT retrieval. Late interaction represents a genuine architectural innovation; it's not just "bi-encoder but bigger."

Santhanam, K. et al. (2022). ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. NAACL.
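The MaxSim loop above vectorizes to two NumPy operations. A toy sketch, with random vectors standing in for real ColBERT token embeddings:

```python
import numpy as np

def maxsim_score(Q: np.ndarray, D: np.ndarray) -> float:
    """Late-interaction score: best document-token match for each query
    token, summed. Q is (m, dim), D is (n, dim); rows are L2-normalized
    so the dot product equals cosine similarity."""
    sim = Q @ D.T                        # (m, n) pairwise cosine similarities
    return float(sim.max(axis=1).sum())  # MaxSim per query token, then sum

rng = np.random.default_rng(0)
unit = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
Q = unit(rng.standard_normal((4, 8)))                      # 4 query tokens
D_match = unit(Q + 0.05 * rng.standard_normal((4, 8)))     # near-duplicates
D_other = unit(rng.standard_normal((12, 8)))               # unrelated doc
assert maxsim_score(Q, D_match) > maxsim_score(Q, D_other)
```

Because documents are encoded independently, the D matrices can all be pre-computed at ingestion time; only the small Q @ D.T product happens at query time.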

Era III: LLM-Based Reranking
2023

RankGPT: Reranking as Generative Reasoning

Sun, Yan, Ma, et al. showed that GPT-4 could rerank passages by reasoning about relevance in natural language. Instead of outputting a score, the LLM directly generates a permutation of document identifiers, ordered by relevance. The approach uses a sliding-window strategy: present 20 passages at a time, ask the LLM to sort them, then slide the window with a bubble-sort-like procedure.

"LLMs can effectively serve as zero-shot relevance rankers, outperforming supervised cross-encoder models on multiple benchmarks without any task-specific training."

Sun, W. et al. (2023). Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents. EMNLP.

The trade-off is extreme: RankGPT achieves state-of-the-art relevance on TREC-DL and BEIR but costs 100–1000x more per query than a cross-encoder and takes seconds instead of milliseconds. In practice, it's used for offline evaluation, training data generation, and ultra-high-value queries where cost is not a constraint.
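The sliding-window procedure is easy to sketch. Here `llm_rank` is a placeholder for the actual LLM call, assumed to return the window's passages reordered by relevance:

```python
def sliding_window_rerank(passages: list, llm_rank,
                          window: int = 20, stride: int = 10) -> list:
    """RankGPT-style sliding-window reranking (sketch).

    Windows are processed back-to-front so strong passages bubble toward
    the top, like one pass of bubble sort over overlapping blocks."""
    ranked = list(passages)
    start = max(len(ranked) - window, 0)
    while True:
        ranked[start:start + window] = llm_rank(ranked[start:start + window])
        if start == 0:
            break
        start = max(start - stride, 0)
    return ranked

# With a toy "LLM" that sorts numbers descending, the best passage always
# bubbles to the front because consecutive windows overlap
out = sliding_window_rerank([3, 9, 1, 7, 0, 8, 2, 6, 4, 5],
                            lambda w: sorted(w, reverse=True),
                            window=4, stride=2)
assert out[0] == 9
```

Overlap (window minus stride) is what lets a strong passage travel across window boundaries; with no overlap, a relevant passage at the bottom could never reach the top.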

2023–2025

The Modern Reranker Arms Race

The field exploded with purpose-built reranking models that combine the precision of cross-encoders with the scale of modern training pipelines:

BGE Reranker v2.5

BAAI. LLM-based reranker using Gemma/Llama backbone. Multilingual. SOTA on BEIR.

Cohere Rerank 3.5

Production API. 4096 token context. 100+ languages. Best-in-class latency/quality.

Jina Reranker v2

8K context window. Code-aware. Multilingual. Open weights.

RankLLaMA & RankZephyr

Open-source LLM rerankers distilled from GPT-4 rankings. Competitive with commercial APIs.

The throughline: 2000 → 2026

Twenty-five years. One insight refined relentlessly: first-stage retrieval optimizes for recall, second-stage reranking optimizes for precision.

2000–2010 · Features: Hand-crafted signals + gradient-boosted trees (RankNet, LambdaMART)
2019 · Cross-attention: BERT reads query and document together (Nogueira & Cho)
2020–2022 · Late interaction: Per-token matching without full cross-attention (ColBERT)
2023–now · Generative: LLMs reason about relevance in natural language (RankGPT, RankLLaMA)

The Problem: Bi-Encoders Are Fast But Imprecise

In Lesson 0.1, you learned about embedding models (bi-encoders). They encode queries and documents separately into fixed-size vectors, then compare with a dot product. This architectural choice — independent encoding — is both the source of their speed and the root of their limitation.

The Information Bottleneck

Bi-encoder process:

Query: "python memory leak"  → [0.2, 0.8, ...]
Doc:   "fixing memory issues" → [0.3, 0.7, ...]

Each text compressed to a single 768-dim vector. All information about the relationship between query and document is lost — only what each text means independently is preserved.

Why this fails on nuanced queries:

"python memory leak" vs "python memory management" get similar scores despite one being about problems, the other about concepts
Negation: "foods that do NOT contain gluten" matches documents about gluten
Multi-part queries lose specificity when compressed to one vector

Deep Dive: Bi-Encoder vs Cross-Encoder Architecture

The bi-encoder vs cross-encoder distinction is the fundamental architectural decision in modern retrieval. Everything else — model size, training data, fine-tuning strategy — is secondary to this choice.

Side-by-Side Architecture Comparison

Bi-Encoder

Query → Encoder → v_q
Doc → Encoder → v_d
score = dot(v_q, v_d)

O(1) scoring. Docs pre-encoded.

Cross-Encoder

[CLS] query [SEP] doc [SEP]
↓ 12 layers of cross-attention
score = classifier([CLS])

O(n) scoring per query. Full attention.

Late Interaction (ColBERT)

Query → Encoder → [q_1..q_m]
Doc → Encoder → [d_1..d_n]
score = Σ max_sim(q_i, D)

Token-level matching. Docs pre-encoded.

Bi-Encoder (Stage 1: Retrieval)

  + Sub-millisecond scoring (pre-computed embeddings + ANN index)
  + Scales to billions of documents with HNSW/IVF indices
  + Documents encoded once at ingestion time
  - Information bottleneck: entire document compressed to one vector
  - Cannot model fine-grained query-document interactions
  - Struggles with negation, multi-hop reasoning, long documents

Cross-Encoder (Stage 2: Reranking)

  + Full cross-attention: every query token sees every document token
  + Handles negation, specificity, multi-part queries
  + 5–15% NDCG@10 gain over bi-encoder alone on most benchmarks
  - O(n) inference: must process every candidate document per query
  - Cannot pre-compute: document representation depends on query
  - Typically limited to reranking top 50–200 candidates

Why the performance gap exists

A bi-encoder must compress all information about a 512-token document into a single 768-dimensional vector before it knows what the query will ask about. A cross-encoder sees both simultaneously and can attend to different parts of the document depending on the query.

Consider the query "side effects of aspirin for dogs." A document about aspirin that mentions canine use in one paragraph will score well with a cross-encoder (which attends directly to that paragraph) but may score poorly with a bi-encoder (which must represent the entire document, diluting the dog-specific signal). This is the information-theoretic argument for reranking: cross-encoders have strictly more information at scoring time.

Humeau, S. et al. (2020). Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring. ICLR.

Working Code: Three Approaches to Reranking

Here are production-ready implementations for the three most common reranking approaches.

1. Cross-Encoder with sentence-transformers

The simplest path. Open-source, runs locally, no API key needed.

# pip install sentence-transformers numpy
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

# Stage 1: Bi-encoder for initial retrieval
bi_encoder = SentenceTransformer('BAAI/bge-small-en-v1.5')

documents = [
    "Python memory leak debugging techniques",
    "JavaScript garbage collection explained",
    "How to profile memory usage in Python applications",
    "Understanding Python memory management internals",
    "Memory optimization strategies for large datasets",
    "Fixing out of memory errors in Python",
    "Python virtual memory and swap usage",
    "Common causes of memory leaks in web applications"
]

# Pre-compute document embeddings (done once at ingestion)
doc_embeddings = bi_encoder.encode(documents, normalize_embeddings=True)

# Query time
query = "how to find and fix python memory leaks"
query_embedding = bi_encoder.encode(query, normalize_embeddings=True)

# Stage 1: Fast retrieval via dot product
scores = np.dot(doc_embeddings, query_embedding)
top_k_indices = np.argsort(scores)[::-1][:5]

print("Stage 1 (Bi-encoder) ranking:")
for i, idx in enumerate(top_k_indices):
    print(f"  {i+1}. [{scores[idx]:.3f}] {documents[idx]}")

# Stage 2: Cross-encoder reranking
reranker = CrossEncoder('BAAI/bge-reranker-v2-m3')  # 568M params
candidates = [documents[i] for i in top_k_indices]
pairs = [[query, doc] for doc in candidates]
reranker_scores = reranker.predict(pairs)

# Sort by reranker scores
reranked = sorted(zip(candidates, reranker_scores), key=lambda x: x[1], reverse=True)

print("\nStage 2 (Cross-encoder) reranking:")
for i, (doc, score) in enumerate(reranked):
    print(f"  {i+1}. [{score:.3f}] {doc}")
Example output (exact scores vary by hardware and model version):
Stage 1 (Bi-encoder) ranking:
  1. [0.821] Python memory leak debugging techniques
  2. [0.789] Understanding Python memory management internals
  3. [0.756] How to profile memory usage in Python applications
  4. [0.734] Fixing out of memory errors in Python
  5. [0.698] Memory optimization strategies for large datasets

Stage 2 (Cross-encoder) reranking:
  1. [0.967] Python memory leak debugging techniques
  2. [0.912] Fixing out of memory errors in Python
  3. [0.845] How to profile memory usage in Python applications
  4. [0.621] Common causes of memory leaks in web applications
  5. [0.398] Understanding Python memory management internals

The cross-encoder promotes "Fixing out of memory errors" from #4 to #2 because it can reason about the action implied by the query ("find and fix"). It also demotes "Understanding...internals" because that document is conceptual, not actionable.

2. Cohere Rerank API

Production-grade API with multilingual support, long-context handling, and managed infrastructure.

# pip install cohere
import cohere

co = cohere.ClientV2("your-api-key")

query = "how to find and fix python memory leaks"
documents = [
    "Python memory leak debugging techniques using tracemalloc",
    "JavaScript garbage collection explained: V8 engine internals",
    "How to profile memory usage in Python applications with memory_profiler",
    "Fixing out of memory errors in Python: practical solutions",
    "Understanding Python memory management: reference counting and gc module"
]

response = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=documents,
    top_n=3               # Return only the top 3
)

for result in response.results:
    # The v2 rerank response carries (index, relevance_score);
    # look the document text up by index
    print(f"  [{result.relevance_score:.3f}] (idx={result.index}) {documents[result.index]}")

# Example output (exact scores vary by model version):
#   [0.982] (idx=0) Python memory leak debugging techniques using tracemalloc
#   [0.934] (idx=3) Fixing out of memory errors in Python: practical solutions
#   [0.891] (idx=2) How to profile memory usage in Python applications...

3. ColBERT Late Interaction with RAGatouille

ColBERT-style late interaction via the RAGatouille library. Pre-computes per-token embeddings for fast retrieval with richer matching than single-vector bi-encoders.

# pip install ragatouille
from ragatouille import RAGPretrainedModel

# Load ColBERTv2
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Index documents (computes per-token embeddings)
documents = [
    "Python memory leak debugging techniques using tracemalloc and objgraph",
    "JavaScript garbage collection explained: V8 engine internals",
    "How to profile memory usage in Python with memory_profiler and pympler",
    "Fixing out of memory errors in Python: practical debugging guide",
    "Understanding CPython memory management: reference counting, gc module"
]

RAG.index(
    collection=documents,
    index_name="memory_debugging",
    max_document_length=256,
    split_documents=True
)

# Search with late interaction (MaxSim scoring)
results = RAG.search(query="python memory leak debugging", k=3)

for r in results:
    print(f"  [{r['score']:.1f}] {r['content'][:80]}...")

# ColBERT's per-token matching catches "debugging" <-> "debugging" directly
# while also matching "leak" with contextually similar tokens in each doc

Benchmarks: Reranking Impact Across Datasets

Reranking consistently improves retrieval quality, but the magnitude varies dramatically by dataset and query type. The BEIR benchmark (Thakur et al., 2021) is the standard evaluation suite, testing zero-shot generalization across 18 diverse datasets.

MS MARCO Passage Ranking (NDCG@10)

The most widely-used benchmark for passage retrieval and reranking.

| Pipeline | NDCG@10 | MRR@10 | Latency (p50) |
|---|---|---|---|
| BM25 only | 0.228 | 0.187 | 5ms |
| BGE-base bi-encoder | 0.343 | 0.294 | 15ms |
| ColBERTv2 (late interaction) | 0.397 | 0.349 | 40ms |
| BGE bi-encoder + MiniLM reranker | 0.389 | 0.335 | 80ms |
| BGE bi-encoder + BGE-reranker-v2-m3 | 0.425 | 0.378 | 120ms |
| Hybrid (BM25+vector) + Cohere Rerank 3.5 | 0.441 | 0.392 | 180ms |
| BM25 + RankGPT (GPT-4) | 0.459 | 0.412 | 3–8s |

Sources: MTEB leaderboard, Nogueira & Cho (2019), Sun et al. (2023). Latencies measured on A100 GPU for local models, API round-trip for hosted services. Top-100 candidates reranked to top-10.

BEIR Zero-Shot Reranking (Average NDCG@10 across 13 datasets)

Zero-shot generalization: models tested on datasets they were not trained on. This measures real-world transfer.

| Reranker | Avg NDCG@10 | Params | Open Source |
|---|---|---|---|
| No reranker (BM25 baseline) | 0.440 | n/a | n/a |
| cross-encoder/ms-marco-MiniLM-L-6-v2 | 0.492 | 22M | Yes |
| BAAI/bge-reranker-large | 0.518 | 560M | Yes |
| BAAI/bge-reranker-v2-m3 | 0.537 | 568M | Yes |
| Jina Reranker v2 | 0.529 | 278M | Yes |
| Cohere Rerank 3.5 | 0.548 | Undisclosed | No (API) |
| RankGPT (GPT-4o) | 0.556 | >100B | No (API) |

Thakur, N. et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS Datasets and Benchmarks.

Key Insight

The best cost-effective production pipeline in 2026 is: hybrid search (BM25 + vector) with RRF fusion, followed by a cross-encoder reranker. This three-stage approach — BM25 || vector → RRF fusion → rerank — achieves 90%+ of GPT-4 reranking quality at 1/1000th the cost and 50x lower latency.

The marginal improvement from LLM-based reranking (RankGPT) is real but small (+1–3% NDCG) and comes with 100x latency and cost penalty. Reserve it for offline evaluation and training data generation.

When Reranking Helps Most (and When to Skip It)

Reranking adds 50–200ms of latency and either compute cost (local models) or API cost (hosted services). The decision to include it should be empirical, not dogmatic.

Reranking Adds High Value When:

1. RAG with LLM generation. Better context = better LLM outputs. The reranking cost is tiny vs. LLM inference cost. This is the #1 use case.

2. Long or heterogeneous documents. Bi-encoders struggle to compress long documents into one vector. Cross-encoders attend to specific passages.

3. Complex, multi-part queries. Queries with conditions, negation, or multiple constraints ("papers on X but not Y from after 2020").

4. High-stakes applications. Legal, medical, financial search where precision directly impacts outcomes.

5. Cross-lingual retrieval. Rerankers like Cohere and Jina handle 100+ languages, improving multilingual matching significantly.

Skip Reranking When:

1. Latency budget < 50ms. Search-as-you-type, real-time autocomplete, gaming. The reranker alone takes 50ms+.

2. Simple keyword-like queries. Product catalog search, navigation queries. BM25 or a bi-encoder is sufficient.

3. Bi-encoder precision is already high. Measure first. If the top-5 bi-encoder results are already correct for your queries, reranking adds cost without value.

4. Budget-constrained at extreme scale. 10M+ queries/day with tight margins. A 22M-param MiniLM reranker is cheap, but at massive scale even that adds up.

5. Recall is the bottleneck, not precision. If relevant documents are not in your top-100 candidates, reranking cannot help. Fix retrieval first.
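That last point is measurable before building anything. A quick recall@k check (a sketch, not from any particular library):

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str],
                k: int = 100) -> float:
    """Fraction of the relevant documents that survive into the top-k
    candidate set. If this is low, the reranker has nothing to work with:
    fix first-stage retrieval before spending on reranking."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

# Only 1 of 3 relevant docs made it into the candidates, so even a
# perfect reranker is capped at a third of the ideal result set
print(recall_at_k(["d1", "d9", "d4"], {"d1", "d2", "d3"}, k=100))  # ~0.33
```

Averaged over a labeled query set, this number tells you whether the next dollar is better spent on retrieval or on reranking.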

Production Patterns

Battle-tested patterns for deploying reranking in production systems.

Pattern 1: Retrieve 100, Rerank to 10

The standard approach. Bi-encoder casts a wide net for recall, cross-encoder narrows for precision. The ratio matters: reranking top-20 misses good candidates; reranking top-500 wastes compute. Top-100 is the empirical sweet spot for most datasets.

# Stage 1: Get top 100 candidates (fast, ~15ms)
candidates = vector_search(query, k=100)

# Stage 2: Rerank to top 10 (slower, ~100ms for 100 pairs)
pairs = [(query, doc.text) for doc in candidates]
scores = reranker.predict(pairs)

# Return top 10 by reranker score
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
return reranked[:10]

Pattern 2: Conditional Reranking

Only rerank when the query is complex or initial results show low confidence. Reduces compute by 40–60% with minimal quality loss.

import numpy as np

def should_rerank(query: str, initial_scores: list[float]) -> bool:
    """Decide whether reranking is worth the cost for this query."""
    if not initial_scores:
        return True  # No scores to inspect; let the reranker try
    # Complex query: multiple clauses or conditions
    if len(query.split()) > 8:
        return True
    # Low confidence: top result score is not clearly dominant
    if len(initial_scores) >= 2:
        gap = initial_scores[0] - initial_scores[1]
        if gap < 0.05:  # Top two results too close: ambiguous
            return True
    # Low absolute confidence
    if initial_scores[0] < 0.65:
        return True
    return False

# Usage in pipeline
candidates, scores = vector_search(query, k=100)
if should_rerank(query, scores[:10]):
    return rerank(query, candidates[:100])
else:
    return candidates[:10]  # Bi-encoder results good enough

Pattern 3: Score Calibration and Thresholding

Cross-encoder scores are uncalibrated logits, not probabilities. Normalize them before using score thresholds to filter irrelevant results.

from scipy.special import expit  # sigmoid

def calibrated_rerank(query: str, candidates: list[str], reranker) -> list:
    """Rerank with calibrated scores and relevance threshold."""
    pairs = [[query, doc] for doc in candidates]
    raw_scores = reranker.predict(pairs)

    # Sigmoid converts logits to 0-1 probabilities
    calibrated = expit(raw_scores)

    # Filter: only return documents above relevance threshold
    threshold = 0.5
    results = [
        (doc, score) for doc, score in zip(candidates, calibrated)
        if score >= threshold
    ]

    return sorted(results, key=lambda x: x[1], reverse=True)

Pattern 4: Multi-Stage Cascade

For maximum quality: BM25 + vector retrieval, fuse with RRF, then rerank. This is the architecture behind most production RAG systems in 2026.

def three_stage_retrieval(query: str, k: int = 10):
    # Stage 1a: BM25 retrieval (lexical)
    bm25_results = bm25_search(query, k=100)

    # Stage 1b: Vector retrieval (semantic) — runs in parallel
    vector_results = vector_search(query, k=100)

    # Stage 2: Reciprocal Rank Fusion
    fused = reciprocal_rank_fusion(
        [bm25_results, vector_results],
        k=60  # RRF constant
    )
    candidates = fused[:100]

    # Stage 3: Cross-encoder reranking
    reranked = rerank(query, candidates)
    return reranked[:k]
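The cascade above leans on `reciprocal_rank_fusion`, which is small enough to show in full. This sketch assumes each result list is a ranked list of document ids:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]],
                           k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc earns 1 / (k + rank) from every list
    it appears in, so agreement across retrievers dominates any single
    retriever's opinion. k=60 is the conventional RRF constant."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" tops both the lexical and the semantic list, so it wins even
# though "a" and "c" also appear in both
fused = reciprocal_rank_fusion([["b", "a", "c"], ["b", "c", "a"]])
assert fused[0] == "b"
```

RRF works on ranks, not raw scores, so it needs no calibration between BM25 scores and cosine similarities, which is why it is the default fusion choice in this pattern.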

Reranker Model Guide (March 2026)

Choosing a reranker involves balancing quality, latency, cost, and deployment complexity.

BAAI/bge-reranker-v2-m3

Best open-source all-rounder. Multilingual (100+ languages). LLM-based architecture with cross-encoder scoring. The default recommendation for teams that want to self-host.

568M params · BEIR avg 0.537 · ~50ms/pair (A100) · free, open source

Cohere Rerank 3.5

Production-grade API. Best quality-to-latency ratio for teams that want zero infrastructure. 4096-token context. Semi-structured document support. Battle-tested at scale.

Hosted API · BEIR avg ~0.548 · ~80ms latency · $2 per 1,000 searches

cross-encoder/ms-marco-MiniLM-L-6-v2

The lightweight workhorse. Only 22M parameters; runs on CPU in production. Quality is lower than larger models but latency is excellent. Best choice for cost-sensitive deployments or when a GPU is not available.

22M params · BEIR avg 0.492 · ~8ms/pair (GPU), ~25ms (CPU) · free, open source

Jina Reranker v2 Base Multilingual

Multilingual reranker with an 8K context window and code-awareness. Open weights. Strong on technical/code search tasks where other rerankers underperform.

278M params · BEIR avg 0.529 · multilingual + code · 8K context

ColBERTv2

Late interaction model; not a traditional reranker but an alternative architecture. Per-token embeddings enable richer matching than bi-encoders while maintaining pre-computation. Best when you need bi-encoder-like latency with cross-encoder-like quality.

110M params · MS MARCO 0.397 · requires a special index (PLAID) · free, open source
Common Mistakes

Five Reranking Anti-Patterns

1. Reranking too few candidates

If you retrieve top-10 and rerank top-10, the reranker can only reorder what you already have. It cannot surface documents that were missed. Retrieve at least 5–10x your final k. Want 10 results? Retrieve 100. Nogueira & Cho (2019) showed that reranking top-1000 gives the best quality, but top-100 captures most of the gain.

2. Treating reranker scores as probabilities

Most cross-encoders output raw logits, not calibrated probabilities. A score of 3.2 does not mean "95% relevant." If you need thresholds, apply sigmoid calibration. If you need to compare scores across queries, normalize per-query.

3. Not chunking long documents before reranking

Most cross-encoders have 512-token context windows. Feeding a 5000-word document will silently truncate it. Chunk documents to passage-length (256–512 tokens) before reranking, then take the max score per document if you need document-level ranking.
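The chunk-then-max fix can be sketched as a thin wrapper; `chunk_fn` and `predict` below are placeholders for your splitter and cross-encoder, not any specific library API:

```python
def rerank_long_docs(query: str, docs: list[str],
                     chunk_fn, predict) -> list[tuple[str, float]]:
    """Score each document as the max over its passage-level chunk
    scores, so no text is silently truncated by the reranker's
    context window."""
    scored = []
    for doc in docs:
        chunks = chunk_fn(doc)
        chunk_scores = predict([[query, chunk] for chunk in chunks])
        scored.append((doc, max(chunk_scores)))
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Toy stand-ins: split on sentences, score by word overlap with the query
toy_chunk = lambda d: d.split(". ")
toy_predict = lambda pairs: [len(set(q.lower().split()) & set(c.lower().split()))
                             for q, c in pairs]
ranked = rerank_long_docs(
    "python memory leak",
    ["Gardening tips. Python memory leak hunting", "Rust ownership"],
    toy_chunk, toy_predict)
# The doc whose *best chunk* matches the query ranks first
```

Max-pooling over chunks is the common default; mean-pooling is an alternative when you want documents that are relevant throughout to outrank documents with one good passage.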

4. Using a reranker trained on a different domain

MS MARCO is web search queries. If your use case is legal document retrieval, biomedical search, or code search, a general MS MARCO reranker may underperform. Fine-tune on domain data, or at minimum evaluate on a domain-representative test set before deploying.

5. Skipping A/B testing

Offline metrics (NDCG, MRR) correlate with but do not guarantee online improvements. A reranker that improves NDCG@10 by 5% might not change user behavior if the original top-3 results were already good enough. Always A/B test reranking with real users before committing to the added infrastructure.

Key Takeaways

1. The bi-encoder bottleneck is real. Compressing a document to one vector before seeing the query loses critical information. Cross-encoders recover it through full cross-attention.

2. Two-stage retrieval is not optional for production RAG. Retrieve top 100 with a bi-encoder, rerank to top 10 with a cross-encoder. The 5–15% NDCG improvement translates directly to better LLM outputs.

3. ColBERT offers a middle ground. Late interaction (per-token matching) gives cross-encoder-like quality with bi-encoder-like pre-computation. Use it when latency budgets are tight.

4. Start with BGE-reranker-v2-m3 or Cohere Rerank. Open-source: bge-reranker-v2-m3 (best BEIR score, multilingual). API: Cohere Rerank 3.5 (best quality-to-latency, zero infrastructure). Scale down to MiniLM if cost is a constraint.

5. Measure before deploying. Reranking is not universally beneficial. If your bi-encoder already produces good top-5 results for your query distribution, the added latency and cost may not be justified. Build an evaluation set. Measure NDCG@10. Decide with data.
