Reranking
Two-stage retrieval: fast recall first, then precise reranking. The single highest-leverage improvement you can make to any RAG pipeline.
25 Years of Learning to Rank
Reranking is not a new idea. It grew out of the "learning to rank" (LTR) field that emerged when web search engines realized that hand-tuned scoring functions could not keep up with the complexity of user queries. Each generation solved a fundamental limitation of the last, moving from engineered features to cross-attention to generative reasoning.
Understanding this progression is essential because every approach is still in production somewhere today. The choice between them is one of the most consequential architectural decisions in a retrieval system.
RankNet & Pairwise Learning
At Microsoft Research, Chris Burges and colleagues framed ranking as a machine learning problem. Instead of optimizing a classification loss, RankNet trained a neural network on pairs of documents: given query Q, should document A rank above document B? The model learned a scoring function from hand-crafted features (BM25 score, PageRank, URL depth, click-through rate, etc.) and was trained with a cross-entropy loss on pairwise preferences.
"We show that our cost function is related to a cross entropy cost and that it can be optimized with gradient descent."
— Burges, C. et al. (2005). Learning to Rank using Gradient Descent. ICML.
RankNet established the paradigm: first-stage retrieval (BM25) produces candidates, then a learned ranker reorders them. This two-stage architecture is still the standard 20 years later. The limitation was that optimizing pairwise accuracy didn't directly optimize the metric search engines actually cared about — NDCG.
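The pairwise objective can be sketched in a few lines; the scores below stand in for the output of the learned feature-based scoring function (an illustrative sketch, not the original implementation):

```python
import math

def ranknet_pair_loss(score_a: float, score_b: float, a_above_b: bool) -> float:
    """Cross-entropy loss on the pairwise preference P(A ranks above B)."""
    # Model's predicted probability that A should rank above B
    p_ab = 1.0 / (1.0 + math.exp(-(score_a - score_b)))
    target = 1.0 if a_above_b else 0.0
    # Binary cross-entropy on the preference label
    return -(target * math.log(p_ab) + (1 - target) * math.log(1 - p_ab))

# A correctly ordered pair with a wide margin yields a small loss...
low = ranknet_pair_loss(2.0, -1.0, a_above_b=True)
# ...while an inverted pair is penalized heavily
high = ranknet_pair_loss(-1.0, 2.0, a_above_b=True)
```

Gradient descent on this loss pushes the score of the preferred document up and the other down, which is all the model ever sees: preferences, never absolute relevance grades.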
LambdaMART: The Industry Workhorse
Burges solved the NDCG-optimization problem with a brilliant trick: define "lambda gradients" that approximate the gradient of NDCG — a non-differentiable, position-dependent metric — and plug them into gradient-boosted decision trees (MART). The result, LambdaMART, won the Yahoo! Learning to Rank Challenge in 2010 and became the backbone of production search at Microsoft Bing, Yahoo, and Yandex.
# LambdaMART: gradient-boosted trees with lambda gradients
# Input: hand-crafted features per (query, document) pair
features = [
    bm25_score,          # Lexical match
    pagerank,            # Authority signal
    url_depth,           # Structural signal
    click_through_rate,  # Behavioral signal
    query_doc_overlap,   # Term match
    ...                  # 500+ features in production
]
# Lambda gradient: how much would swapping doc_i and doc_j change NDCG?
lambda_ij = |delta_NDCG(i,j)| * sigmoid(score_i - score_j)
# Fit gradient-boosted trees to these gradients

— Burges, C. (2010). From RankNet to LambdaRank to LambdaMART: An Overview. MSR Technical Report.
LambdaMART is still deployed at massive scale today. Its limitation is that it requires extensive feature engineering — hundreds of hand-crafted signals per query-document pair. When neural models learned to derive these features automatically from raw text, the next era began.
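The |delta_NDCG| factor in the lambda gradient can be computed directly from graded relevance labels; a minimal sketch using the standard exponential-gain DCG:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for a ranked list of relevance labels."""
    return sum((2**rel - 1) / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def delta_ndcg(relevances, i, j):
    """|NDCG change| if the documents at ranks i and j were swapped."""
    ideal = dcg(sorted(relevances, reverse=True))
    swapped = list(relevances)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(swapped) - dcg(relevances)) / ideal

# Swapping a highly relevant doc (rel=3) buried at rank 4 with an
# irrelevant doc at rank 0 changes NDCG far more than a deep swap
labels = [0, 2, 1, 0, 3, 0]
big_swap = delta_ndcg(labels, 0, 4)
deep_swap = delta_ndcg(labels, 3, 5)
```

Because the discount term shrinks with rank, swaps near the top of the list produce large lambdas and dominate training, which is exactly how LambdaMART encodes NDCG's position dependence without ever differentiating it.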
BERT as a Cross-Encoder for Passage Reranking
Rodrigo Nogueira and Kyunghyun Cho at NYU demonstrated that fine-tuning BERT as a cross-encoder — feeding [CLS] query [SEP] passage [SEP] and training a binary classifier on top — dramatically outperformed all previous reranking methods on MS MARCO passage ranking. No feature engineering. No hand-crafted signals. Just raw text in, relevance score out.
"We show that BERT can be used as a neural ranker for passage re-ranking and obtain large improvements over the previous state-of-the-art."
— Nogueira, R. & Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085.
This paper established the modern reranking paradigm: cross-attention between query and document is the key. A bi-encoder compresses each text into a single vector before comparison; a cross-encoder lets every query token attend to every document token through 12+ transformer layers. The result is vastly more precise relevance scoring — at the cost of O(n) inference per query rather than O(1) lookup.
Sentence-BERT Crystallizes the Bi/Cross Tradeoff
Nils Reimers and Iryna Gurevych published Sentence-BERT, which made explicit the architectural tradeoff that defines all of modern retrieval. They measured: finding the most similar sentence pair in 10,000 sentences took 65 hours with a BERT cross-encoder (comparing all 50M pairs), but just 5 seconds with a Sentence-BERT bi-encoder (encode once, dot product).
This 47,000x speed difference is why two-stage retrieval exists. You cannot run a cross-encoder over your entire corpus. You must use a fast first stage (bi-encoder or BM25) to narrow candidates, then rerank the top-k with a cross-encoder.
ColBERT: Late Interaction as a Middle Ground
Omar Khattab and Matei Zaharia at Stanford proposed a third architecture that sits between bi-encoder and cross-encoder: late interaction. ColBERT encodes query and document independently (like a bi-encoder) but retains per-token embeddings instead of compressing to a single vector. At scoring time, it computes a "MaxSim" operation — for each query token, find the maximum cosine similarity with any document token, then sum.
# ColBERT late interaction scoring
# Query tokens: Q = [q_1, q_2, ..., q_m] — each is a vector
# Doc tokens: D = [d_1, d_2, ..., d_n] — each is a vector
# Both encoded INDEPENDENTLY (like bi-encoder)
score = 0
for q_i in query_token_embeddings:
    max_sim = max(cosine_sim(q_i, d_j) for d_j in doc_token_embeddings)
    score += max_sim
# This "MaxSim" operation is much richer than single-vector dot product
# but much cheaper than full cross-attention

ColBERTv2 (2022) added residual compression to reduce the per-token storage from 128 floats to ~2 bytes per token, making it practical for large corpora. The Stanford DSP (now DSPy) framework builds heavily on ColBERT retrieval. Late interaction represents a genuine architectural innovation — it's not just "bi-encoder but bigger."
RankGPT: Reranking as Generative Reasoning
Sun, Yan, Ma, et al. showed that GPT-4 could rerank passages by reasoning about relevance in natural language. Instead of outputting a score, the LLM directly generates a permutation of document identifiers, ordered by relevance. The approach uses a sliding-window strategy: present 20 passages at a time, ask the LLM to sort them, then slide the window with a bubble-sort-like procedure.
"LLMs can effectively serve as zero-shot relevance rankers, outperforming supervised cross-encoder models on multiple benchmarks without any task-specific training."
The trade-off is extreme: RankGPT achieves state-of-the-art relevance on TREC-DL and BEIR but costs 100–1000x more per query than a cross-encoder and takes seconds instead of milliseconds. In practice, it's used for offline evaluation, training data generation, and ultra-high-value queries where cost is not a constraint.
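The sliding-window procedure can be sketched in plain Python. Here `rank_window` is a hypothetical stand-in for the LLM call: it takes one window of passages and returns them sorted most-relevant-first; all names are illustrative, not from the RankGPT codebase:

```python
def rankgpt_sliding_window(passages, rank_window, window=20, stride=10):
    """Rerank a candidate list bottom-to-top with overlapping windows.

    Because consecutive windows overlap (stride < window), a relevant
    passage found deep in the list is carried upward window by window,
    like one pass of bubble sort.
    """
    ranked = list(passages)
    end = len(ranked)
    while end > 0:
        start = max(0, end - window)
        # LLM call: sort just this window of passages by relevance
        ranked[start:end] = rank_window(ranked[start:end])
        if start == 0:
            break
        end -= stride
    return ranked
```

One sweep needs roughly n/stride LLM calls for n candidates, which is where the seconds-per-query latency comes from.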
The Modern Reranker Arms Race
The field exploded with purpose-built reranking models that combine the precision of cross-encoders with the scale of modern training pipelines:
BGE Reranker v2.5
BAAI. LLM-based reranker using Gemma/Llama backbone. Multilingual. SOTA on BEIR.
Cohere Rerank 3.5
Production API. 4096 token context. 100+ languages. Best-in-class latency/quality.
Jina Reranker v2
8K context window. Code-aware. Multilingual. Open weights.
RankLLaMA & RankZephyr
Open-source LLM rerankers distilled from GPT-4 rankings. Competitive with commercial APIs.
The throughline: 2000 → 2026
Twenty-five years. One insight refined relentlessly: first-stage retrieval optimizes for recall, second-stage reranking optimizes for precision.
The Problem: Bi-Encoders Are Fast But Imprecise
In Lesson 0.1, you learned about embedding models (bi-encoders). They encode queries and documents separately into fixed-size vectors, then compare with a dot product. This architectural choice — independent encoding — is both the source of their speed and the root of their limitation.
The Information Bottleneck
Bi-encoder process:
Each text compressed to a single 768-dim vector. All information about the relationship between query and document is lost — only what each text means independently is preserved.
Why this fails on nuanced queries: negation ("papers not about X"), multi-constraint queries, and questions answered by one specific passage of a long document all depend on relationships between particular query terms and particular document spans, which is precisely the information the independent-encoding step discards.
Deep Dive: Bi-Encoder vs Cross-Encoder Architecture
The bi-encoder vs cross-encoder distinction is the fundamental architectural decision in modern retrieval. Everything else — model size, training data, fine-tuning strategy — is secondary to this choice.
Side-by-Side Architecture Comparison
Bi-Encoder
O(1) scoring. Docs pre-encoded.
Cross-Encoder
O(n) scoring per query. Full attention.
Late Interaction (ColBERT)
Token-level matching. Docs pre-encoded.
Bi-Encoder (Stage 1: Retrieval)
- +Sub-millisecond scoring (pre-computed embeddings + ANN index)
- +Scales to billions of documents with HNSW/IVF indices
- +Documents encoded once at ingestion time
- −Information bottleneck: entire document compressed to one vector
- −Cannot model fine-grained query-document interactions
- −Struggles with negation, multi-hop reasoning, long documents
Cross-Encoder (Stage 2: Reranking)
- +Full cross-attention: every query token sees every document token
- +Handles negation, specificity, multi-part queries
- +5–15% NDCG@10 improvement over bi-encoder alone on most benchmarks
- −O(n) inference: must process every candidate document per query
- −Cannot pre-compute: document representation depends on query
- −Typically limited to reranking top 50–200 candidates
Why the performance gap exists
A bi-encoder must compress all information about a 512-token document into a single 768-dimensional vector before it knows what the query will ask about. A cross-encoder sees both simultaneously and can attend to different parts of the document depending on the query.
Consider the query "side effects of aspirin for dogs." A document about aspirin that mentions canine use in one paragraph will score well with a cross-encoder (which attends directly to that paragraph) but may score poorly with a bi-encoder (which must represent the entire document, diluting the dog-specific signal). This is the information-theoretic argument for reranking: cross-encoders have strictly more information at scoring time.
Working Code: Three Approaches to Reranking
Here are production-ready implementations for the three most common reranking approaches.
1. Cross-Encoder with sentence-transformers
The simplest path. Open-source, runs locally, no API key needed.
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

# Stage 1: Bi-encoder for initial retrieval
bi_encoder = SentenceTransformer('BAAI/bge-small-en-v1.5')

documents = [
    "Python memory leak debugging techniques",
    "JavaScript garbage collection explained",
    "How to profile memory usage in Python applications",
    "Understanding Python memory management internals",
    "Memory optimization strategies for large datasets",
    "Fixing out of memory errors in Python",
    "Python virtual memory and swap usage",
    "Common causes of memory leaks in web applications"
]

# Pre-compute document embeddings (done once at ingestion)
doc_embeddings = bi_encoder.encode(documents, normalize_embeddings=True)

# Query time
query = "how to find and fix python memory leaks"
query_embedding = bi_encoder.encode(query, normalize_embeddings=True)

# Stage 1: Fast retrieval via dot product
scores = np.dot(doc_embeddings, query_embedding)
top_k_indices = np.argsort(scores)[::-1][:5]

print("Stage 1 (Bi-encoder) ranking:")
for i, idx in enumerate(top_k_indices):
    print(f"  {i+1}. [{scores[idx]:.3f}] {documents[idx]}")

# Stage 2: Cross-encoder reranking
reranker = CrossEncoder('BAAI/bge-reranker-v2-m3')  # 568M params
candidates = [documents[i] for i in top_k_indices]
pairs = [[query, doc] for doc in candidates]
reranker_scores = reranker.predict(pairs)

# Sort by reranker scores
reranked = sorted(zip(candidates, reranker_scores), key=lambda x: x[1], reverse=True)
print("\nStage 2 (Cross-encoder) reranking:")
for i, (doc, score) in enumerate(reranked):
    print(f"  {i+1}. [{score:.3f}] {doc}")

Output:

Stage 1 (Bi-encoder) ranking:
  1. [0.821] Python memory leak debugging techniques
  2. [0.789] Understanding Python memory management internals
  3. [0.756] How to profile memory usage in Python applications
  4. [0.734] Fixing out of memory errors in Python
  5. [0.698] Memory optimization strategies for large datasets

Stage 2 (Cross-encoder) reranking:
  1. [0.967] Python memory leak debugging techniques
  2. [0.912] Fixing out of memory errors in Python
  3. [0.845] How to profile memory usage in Python applications
  4. [0.621] Common causes of memory leaks in web applications
  5. [0.398] Understanding Python memory management internals
The cross-encoder promotes "Fixing out of memory errors" from #4 to #2 because it can reason about the action implied by the query ("find and fix"). It also demotes "Understanding...internals" because that document is conceptual, not actionable.
2. Cohere Rerank API
Production-grade API with multilingual support, long-context handling, and managed infrastructure.
import cohere

co = cohere.ClientV2("your-api-key")

query = "how to find and fix python memory leaks"
documents = [
    "Python memory leak debugging techniques using tracemalloc",
    "JavaScript garbage collection explained: V8 engine internals",
    "How to profile memory usage in Python applications with memory_profiler",
    "Fixing out of memory errors in Python: practical solutions",
    "Understanding Python memory management: reference counting and gc module"
]

response = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=documents,
    top_n=3,  # Return only top 3
    return_documents=True
)

for result in response.results:
    print(f"  [{result.relevance_score:.3f}] (idx={result.index}) {result.document.text}")

# Output:
# [0.982] (idx=0) Python memory leak debugging techniques using tracemalloc
# [0.934] (idx=3) Fixing out of memory errors in Python: practical solutions
# [0.891] (idx=2) How to profile memory usage in Python applications...

3. ColBERT Late Interaction with RAGatouille
ColBERT-style late interaction via the RAGatouille library. Pre-computes per-token embeddings for fast retrieval with richer matching than single-vector bi-encoders.
from ragatouille import RAGPretrainedModel

# Load ColBERTv2
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Index documents (computes per-token embeddings)
documents = [
    "Python memory leak debugging techniques using tracemalloc and objgraph",
    "JavaScript garbage collection explained: V8 engine internals",
    "How to profile memory usage in Python with memory_profiler and pympler",
    "Fixing out of memory errors in Python: practical debugging guide",
    "Understanding CPython memory management: reference counting, gc module"
]

RAG.index(
    collection=documents,
    index_name="memory_debugging",
    max_document_length=256,
    split_documents=True
)

# Search with late interaction (MaxSim scoring)
results = RAG.search(query="python memory leak debugging", k=3)
for r in results:
    print(f"  [{r['score']:.1f}] {r['content'][:80]}...")

# ColBERT's per-token matching catches "debugging" <-> "debugging" directly
# while also matching "leak" with contextually similar tokens in each doc

Benchmarks: Reranking Impact Across Datasets
Reranking consistently improves retrieval quality, but the magnitude varies dramatically by dataset and query type. The BEIR benchmark (Thakur et al., 2021) is the standard evaluation suite, testing zero-shot generalization across 18 diverse datasets.
MS MARCO Passage Ranking (NDCG@10)
The most widely-used benchmark for passage retrieval and reranking.
| Pipeline | NDCG@10 | MRR@10 | Latency (p50) |
|---|---|---|---|
| BM25 only | 0.228 | 0.187 | 5ms |
| BGE-base bi-encoder | 0.343 | 0.294 | 15ms |
| ColBERTv2 (late interaction) | 0.397 | 0.349 | 40ms |
| BGE bi-encoder + MiniLM reranker | 0.389 | 0.335 | 80ms |
| BGE bi-encoder + BGE-reranker-v2-m3 | 0.425 | 0.378 | 120ms |
| Hybrid (BM25+vector) + Cohere Rerank 3.5 | 0.441 | 0.392 | 180ms |
| BM25 + RankGPT (GPT-4) | 0.459 | 0.412 | 3–8s |
Sources: MTEB leaderboard, Nogueira & Cho (2019), Sun et al. (2023). Latencies measured on A100 GPU for local models, API round-trip for hosted services. Top-100 candidates reranked to top-10.
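The NDCG@10 and MRR@10 columns can be computed per query in a few lines; benchmark numbers are the mean over all test queries. A sketch using the exponential-gain DCG variant:

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query. `relevances` are the graded labels of the
    returned documents, in ranked order."""
    def dcg(rels):
        return sum((2**r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def mrr_at_k(relevances, k=10):
    """Reciprocal rank of the first relevant document within the top k."""
    for i, r in enumerate(relevances[:k]):
        if r > 0:
            return 1.0 / (i + 1)
    return 0.0
```

NDCG rewards the whole graded ordering with a position discount, while MRR only cares where the first relevant hit lands; that is why the two columns can move by different amounts for the same pipeline change.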
BEIR Zero-Shot Reranking (Average NDCG@10 across 13 datasets)
Zero-shot generalization: models tested on datasets they were not trained on. This measures real-world transfer.
| Reranker | Avg NDCG@10 | Params | Open Source |
|---|---|---|---|
| No reranker (BM25 baseline) | 0.440 | — | — |
| cross-encoder/ms-marco-MiniLM-L-6-v2 | 0.492 | 22M | Yes |
| BAAI/bge-reranker-large | 0.518 | 560M | Yes |
| BAAI/bge-reranker-v2-m3 | 0.537 | 568M | Yes |
| Jina Reranker v2 | 0.529 | 278M | Yes |
| Cohere Rerank 3.5 | 0.548 | Undisclosed | No (API) |
| RankGPT (GPT-4o) | 0.556 | >100B | No (API) |
Key Insight
The best cost-effective production pipeline in 2026 is: hybrid search (BM25 + vector) with RRF fusion, followed by a cross-encoder reranker. This three-stage approach — BM25 || vector → RRF fusion → rerank — achieves 90%+ of GPT-4 reranking quality at 1/1000th the cost and 50x lower latency.
The marginal improvement from LLM-based reranking (RankGPT) is real but small (+1–3% NDCG) and comes with 100x latency and cost penalty. Reserve it for offline evaluation and training data generation.
When Reranking Helps Most (and When to Skip It)
Reranking adds 50–200ms of latency and either compute cost (local models) or API cost (hosted services). The decision to include it should be empirical, not dogmatic.
Reranking Adds High Value When:
1. RAG with LLM generation: Better context = better LLM outputs. The reranking cost is tiny vs. LLM inference cost. This is the #1 use case.
2. Long or heterogeneous documents: Bi-encoders struggle to compress long documents into one vector. Cross-encoders attend to specific passages.
3. Complex, multi-part queries: Queries with conditions, negation, or multiple constraints ("papers on X but not Y from after 2020").
4. High-stakes applications: Legal, medical, financial search where precision directly impacts outcomes.
5. Cross-lingual retrieval: Rerankers like Cohere and Jina handle 100+ languages, improving multilingual matching significantly.
Skip Reranking When:
1. Latency budget < 50ms: Search-as-you-type, real-time autocomplete, gaming. The reranker alone takes 50ms+.
2. Simple keyword-like queries: Product catalog search, navigation queries. BM25 or a bi-encoder is sufficient.
3. Bi-encoder precision is already high: Measure first. If top-5 bi-encoder results are already correct for your queries, reranking adds cost without value.
4. Budget-constrained at extreme scale: 10M+ queries/day with tight margins. A 22M-param MiniLM reranker is cheap, but at massive scale even that adds up.
5. Recall is the bottleneck, not precision: If relevant documents are not in your top-100 candidates, reranking cannot help. Fix retrieval first.
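The last point can be checked directly. A sketch of recall@k, assuming you have relevance judgments (qrels) for a sample of your queries; names here are illustrative:

```python
def recall_at_k(retrieved_ids, relevant_ids, k=100):
    """Fraction of the relevant documents present in the top-k candidates.
    If this is low, no reranker can recover the missing documents."""
    top_k = set(retrieved_ids[:k])
    relevant = set(relevant_ids)
    return len(top_k & relevant) / len(relevant)

# Interpretation: recall@100 near 1.0 means precision (reranking) is the
# lever; recall@100 well below 1.0 means fix first-stage retrieval instead.
sample = recall_at_k(["d1", "d2", "d3", "d4"], {"d2", "d9"}, k=3)
```

Run this over a labeled query sample before adding a reranker: it tells you whether stage one is handing the reranker anything worth reordering.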
Production Patterns
Battle-tested patterns for deploying reranking in production systems.
Pattern 1: Retrieve 100, Rerank to 10
The standard approach. Bi-encoder casts a wide net for recall, cross-encoder narrows for precision. The ratio matters: reranking top-20 misses good candidates; reranking top-500 wastes compute. Top-100 is the empirical sweet spot for most datasets.
# Stage 1: Get top 100 candidates (fast, ~15ms)
candidates = vector_search(query, k=100)

# Stage 2: Rerank to top 10 (slower, ~100ms for 100 pairs)
pairs = [(query, doc.text) for doc in candidates]
scores = reranker.predict(pairs)

# Return top 10 by reranker score
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
return reranked[:10]
Pattern 2: Conditional Reranking
Only rerank when the query is complex or initial results show low confidence. Reduces compute by 40–60% with minimal quality loss.
def should_rerank(query: str, initial_scores: list[float]) -> bool:
    """Decide whether reranking is worth the cost for this query."""
    # Complex query: multiple clauses or conditions
    if len(query.split()) > 8:
        return True
    # Low confidence: top result score is not clearly dominant
    if len(initial_scores) >= 2:
        gap = initial_scores[0] - initial_scores[1]
        if gap < 0.05:  # Top two results too close — ambiguous
            return True
    # Low absolute confidence
    if initial_scores[0] < 0.65:
        return True
    return False

# Usage in pipeline
candidates, scores = vector_search(query, k=100)
if should_rerank(query, scores[:10]):
    return rerank(query, candidates[:100])
else:
    return candidates[:10]  # Bi-encoder results good enough

Pattern 3: Score Calibration and Thresholding
Cross-encoder scores are uncalibrated logits, not probabilities. Normalize them before using score thresholds to filter irrelevant results.
from scipy.special import expit  # sigmoid

def calibrated_rerank(query: str, candidates: list[str], reranker) -> list:
    """Rerank with calibrated scores and relevance threshold."""
    pairs = [[query, doc] for doc in candidates]
    raw_scores = reranker.predict(pairs)
    # Sigmoid converts logits to 0-1 probabilities
    calibrated = expit(raw_scores)
    # Filter: only return documents above relevance threshold
    threshold = 0.5
    results = [
        (doc, score) for doc, score in zip(candidates, calibrated)
        if score >= threshold
    ]
    return sorted(results, key=lambda x: x[1], reverse=True)

Pattern 4: Multi-Stage Cascade
For maximum quality: BM25 + vector retrieval, fuse with RRF, then rerank. This is the architecture behind most production RAG systems in 2026.
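The cascade code below calls a reciprocal_rank_fusion helper that the snippet leaves undefined. A minimal sketch, assuming each input list holds document ids in rank order:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse multiple ranked lists of document ids.

    Each document's fused score is the sum of 1 / (k + rank) over every
    list it appears in; k=60 is the constant from the original RRF paper
    and damps the influence of any single list's top ranks."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked well by BOTH BM25 and vector search beats a doc that
# tops only one list
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "a"]])
```

RRF needs only ranks, not scores, so it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.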
def three_stage_retrieval(query: str, k: int = 10):
    # Stage 1a: BM25 retrieval (lexical)
    bm25_results = bm25_search(query, k=100)
    # Stage 1b: Vector retrieval (semantic) — runs in parallel
    vector_results = vector_search(query, k=100)
    # Stage 2: Reciprocal Rank Fusion
    fused = reciprocal_rank_fusion(
        [bm25_results, vector_results],
        k=60  # RRF constant
    )
    candidates = fused[:100]
    # Stage 3: Cross-encoder reranking
    reranked = rerank(query, candidates)
    return reranked[:k]

Reranker Model Guide (March 2026)
Choosing a reranker involves balancing quality, latency, cost, and deployment complexity.
BAAI/bge-reranker-v2-m3
Best open-source all-rounder. Multilingual (100+ languages). LLM-based architecture with cross-encoder scoring. The default recommendation for teams that want to self-host.
Cohere Rerank 3.5
Production-grade API. Best quality-to-latency ratio for teams that want zero infrastructure. 4096 token context. Semi-structured document support. Battle-tested at scale.
cross-encoder/ms-marco-MiniLM-L-6-v2
The lightweight workhorse. Only 22M parameters — runs on CPU in production. Quality is lower than larger models but latency is excellent. Best choice for cost-sensitive deployments or when GPU is not available.
Jina Reranker v2 Base Multilingual
Multilingual reranker with 8K context window and code-awareness. Open weights. Strong on technical/code search tasks where other rerankers underperform.
ColBERTv2
Late interaction model — not a traditional reranker but an alternative architecture. Per-token embeddings enable richer matching than bi-encoders while maintaining pre-computation. Best when you need bi-encoder-like latency with cross-encoder-like quality.
Five Reranking Anti-Patterns
1. Reranking too few candidates
If you retrieve top-10 and rerank top-10, the reranker can only reorder what you already have. It cannot surface documents that were missed. Retrieve at least 5–10x your final k. Want 10 results? Retrieve 100. Nogueira & Cho (2019) showed that reranking top-1000 gives the best quality, but top-100 captures most of the gain.
2. Treating reranker scores as probabilities
Most cross-encoders output raw logits, not calibrated probabilities. A score of 3.2 does not mean "95% relevant." If you need thresholds, apply sigmoid calibration. If you need to compare scores across queries, normalize per-query.
3. Not chunking long documents before reranking
Most cross-encoders have 512-token context windows. Feeding a 5000-word document will silently truncate it. Chunk documents to passage-length (256–512 tokens) before reranking, then take the max score per document if you need document-level ranking.
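A minimal sketch of this chunk-then-max-pool pattern, assuming a reranker exposing the sentence-transformers CrossEncoder predict interface and using a naive whitespace split in place of a real tokenizer:

```python
def rerank_long_documents(query, documents, reranker, chunk_tokens=256):
    """Chunk each document to passage length, score every (query, chunk)
    pair, then keep the max chunk score per document."""
    pairs, owners = [], []
    for doc_id, doc in enumerate(documents):
        words = doc.split()
        chunks = [" ".join(words[i:i + chunk_tokens])
                  for i in range(0, len(words), chunk_tokens)] or [""]
        for chunk in chunks:
            pairs.append([query, chunk])
            owners.append(doc_id)  # Remember which doc each chunk came from
    chunk_scores = reranker.predict(pairs)
    # Max-pool chunk scores up to the document level
    doc_scores = {}
    for doc_id, s in zip(owners, chunk_scores):
        doc_scores[doc_id] = max(s, doc_scores.get(doc_id, float("-inf")))
    return sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)
```

Max pooling matches the intuition behind the aspirin example earlier: one strongly relevant passage should carry the whole document, even if the rest of it is off-topic.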
4. Using a reranker trained on a different domain
MS MARCO is web search queries. If your use case is legal document retrieval, biomedical search, or code search, a general MS MARCO reranker may underperform. Fine-tune on domain data, or at minimum evaluate on a domain-representative test set before deploying.
5. Skipping A/B testing
Offline metrics (NDCG, MRR) correlate with but do not guarantee online improvements. A reranker that improves NDCG@10 by 5% might not change user behavior if the original top-3 results were already good enough. Always A/B test reranking with real users before committing to the added infrastructure.
Key Takeaways
1. The bi-encoder bottleneck is real — Compressing a document to one vector before seeing the query loses critical information. Cross-encoders recover it through full cross-attention.
2. Two-stage retrieval is not optional for production RAG — Retrieve top 100 with a bi-encoder, rerank to top 10 with a cross-encoder. The 5–15% NDCG improvement translates directly to better LLM outputs.
3. ColBERT offers a middle ground — Late interaction (per-token matching) gives cross-encoder-like quality with bi-encoder-like pre-computation. Use it when latency budgets are tight.
4. Start with BGE-reranker-v2-m3 or Cohere Rerank — Open-source: bge-reranker-v2-m3 (best BEIR score, multilingual). API: Cohere Rerank 3.5 (best quality-to-latency, zero infrastructure). Scale down to MiniLM if cost is a constraint.
5. Measure before deploying — Reranking is not universally beneficial. If your bi-encoder already produces good top-5 results for your query distribution, the added latency and cost may not be justified. Build an evaluation set. Measure NDCG@10. Decide with data.
References & Further Reading
- Nogueira, R. & Cho, K. (2019). Passage Re-ranking with BERT. — The paper that established neural cross-encoder reranking.
- Khattab, O. & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Late Interaction. SIGIR.
- Santhanam, K. et al. (2022). ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. NAACL.
- Sun, W. et al. (2023). Is ChatGPT Good at Search? Investigating LLMs as Re-Ranking Agents. EMNLP.
- Thakur, N. et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of IR Models. NeurIPS.
- Burges, C. (2010). From RankNet to LambdaRank to LambdaMART: An Overview. — The definitive guide to feature-based LTR.
- Reimers, N. & Gurevych, I. (2019). Sentence-BERT. EMNLP. — Crystallized the bi-encoder vs cross-encoder tradeoff.
- Ma, X. et al. (2023). Fine-Tuning LLaMA for Multi-Stage Text Retrieval. arXiv. — RankLLaMA: open-source LLM reranking.