Reranking
Two-stage retrieval: fast recall first, then precise reranking. The secret to production RAG quality.
The Problem: Bi-encoders Are Fast But Imprecise
In Lesson 1.1, you learned about embedding models (bi-encoders). They encode queries and documents separately, then compare the resulting vectors with a dot product. This is fast, but it misses nuanced relationships between query and document.
Bi-encoder Limitation: No Cross-Attention
Bi-encoder process: the query and the document are encoded independently, then compared with a dot product.
The problem: query and document never "see" each other during encoding, so the model cannot reason about how they relate. "python memory leak" and "python memory management" get similar scores even though one is about a problem and the other about a concept.
The Solution: Cross-Encoder Reranking
A cross-encoder processes query and document together, allowing full attention between them. It outputs a single relevance score.
Cross-Encoder Process
Input: "[CLS] query [SEP] document [SEP]" -> Cross-Encoder (full attention across query and document tokens) -> Output: a single relevance score, e.g. 0.87.
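Running the same pair of documents through a cross-encoder, as a quick sketch (scores are illustrative raw model outputs, not probabilities):
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('BAAI/bge-reranker-large')

# The library builds the "[CLS] query [SEP] document [SEP]" input internally
scores = reranker.predict([
    ["python memory leak", "diagnosing a python memory leak"],
    ["python memory leak", "python memory management basics"],
])
print(scores)  # the problem-focused document should score clearly higher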
Bi-encoder (Stage 1)
- Pro: Extremely fast (embeddings pre-computed)
- Pro: Scales to millions of documents
- Pro: Good for broad recall (top 100)
- Con: Limited precision for nuanced queries
Cross-encoder (Stage 2)
- Pro: High precision relevance scoring
- Pro: Understands query-document relationships
- Pro: Great for ranking top candidates
- Con: Slow, one cross-encoder forward pass per candidate (O(n) per query)
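The asymmetry is easy to see with a back-of-the-envelope cost model; every number below is an assumption for illustration, not a benchmark:
# Rough per-query cost model (illustrative numbers only)
n_docs = 1_000_000       # corpus size
n_candidates = 100       # candidates sent to the reranker
bi_encoder_ms = 20       # assumed: query embedding + ANN lookup
per_pair_ms = 1.0        # assumed: one cross-encoder forward pass

rerank_ms = n_candidates * per_pair_ms  # scales with candidates, not corpus size
full_cross_ms = n_docs * per_pair_ms    # why cross-encoding everything is infeasible

print(f"two-stage: ~{bi_encoder_ms + rerank_ms:.0f} ms")  # ~120 ms
print(f"cross-encode entire corpus: ~{full_cross_ms / 1000:.0f} s")  # ~1000 s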
Implementation: Two-Stage Retrieval
The standard pattern: retrieve top-k candidates with bi-encoder, then rerank with cross-encoder.
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
# Stage 1: Bi-encoder for initial retrieval
bi_encoder = SentenceTransformer('BAAI/bge-small-en-v1.5')
documents = [
"Python memory leak debugging techniques",
"JavaScript garbage collection explained",
"How to profile memory usage in Python applications",
"Understanding Python memory management internals",
"Memory optimization strategies for large datasets",
"Fixing out of memory errors in Python",
"Python virtual memory and swap usage",
"Common causes of memory leaks in web applications"
]
# Pre-compute document embeddings
doc_embeddings = bi_encoder.encode(documents, normalize_embeddings=True)
# Query
query = "python memory leak"
query_embedding = bi_encoder.encode(query, normalize_embeddings=True)
# Get top-k candidates (all 8 docs here; in production you'd pull ~100 from a vector index)
scores = np.dot(doc_embeddings, query_embedding)
top_k_indices = np.argsort(scores)[::-1][:100]
candidates = [documents[i] for i in top_k_indices]
print("Stage 1 (Bi-encoder) - Top 5:")
for idx in top_k_indices[:5]:
    print(f"  {scores[idx]:.3f}: {documents[idx]}")
# Stage 2: Cross-encoder reranking
reranker = CrossEncoder('BAAI/bge-reranker-large')
# Create query-document pairs
pairs = [[query, doc] for doc in candidates]
# Get reranker scores
reranker_scores = reranker.predict(pairs)
# Sort by reranker scores
reranked_indices = np.argsort(reranker_scores)[::-1]
print("\nStage 2 (Cross-encoder) - Top 5:")
for i in reranked_indices[:5]:
    print(f"  {reranker_scores[i]:.3f}: {candidates[i]}")

Output:

Stage 1 (Bi-encoder) - Top 5:
  0.821: Python memory leak debugging techniques
  0.789: Understanding Python memory management internals
  0.756: How to profile memory usage in Python applications
  0.734: Fixing out of memory errors in Python
  0.698: Memory optimization strategies for large datasets

Stage 2 (Cross-encoder) - Top 5:
  0.967: Python memory leak debugging techniques
  0.891: Fixing out of memory errors in Python
  0.823: How to profile memory usage in Python applications
  0.654: Understanding Python memory management internals
  0.432: Common causes of memory leaks in web applications
Note: the cross-encoder promotes "Fixing out of memory errors in Python" (an action-oriented match for "leak") and demotes the concept-oriented "memory management internals" document.
Popular Reranker Models
Several high-quality reranker models are available, both open source and API-based.
BAAI/bge-reranker-large
Strong open-source reranker. 560M parameters.
Cohere Rerank
Production-grade API. Multilingual support. Very high quality.
cross-encoder/ms-marco-MiniLM-L-6-v2
Lightweight and fast, trained on MS MARCO. Good balance of speed and quality.
jina-reranker-v2-base-multilingual
Multilingual reranker. 100+ languages supported.
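For example, Cohere's hosted reranker takes the query and raw document texts and returns scored results; the API key below is a placeholder.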
import cohere
co = cohere.Client("your-api-key")
query = "python memory leak"
documents = [
"Python memory leak debugging techniques",
"JavaScript garbage collection explained",
"Fixing out of memory errors in Python"
]
response = co.rerank(
model="rerank-english-v3.0",
query=query,
documents=documents,
top_n=3,
return_documents=True
)
for result in response.results:
    print(f"{result.relevance_score:.3f}: {result.document.text}")
When Reranking Helps Most
Reranking adds latency and cost. It's most valuable in specific scenarios:
High Value Scenarios
1. Long documents: bi-encoders struggle with long text; cross-encoders handle the full context.
2. Complex queries: multi-part questions, negation, specific requirements.
3. High-stakes applications: legal, medical, financial, where precision matters most.
4. RAG with LLM generation: better context means better LLM outputs, worth the latency (see the sketch after this list).
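A sketch of the RAG case, assuming a hypothetical generate() LLM call; the reranker is the same model used earlier:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('BAAI/bge-reranker-large')

def build_context(query, candidates, n_ctx=4):
    # Rerank the retrieved candidates and keep only the best few for the prompt
    scores = reranker.predict([[query, doc] for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return "\n\n".join(doc for doc, _ in ranked[:n_ctx])

# prompt = f"Context:\n{build_context(query, candidates)}\n\nQuestion: {query}"
# answer = generate(prompt)  # hypothetical LLM call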
Skip Reranking When
1. Latency-critical: search-as-you-type, real-time autocomplete.
2. Simple keyword lookups: product searches, navigation queries.
3. Already high precision: if bi-encoder results are good enough, skip the cost.
4. Cost-sensitive at scale: millions of queries/day with tight margins.
Benchmark: NDCG Improvement from Reranking
Reranking typically improves NDCG@10 by 5-15% over bi-encoder alone. The gains are largest on complex queries and long documents.
MS MARCO Passage Ranking (NDCG@10)
| Method | NDCG@10 | Latency | Improvement vs BM25 |
|---|---|---|---|
| BM25 only | 0.228 | 5ms | baseline |
| BGE bi-encoder | 0.343 | 20ms | +50.4% |
| BGE bi-encoder + MiniLM reranker | 0.389 | 80ms | +70.6% |
| BGE bi-encoder + BGE reranker | 0.412 | 150ms | +80.7% |
| Hybrid + Cohere rerank | 0.435 | 200ms | +90.8% |
Key Insight
The best results combine hybrid search (BM25 + vector) with cross-encoder reranking. This three-stage pipeline (BM25 and vector retrieval in parallel -> RRF fusion -> rerank) achieves the highest quality while keeping latency under 200ms; a sketch follows below.
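A minimal sketch of that pipeline, assuming placeholder bm25_search and vector_search helpers that each return ranked document IDs, and using the conventional RRF constant k=60:
from collections import defaultdict
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('BAAI/bge-reranker-large')

def rrf_fuse(rankings, k=60):
    # Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def three_stage(query, corpus, top_n=10):
    # Stage 1: run both retrievers, fuse with RRF (placeholder search functions)
    fused = rrf_fuse([bm25_search(query, k=100), vector_search(query, k=100)])
    candidates = fused[:100]
    # Stage 2: cross-encoder rerank of the fused candidate pool
    scores = reranker.predict([[query, corpus[d]] for d in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_n]]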
Production Patterns
Here are battle-tested patterns for implementing reranking in production.
Pattern 1: Retrieve 100, Rerank to 10
The standard approach. Cast a wide net with bi-encoder, then filter to the best results.
# Stage 1: Get top 100 candidates (fast)
candidates = vector_search(query, k=100)

# Stage 2: Rerank to top 10 (slower, higher quality)
reranked = reranker.predict([(query, doc) for doc in candidates])
top_10 = sorted(zip(candidates, reranked), key=lambda x: x[1], reverse=True)[:10]
Pattern 2: Conditional Reranking
Only rerank when query is complex or initial results have low confidence.
import numpy as np

def should_rerank(query: str, initial_scores: list) -> bool:
    # Complex query detection
    if len(query.split()) > 5:
        return True
    # Low confidence initial results
    if max(initial_scores) < 0.7:
        return True
    # High variance in scores (unclear ranking)
    if np.std(initial_scores[:10]) > 0.15:
        return True
    return False
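Wiring it into the pipeline might look like this, assuming (as in Pattern 1) a placeholder vector_search that here also returns its similarity scores:
candidates, initial_scores = vector_search(query, k=100, return_scores=True)

if should_rerank(query, list(initial_scores)):
    # Query looks hard or retrieval looks shaky: pay for the cross-encoder
    scores = reranker.predict([(query, doc) for doc in candidates])
    results = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:10]
else:
    # Initial ranking is trusted; skip the reranking cost entirely
    results = list(zip(candidates, initial_scores))[:10]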
Pattern 3: Batched Reranking
For high-throughput systems, batch multiple query-document pairs.
# Batch inference is much faster than sequential
all_pairs = []
for query, candidates in queries_with_candidates:
    all_pairs.extend([(query, doc) for doc in candidates])
# Single batched call
all_scores = reranker.predict(all_pairs, batch_size=32)
# Unbatch results
idx = 0
results = []
for query, candidates in queries_with_candidates:
    n = len(candidates)
    scores = all_scores[idx:idx + n]
    results.append(sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True))
    idx += n
Key Takeaways
1. Bi-encoders are fast but imprecise: they encode query and document separately, missing nuanced relationships.
2. Cross-encoders see both together: full attention between query and document yields much higher precision.
3. Two-stage retrieval is the pattern: retrieve top 100 with the bi-encoder, rerank to top 10 with the cross-encoder.
4. BGE-reranker-large is a strong open-source option; for an API, Cohere Rerank is production-ready with multilingual support.