Level 3: Production · ~35 min

Reranking

Two-stage retrieval: fast recall first, then precise reranking. The single highest-leverage improvement you can make to any RAG pipeline.

25 Years of Learning to Rank

Reranking is not a new idea. It grew out of the "learning to rank" (LTR) field that emerged when web search engines realized that hand-tuned scoring functions could not keep up with the complexity of user queries. Each generation solved a fundamental limitation of the last, moving from engineered features to cross-attention to generative reasoning.

Understanding this progression is essential because every approach is still in production somewhere today. The choice between them is one of the most consequential architectural decisions in a retrieval system.

Era I: Feature-Based Learning to Rank
2000–2005

RankNet & Pairwise Learning

At Microsoft Research, Chris Burges and colleagues framed ranking as a machine learning problem. Instead of scoring documents independently with a pointwise classification loss, RankNet trained a neural network on pairs of documents: given query Q, should document A rank above document B? The model learned a scoring function from hand-crafted features (BM25 score, PageRank, URL depth, click-through rate, etc.) and was trained with a cross-entropy loss on pairwise preferences.

"We show that our cost function is related to a cross entropy cost and that it can be optimized with gradient descent."

Burges, C. et al. (2005). Learning to Rank using Gradient Descent. ICML.

RankNet established the paradigm: first-stage retrieval (BM25) produces candidates, then a learned ranker reorders them. This two-stage architecture is still the standard 20 years later. The limitation was that optimizing pairwise accuracy didn't directly optimize the metric search engines actually cared about — NDCG.
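The pairwise objective is simple enough to state in code. Here is a minimal sketch of the RankNet loss in plain NumPy (illustrative, not the original implementation):

```python
import numpy as np

def ranknet_pair_loss(s_i: float, s_j: float, p_target: float = 1.0) -> float:
    """Cross-entropy on the pairwise preference P(doc_i ranks above doc_j).

    s_i, s_j are the model's scores for the two documents; p_target = 1.0
    means doc_i is the preferred document in this training pair."""
    p_model = 1.0 / (1.0 + np.exp(-(s_i - s_j)))  # sigmoid of the score gap
    return float(-(p_target * np.log(p_model)
                   + (1 - p_target) * np.log(1 - p_model)))

# A correctly ordered pair costs little; a mis-ordered pair costs a lot
assert ranknet_pair_loss(2.0, -2.0) < ranknet_pair_loss(-2.0, 2.0)
```

Gradient descent on this loss pushes the score gap in the direction of the labeled preference, regardless of what position either document ends up at, which is exactly the limitation the next generation addressed.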

2007–2010

LambdaMART: The Industry Workhorse

Burges solved the NDCG-optimization problem with a brilliant trick: define "lambda gradients" that approximate the gradient of NDCG — a non-differentiable, position-dependent metric — and plug them into gradient-boosted decision trees (MART). The result, LambdaMART, won the Yahoo! Learning to Rank Challenge in 2010 and became the backbone of production search at Microsoft Bing, Yahoo, and Yandex.

# LambdaMART: gradient-boosted trees with lambda gradients
# Input: hand-crafted features per (query, document) pair
features = [
    bm25_score,        # Lexical match
    pagerank,          # Authority signal
    url_depth,         # Structural signal
    click_through_rate, # Behavioral signal
    query_doc_overlap, # Term match
    ...                # 500+ features in production
]
# Lambda gradient: how much would swapping doc_i and doc_j change NDCG,
# scaled by how badly the pair is currently mis-ordered
# (for a pair where doc_i should rank above doc_j)
lambda_ij = |delta_NDCG(i,j)| * sigmoid(score_j - score_i)
# Fit gradient-boosted trees to these gradients

Burges, C. (2010). From RankNet to LambdaRank to LambdaMART: An Overview. MSR Technical Report.

LambdaMART is still deployed at massive scale today. Its limitation is that it requires extensive feature engineering — hundreds of hand-crafted signals per query-document pair. When neural models learned to derive these features automatically from raw text, the next era began.
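The lambda trick depends on computing how much a single swap would move NDCG. A self-contained sketch (the helper names here are illustrative, not from the paper):

```python
import numpy as np

def dcg(relevances: np.ndarray) -> float:
    """Discounted cumulative gain for a ranked list of graded relevances."""
    positions = np.arange(1, len(relevances) + 1)
    return float(np.sum((2.0 ** relevances - 1) / np.log2(positions + 1)))

def delta_ndcg(relevances: list[float], i: int, j: int) -> float:
    """|NDCG change| from swapping the documents at ranks i and j."""
    rel = np.array(relevances, dtype=float)
    ideal = dcg(np.sort(rel)[::-1])   # DCG of the perfect ordering
    before = dcg(rel)
    rel[i], rel[j] = rel[j], rel[i]
    return abs(dcg(rel) - before) / ideal

# Swapping a highly relevant doc from the bottom to the top moves NDCG
# far more than an equally "wrong" swap deep in the list -- that position
# sensitivity is what the lambda gradients inject into the pairwise loss
assert delta_ndcg([0, 3, 2, 1, 3], 0, 4) > delta_ndcg([0, 3, 2, 1, 3], 2, 3)
```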

Era II: Neural Cross-Encoders
2019

BERT as a Cross-Encoder for Passage Reranking

Rodrigo Nogueira and Kyunghyun Cho at NYU demonstrated that fine-tuning BERT as a cross-encoder — feeding [CLS] query [SEP] passage [SEP] and training a binary classifier on top — dramatically outperformed all previous reranking methods on MS MARCO passage ranking. No feature engineering. No hand-crafted signals. Just raw text in, relevance score out.

"We show that BERT can be used as a neural ranker for passage re-ranking and obtain large improvements over the previous state-of-the-art."

Nogueira, R. & Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085.

This paper established the modern reranking paradigm: cross-attention between query and document is the key. A bi-encoder compresses each text into a single vector before comparison; a cross-encoder lets every query token attend to every document token through 12+ transformer layers. The result is vastly more precise relevance scoring — at the cost of O(n) inference per query rather than O(1) lookup.

2019

Sentence-BERT Crystallizes the Bi/Cross Tradeoff

Nils Reimers and Iryna Gurevych published Sentence-BERT, which made explicit the architectural tradeoff that defines all of modern retrieval. They measured: finding the most similar sentence pair in 10,000 sentences took 65 hours with a BERT cross-encoder (comparing all 50M pairs), but just 5 seconds with a Sentence-BERT bi-encoder (encode once, dot product).

This 47,000x speed difference is why two-stage retrieval exists. You cannot run a cross-encoder over your entire corpus. You must use a fast first stage (bi-encoder or BM25) to narrow candidates, then rerank the top-k with a cross-encoder.

Reimers, N. & Gurevych, I. (2019). Sentence-BERT. EMNLP.
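The arithmetic behind that speedup figure is worth doing once:

```python
n = 10_000
pairs = n * (n - 1) // 2          # every sentence pair must be scored: ~50M
cross_encoder_secs = 65 * 3600    # 65 hours of pairwise BERT inference
bi_encoder_secs = 5               # encode once, then 50M dot products
speedup = cross_encoder_secs / bi_encoder_secs
print(f"{pairs:,} pairs, {speedup:,.0f}x speedup")
# 49,995,000 pairs, 46,800x speedup
```

The ratio rounds to the ~47,000x quoted above, and it scales quadratically: doubling the corpus quadruples the cross-encoder cost while the bi-encoder cost grows linearly.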

2020–2023

ColBERT: Late Interaction as a Middle Ground

Omar Khattab and Matei Zaharia at Stanford proposed a third architecture that sits between bi-encoder and cross-encoder: late interaction. ColBERT encodes query and document independently (like a bi-encoder) but retains per-token embeddings instead of compressing to a single vector. At scoring time, it computes a "MaxSim" operation — for each query token, find the maximum cosine similarity with any document token, then sum.

# ColBERT late interaction scoring
# Query tokens: Q = [q_1, q_2, ..., q_m]  — each is a vector
# Doc tokens:   D = [d_1, d_2, ..., d_n]  — each is a vector
# Both encoded INDEPENDENTLY (like bi-encoder)

score = 0
for q_i in query_token_embeddings:
    max_sim = max(cosine_sim(q_i, d_j) for d_j in doc_token_embeddings)
    score += max_sim

# This "MaxSim" operation is much richer than single-vector dot product
# but much cheaper than full cross-attention

Khattab, O. & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction. SIGIR.

ColBERTv2 (2022) added residual compression, shrinking each 128-dimensional token vector from full-precision floats down to roughly 20–36 bytes, making it practical for large corpora. The Stanford DSP (now DSPy) framework builds heavily on ColBERT retrieval. Late interaction represents a genuine architectural innovation; it's not just "bi-encoder but bigger."

Santhanam, K. et al. (2022). ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. NAACL.
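The MaxSim loop above vectorizes to two NumPy operations. A toy sketch, with random vectors standing in for real ColBERT token embeddings:

```python
import numpy as np

def maxsim_score(Q: np.ndarray, D: np.ndarray) -> float:
    """Late-interaction score: best document-token match for each query
    token, summed. Q is (m, dim), D is (n, dim); rows are L2-normalized
    so the dot product equals cosine similarity."""
    sim = Q @ D.T                        # (m, n) pairwise cosine similarities
    return float(sim.max(axis=1).sum())  # MaxSim per query token, then sum

rng = np.random.default_rng(0)
unit = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
Q = unit(rng.standard_normal((4, 8)))                      # 4 query tokens
D_match = unit(Q + 0.05 * rng.standard_normal((4, 8)))     # near-duplicates
D_other = unit(rng.standard_normal((12, 8)))               # unrelated doc
assert maxsim_score(Q, D_match) > maxsim_score(Q, D_other)
```

Because documents are encoded independently, the D matrices can all be pre-computed at ingestion time; only the small Q @ D.T product happens at query time.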

Era III: LLM-Based Reranking
2023

RankGPT: Reranking as Generative Reasoning

Sun, Yan, Ma, et al. showed that GPT-4 could rerank passages by reasoning about relevance in natural language. Instead of outputting a score, the LLM directly generates a permutation of document identifiers, ordered by relevance. The approach uses a sliding-window strategy: present 20 passages at a time, ask the LLM to sort them, then slide the window with a bubble-sort-like procedure.

"LLMs can effectively serve as zero-shot relevance rankers, outperforming supervised cross-encoder models on multiple benchmarks without any task-specific training."

Sun, W. et al. (2023). Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents. EMNLP.

The trade-off is extreme: RankGPT achieves state-of-the-art relevance on TREC-DL and BEIR but costs 100–1000x more per query than a cross-encoder and takes seconds instead of milliseconds. In practice, it's used for offline evaluation, training data generation, and ultra-high-value queries where cost is not a constraint.
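The sliding-window procedure is easy to sketch. Here `llm_rank` is a placeholder for the actual LLM call, assumed to return the window's passages reordered by relevance:

```python
def sliding_window_rerank(passages: list, llm_rank,
                          window: int = 20, stride: int = 10) -> list:
    """RankGPT-style sliding-window reranking (sketch).

    Windows are processed back-to-front so strong passages bubble toward
    the top, like one pass of bubble sort over overlapping blocks."""
    ranked = list(passages)
    start = max(len(ranked) - window, 0)
    while True:
        ranked[start:start + window] = llm_rank(ranked[start:start + window])
        if start == 0:
            break
        start = max(start - stride, 0)
    return ranked

# With a toy "LLM" that sorts numbers descending, the best passage always
# bubbles to the front because consecutive windows overlap
out = sliding_window_rerank([3, 9, 1, 7, 0, 8, 2, 6, 4, 5],
                            lambda w: sorted(w, reverse=True),
                            window=4, stride=2)
assert out[0] == 9
```

Overlap (window minus stride) is what lets a strong passage travel across window boundaries; with no overlap, a relevant passage at the bottom could never reach the top.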

2023–2025

The Modern Reranker Arms Race

The field exploded with purpose-built reranking models that combine the precision of cross-encoders with the scale of modern training pipelines:

BGE Reranker v2.5

BAAI. LLM-based reranker using Gemma/Llama backbone. Multilingual. SOTA on BEIR.

Cohere Rerank 3.5

Production API. 4096 token context. 100+ languages. Best-in-class latency/quality.

Jina Reranker v2

8K context window. Code-aware. Multilingual. Open weights.

RankLLaMA & RankZephyr

Open-source LLM rerankers distilled from GPT-4 rankings. Competitive with commercial APIs.

The throughline: 2000 → 2026

Twenty-five years. One insight refined relentlessly: first-stage retrieval optimizes for recall, second-stage reranking optimizes for precision.

2000–2010 · Features: Hand-crafted signals + gradient-boosted trees (RankNet, LambdaMART)
2019 · Cross-attention: BERT reads query and document together (Nogueira & Cho)
2020–2022 · Late interaction: Per-token matching without full cross-attention (ColBERT)
2023–now · Generative: LLMs reason about relevance in natural language (RankGPT, RankLLaMA)

The Problem: Bi-Encoders Are Fast But Imprecise

In Lesson 0.1, you learned about embedding models (bi-encoders). They encode queries and documents separately into fixed-size vectors, then compare with a dot product. This architectural choice — independent encoding — is both the source of their speed and the root of their limitation.

The Information Bottleneck

Bi-encoder process:

Query: "python memory leak"  → [0.2, 0.8, ...]
Doc:   "fixing memory issues" → [0.3, 0.7, ...]

Each text compressed to a single 768-dim vector. All information about the relationship between query and document is lost — only what each text means independently is preserved.

Why this fails on nuanced queries:

"python memory leak" vs "python memory management" get similar scores despite one being about problems, the other about concepts
Negation: "foods that do NOT contain gluten" matches documents about gluten
Multi-part queries lose specificity when compressed to one vector

Deep Dive: Bi-Encoder vs Cross-Encoder Architecture

The bi-encoder vs cross-encoder distinction is the fundamental architectural decision in modern retrieval. Everything else — model size, training data, fine-tuning strategy — is secondary to this choice.

Side-by-Side Architecture Comparison

Bi-Encoder

Query → Encoder → v_q
Doc → Encoder → v_d
score = dot(v_q, v_d)

O(1) scoring. Docs pre-encoded.

Cross-Encoder

[CLS] query [SEP] doc [SEP]
↓ 12 layers of cross-attention
score = classifier([CLS])

O(n) scoring per query. Full attention.

Late Interaction (ColBERT)

Query → Encoder → [q_1..q_m]
Doc → Encoder → [d_1..d_n]
score = Σ max_sim(q_i, D)

Token-level matching. Docs pre-encoded.

Bi-Encoder (Stage 1: Retrieval)

  + Sub-millisecond scoring (pre-computed embeddings + ANN index)
  + Scales to billions of documents with HNSW/IVF indices
  + Documents encoded once at ingestion time
  - Information bottleneck: entire document compressed to one vector
  - Cannot model fine-grained query-document interactions
  - Struggles with negation, multi-hop reasoning, long documents

Cross-Encoder (Stage 2: Reranking)

  + Full cross-attention: every query token sees every document token
  + Handles negation, specificity, multi-part queries
  + 5–15% NDCG@10 gain over bi-encoder alone on most benchmarks
  - O(n) inference: must process every candidate document per query
  - Cannot pre-compute: document representation depends on query
  - Typically limited to reranking top 50–200 candidates

Why the performance gap exists

A bi-encoder must compress all information about a 512-token document into a single 768-dimensional vector before it knows what the query will ask about. A cross-encoder sees both simultaneously and can attend to different parts of the document depending on the query.

Consider the query "side effects of aspirin for dogs." A document about aspirin that mentions canine use in one paragraph will score well with a cross-encoder (which attends directly to that paragraph) but may score poorly with a bi-encoder (which must represent the entire document, diluting the dog-specific signal). This is the information-theoretic argument for reranking: cross-encoders have strictly more information at scoring time.

Humeau, S. et al. (2020). Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring. ICLR.

Working Code: Three Approaches to Reranking

Here are production-ready implementations for the three most common reranking approaches.

1. Cross-Encoder with sentence-transformers

The simplest path. Open-source, runs locally, no API key needed.

# pip install sentence-transformers numpy
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

# Stage 1: Bi-encoder for initial retrieval
bi_encoder = SentenceTransformer('BAAI/bge-small-en-v1.5')

documents = [
    "Python memory leak debugging techniques",
    "JavaScript garbage collection explained",
    "How to profile memory usage in Python applications",
    "Understanding Python memory management internals",
    "Memory optimization strategies for large datasets",
    "Fixing out of memory errors in Python",
    "Python virtual memory and swap usage",
    "Common causes of memory leaks in web applications"
]

# Pre-compute document embeddings (done once at ingestion)
doc_embeddings = bi_encoder.encode(documents, normalize_embeddings=True)

# Query time
query = "how to find and fix python memory leaks"
query_embedding = bi_encoder.encode(query, normalize_embeddings=True)

# Stage 1: Fast retrieval via dot product
scores = np.dot(doc_embeddings, query_embedding)
top_k_indices = np.argsort(scores)[::-1][:5]

print("Stage 1 (Bi-encoder) ranking:")
for i, idx in enumerate(top_k_indices):
    print(f"  {i+1}. [{scores[idx]:.3f}] {documents[idx]}")

# Stage 2: Cross-encoder reranking
reranker = CrossEncoder('BAAI/bge-reranker-v2-m3')  # 568M params
candidates = [documents[i] for i in top_k_indices]
pairs = [[query, doc] for doc in candidates]
reranker_scores = reranker.predict(pairs)

# Sort by reranker scores
reranked = sorted(zip(candidates, reranker_scores), key=lambda x: x[1], reverse=True)

print("\nStage 2 (Cross-encoder) reranking:")
for i, (doc, score) in enumerate(reranked):
    print(f"  {i+1}. [{score:.3f}] {doc}")
Example output (exact scores vary by hardware and model version):
Stage 1 (Bi-encoder) ranking:
  1. [0.821] Python memory leak debugging techniques
  2. [0.789] Understanding Python memory management internals
  3. [0.756] How to profile memory usage in Python applications
  4. [0.734] Fixing out of memory errors in Python
  5. [0.698] Memory optimization strategies for large datasets

Stage 2 (Cross-encoder) reranking:
  1. [0.967] Python memory leak debugging techniques
  2. [0.912] Fixing out of memory errors in Python
  3. [0.845] How to profile memory usage in Python applications
  4. [0.621] Common causes of memory leaks in web applications
  5. [0.398] Understanding Python memory management internals

The cross-encoder promotes "Fixing out of memory errors" from #4 to #2 because it can reason about the action implied by the query ("find and fix"). It also demotes "Understanding...internals" because that document is conceptual, not actionable.

2. Cohere Rerank API

Production-grade API with multilingual support, long-context handling, and managed infrastructure.

# pip install cohere
import cohere

co = cohere.ClientV2("your-api-key")

query = "how to find and fix python memory leaks"
documents = [
    "Python memory leak debugging techniques using tracemalloc",
    "JavaScript garbage collection explained: V8 engine internals",
    "How to profile memory usage in Python applications with memory_profiler",
    "Fixing out of memory errors in Python: practical solutions",
    "Understanding Python memory management: reference counting and gc module"
]

response = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=documents,
    top_n=3               # Return only the top 3
)

for result in response.results:
    # The v2 rerank response carries (index, relevance_score);
    # look the document text up by index
    print(f"  [{result.relevance_score:.3f}] (idx={result.index}) {documents[result.index]}")

# Example output (exact scores vary by model version):
#   [0.982] (idx=0) Python memory leak debugging techniques using tracemalloc
#   [0.934] (idx=3) Fixing out of memory errors in Python: practical solutions
#   [0.891] (idx=2) How to profile memory usage in Python applications...

3. ColBERT Late Interaction with RAGatouille

ColBERT-style late interaction via the RAGatouille library. Pre-computes per-token embeddings for fast retrieval with richer matching than single-vector bi-encoders.

# pip install ragatouille
from ragatouille import RAGPretrainedModel

# Load ColBERTv2
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Index documents (computes per-token embeddings)
documents = [
    "Python memory leak debugging techniques using tracemalloc and objgraph",
    "JavaScript garbage collection explained: V8 engine internals",
    "How to profile memory usage in Python with memory_profiler and pympler",
    "Fixing out of memory errors in Python: practical debugging guide",
    "Understanding CPython memory management: reference counting, gc module"
]

RAG.index(
    collection=documents,
    index_name="memory_debugging",
    max_document_length=256,
    split_documents=True
)

# Search with late interaction (MaxSim scoring)
results = RAG.search(query="python memory leak debugging", k=3)

for r in results:
    print(f"  [{r['score']:.1f}] {r['content'][:80]}...")

# ColBERT's per-token matching catches "debugging" <-> "debugging" directly
# while also matching "leak" with contextually similar tokens in each doc

Benchmarks: Reranking Impact Across Datasets

Reranking consistently improves retrieval quality, but the magnitude varies dramatically by dataset and query type. The BEIR benchmark (Thakur et al., 2021) is the standard evaluation suite, testing zero-shot generalization across 18 diverse datasets.

MS MARCO Passage Ranking (NDCG@10)

The most widely-used benchmark for passage retrieval and reranking.

| Pipeline | NDCG@10 | MRR@10 | Latency (p50) |
|---|---|---|---|
| BM25 only | 0.228 | 0.187 | 5ms |
| BGE-base bi-encoder | 0.343 | 0.294 | 15ms |
| ColBERTv2 (late interaction) | 0.397 | 0.349 | 40ms |
| BGE bi-encoder + MiniLM reranker | 0.389 | 0.335 | 80ms |
| BGE bi-encoder + BGE-reranker-v2-m3 | 0.425 | 0.378 | 120ms |
| Hybrid (BM25+vector) + Cohere Rerank 3.5 | 0.441 | 0.392 | 180ms |
| BM25 + RankGPT (GPT-4) | 0.459 | 0.412 | 3–8s |

Sources: MTEB leaderboard, Nogueira & Cho (2019), Sun et al. (2023). Latencies measured on A100 GPU for local models, API round-trip for hosted services. Top-100 candidates reranked to top-10.

BEIR Zero-Shot Reranking (Average NDCG@10 across 13 datasets)

Zero-shot generalization: models tested on datasets they were not trained on. This measures real-world transfer.

| Reranker | Avg NDCG@10 | Params | Open Source |
|---|---|---|---|
| No reranker (BM25 baseline) | 0.440 | n/a | n/a |
| cross-encoder/ms-marco-MiniLM-L-6-v2 | 0.492 | 22M | Yes |
| BAAI/bge-reranker-large | 0.518 | 560M | Yes |
| BAAI/bge-reranker-v2-m3 | 0.537 | 568M | Yes |
| Jina Reranker v2 | 0.529 | 278M | Yes |
| Cohere Rerank 3.5 | 0.548 | Undisclosed | No (API) |
| RankGPT (GPT-4o) | 0.556 | >100B | No (API) |

Thakur, N. et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS Datasets and Benchmarks.

Key Insight

The best cost-effective production pipeline in 2026 is: hybrid search (BM25 + vector) with RRF fusion, followed by a cross-encoder reranker. This three-stage approach — BM25 || vector → RRF fusion → rerank — achieves 90%+ of GPT-4 reranking quality at 1/1000th the cost and 50x lower latency.

The marginal improvement from LLM-based reranking (RankGPT) is real but small (+1–3% NDCG) and comes with 100x latency and cost penalty. Reserve it for offline evaluation and training data generation.

When Reranking Helps Most (and When to Skip It)

Reranking adds 50–200ms of latency and either compute cost (local models) or API cost (hosted services). The decision to include it should be empirical, not dogmatic.

Reranking Adds High Value When:

1. RAG with LLM generation. Better context = better LLM outputs. The reranking cost is tiny vs. LLM inference cost. This is the #1 use case.

2. Long or heterogeneous documents. Bi-encoders struggle to compress long documents into one vector. Cross-encoders attend to specific passages.

3. Complex, multi-part queries. Queries with conditions, negation, or multiple constraints ("papers on X but not Y from after 2020").

4. High-stakes applications. Legal, medical, financial search where precision directly impacts outcomes.

5. Cross-lingual retrieval. Rerankers like Cohere and Jina handle 100+ languages, improving multilingual matching significantly.

Skip Reranking When:

1. Latency budget < 50ms. Search-as-you-type, real-time autocomplete, gaming. The reranker alone takes 50ms+.

2. Simple keyword-like queries. Product catalog search, navigation queries. BM25 or a bi-encoder is sufficient.

3. Bi-encoder precision is already high. Measure first. If the top-5 bi-encoder results are already correct for your queries, reranking adds cost without value.

4. Budget-constrained at extreme scale. 10M+ queries/day with tight margins. A 22M-param MiniLM reranker is cheap, but at massive scale even that adds up.

5. Recall is the bottleneck, not precision. If relevant documents are not in your top-100 candidates, reranking cannot help. Fix retrieval first.
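That last point is measurable before building anything. A quick recall@k check (a sketch, not from any particular library):

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str],
                k: int = 100) -> float:
    """Fraction of the relevant documents that survive into the top-k
    candidate set. If this is low, the reranker has nothing to work with:
    fix first-stage retrieval before spending on reranking."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

# Only 1 of 3 relevant docs made it into the candidates, so even a
# perfect reranker is capped at a third of the ideal result set
print(recall_at_k(["d1", "d9", "d4"], {"d1", "d2", "d3"}, k=100))  # ~0.33
```

Averaged over a labeled query set, this number tells you whether the next dollar is better spent on retrieval or on reranking.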

Production Patterns

Battle-tested patterns for deploying reranking in production systems.

Pattern 1: Retrieve 100, Rerank to 10

The standard approach. Bi-encoder casts a wide net for recall, cross-encoder narrows for precision. The ratio matters: reranking top-20 misses good candidates; reranking top-500 wastes compute. Top-100 is the empirical sweet spot for most datasets.

# Stage 1: Get top 100 candidates (fast, ~15ms)
candidates = vector_search(query, k=100)

# Stage 2: Rerank to top 10 (slower, ~100ms for 100 pairs)
pairs = [(query, doc.text) for doc in candidates]
scores = reranker.predict(pairs)

# Return top 10 by reranker score
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
return reranked[:10]

Pattern 2: Conditional Reranking

Only rerank when the query is complex or initial results show low confidence. Reduces compute by 40–60% with minimal quality loss.

import numpy as np

def should_rerank(query: str, initial_scores: list[float]) -> bool:
    """Decide whether reranking is worth the cost for this query."""
    if not initial_scores:
        return True  # No scores to inspect; let the reranker try
    # Complex query: multiple clauses or conditions
    if len(query.split()) > 8:
        return True
    # Low confidence: top result score is not clearly dominant
    if len(initial_scores) >= 2:
        gap = initial_scores[0] - initial_scores[1]
        if gap < 0.05:  # Top two results too close: ambiguous
            return True
    # Low absolute confidence
    if initial_scores[0] < 0.65:
        return True
    return False

# Usage in pipeline
candidates, scores = vector_search(query, k=100)
if should_rerank(query, scores[:10]):
    return rerank(query, candidates[:100])
else:
    return candidates[:10]  # Bi-encoder results good enough

Pattern 3: Score Calibration and Thresholding

Cross-encoder scores are uncalibrated logits, not probabilities. Normalize them before using score thresholds to filter irrelevant results.

from scipy.special import expit  # sigmoid

def calibrated_rerank(query: str, candidates: list[str], reranker) -> list:
    """Rerank with calibrated scores and relevance threshold."""
    pairs = [[query, doc] for doc in candidates]
    raw_scores = reranker.predict(pairs)

    # Sigmoid converts logits to 0-1 probabilities
    calibrated = expit(raw_scores)

    # Filter: only return documents above relevance threshold
    threshold = 0.5
    results = [
        (doc, score) for doc, score in zip(candidates, calibrated)
        if score >= threshold
    ]

    return sorted(results, key=lambda x: x[1], reverse=True)

Pattern 4: Multi-Stage Cascade

For maximum quality: BM25 + vector retrieval, fuse with RRF, then rerank. This is the architecture behind most production RAG systems in 2026.

def three_stage_retrieval(query: str, k: int = 10):
    # Stage 1a: BM25 retrieval (lexical)
    bm25_results = bm25_search(query, k=100)

    # Stage 1b: Vector retrieval (semantic) — runs in parallel
    vector_results = vector_search(query, k=100)

    # Stage 2: Reciprocal Rank Fusion
    fused = reciprocal_rank_fusion(
        [bm25_results, vector_results],
        k=60  # RRF constant
    )
    candidates = fused[:100]

    # Stage 3: Cross-encoder reranking
    reranked = rerank(query, candidates)
    return reranked[:k]
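The cascade above leans on `reciprocal_rank_fusion`, which is small enough to show in full. This sketch assumes each result list is a ranked list of document ids:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]],
                           k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc earns 1 / (k + rank) from every list
    it appears in, so agreement across retrievers dominates any single
    retriever's opinion. k=60 is the conventional RRF constant."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" tops both the lexical and the semantic list, so it wins even
# though "a" and "c" also appear in both
fused = reciprocal_rank_fusion([["b", "a", "c"], ["b", "c", "a"]])
assert fused[0] == "b"
```

RRF works on ranks, not raw scores, so it needs no calibration between BM25 scores and cosine similarities, which is why it is the default fusion choice in this pattern.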

Reranker Model Guide (March 2026)

Choosing a reranker involves balancing quality, latency, cost, and deployment complexity.

BAAI/bge-reranker-v2-m3

Best open-source all-rounder. Multilingual (100+ languages). LLM-based architecture with cross-encoder scoring. The default recommendation for teams that want to self-host.

568M params · BEIR avg 0.537 · ~50ms/pair (A100) · free, open source

Cohere Rerank 3.5

Production-grade API. Best quality-to-latency ratio for teams that want zero infrastructure. 4096-token context. Semi-structured document support. Battle-tested at scale.

Hosted API · BEIR avg ~0.548 · ~80ms latency · $2 per 1,000 searches

cross-encoder/ms-marco-MiniLM-L-6-v2

The lightweight workhorse. Only 22M parameters; runs on CPU in production. Quality is lower than larger models but latency is excellent. Best choice for cost-sensitive deployments or when a GPU is not available.

22M params · BEIR avg 0.492 · ~8ms/pair (GPU), ~25ms (CPU) · free, open source

Jina Reranker v2 Base Multilingual

Multilingual reranker with an 8K context window and code-awareness. Open weights. Strong on technical/code search tasks where other rerankers underperform.

278M params · BEIR avg 0.529 · multilingual + code · 8K context

ColBERTv2

Late interaction model; not a traditional reranker but an alternative architecture. Per-token embeddings enable richer matching than bi-encoders while maintaining pre-computation. Best when you need bi-encoder-like latency with cross-encoder-like quality.

110M params · MS MARCO 0.397 · requires a special index (PLAID) · free, open source
Common Mistakes

Five Reranking Anti-Patterns

1. Reranking too few candidates

If you retrieve top-10 and rerank top-10, the reranker can only reorder what you already have. It cannot surface documents that were missed. Retrieve at least 5–10x your final k. Want 10 results? Retrieve 100. Nogueira & Cho (2019) showed that reranking top-1000 gives the best quality, but top-100 captures most of the gain.

2. Treating reranker scores as probabilities

Most cross-encoders output raw logits, not calibrated probabilities. A score of 3.2 does not mean "95% relevant." If you need thresholds, apply sigmoid calibration. If you need to compare scores across queries, normalize per-query.

3. Not chunking long documents before reranking

Most cross-encoders have 512-token context windows. Feeding a 5000-word document will silently truncate it. Chunk documents to passage-length (256–512 tokens) before reranking, then take the max score per document if you need document-level ranking.
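The chunk-then-max fix can be sketched as a thin wrapper; `chunk_fn` and `predict` below are placeholders for your splitter and cross-encoder, not any specific library API:

```python
def rerank_long_docs(query: str, docs: list[str],
                     chunk_fn, predict) -> list[tuple[str, float]]:
    """Score each document as the max over its passage-level chunk
    scores, so no text is silently truncated by the reranker's
    context window."""
    scored = []
    for doc in docs:
        chunks = chunk_fn(doc)
        chunk_scores = predict([[query, chunk] for chunk in chunks])
        scored.append((doc, max(chunk_scores)))
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Toy stand-ins: split on sentences, score by word overlap with the query
toy_chunk = lambda d: d.split(". ")
toy_predict = lambda pairs: [len(set(q.lower().split()) & set(c.lower().split()))
                             for q, c in pairs]
ranked = rerank_long_docs(
    "python memory leak",
    ["Gardening tips. Python memory leak hunting", "Rust ownership"],
    toy_chunk, toy_predict)
# The doc whose *best chunk* matches the query ranks first
```

Max-pooling over chunks is the common default; mean-pooling is an alternative when you want documents that are relevant throughout to outrank documents with one good passage.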

4. Using a reranker trained on a different domain

MS MARCO is web search queries. If your use case is legal document retrieval, biomedical search, or code search, a general MS MARCO reranker may underperform. Fine-tune on domain data, or at minimum evaluate on a domain-representative test set before deploying.

5. Skipping A/B testing

Offline metrics (NDCG, MRR) correlate with but do not guarantee online improvements. A reranker that improves NDCG@10 by 5% might not change user behavior if the original top-3 results were already good enough. Always A/B test reranking with real users before committing to the added infrastructure.

Key Takeaways

1. The bi-encoder bottleneck is real. Compressing a document to one vector before seeing the query loses critical information. Cross-encoders recover it through full cross-attention.

2. Two-stage retrieval is not optional for production RAG. Retrieve top 100 with a bi-encoder, rerank to top 10 with a cross-encoder. The 5–15% NDCG improvement translates directly to better LLM outputs.

3. ColBERT offers a middle ground. Late interaction (per-token matching) gives cross-encoder-like quality with bi-encoder-like pre-computation. Use it when latency budgets are tight.

4. Start with BGE-reranker-v2-m3 or Cohere Rerank. Open-source: bge-reranker-v2-m3 (best BEIR score, multilingual). API: Cohere Rerank 3.5 (best quality-to-latency, zero infrastructure). Scale down to MiniLM if cost is a constraint.

5. Measure before deploying. Reranking is not universally beneficial. If your bi-encoder already produces good top-5 results for your query distribution, the added latency and cost may not be justified. Build an evaluation set. Measure NDCG@10. Decide with data.
