Cross-Encoder Reranking
Re-score retrieved passages with a cross-encoder to boost search precision.
How Text Reranking Works
Why fast retrieval is not enough, and how reranking fixes the precision problem.
Vector search (bi-encoders) is fast because it pre-computes document embeddings. But this speed comes at a cost: the model never sees your query and document together. It can only compare their embeddings, missing nuanced semantic matches.
Cross-encoders see the query and document together in one forward pass. This allows the model to understand the relationship between them directly, catching subtle relevance signals that embeddings miss.
Bi-Encoder vs Cross-Encoder: The Architecture Difference
These two architectures make fundamentally different tradeoffs. Understanding when to use each is key.
- Bi-encoder: O(1) per query (after indexing)
- Cross-encoder: O(n) per query, where n is the number of documents to score
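To make the difference concrete, here is a minimal sketch using the sentence-transformers library; the model names are illustrative stand-ins, and any bi-encoder / cross-encoder checkpoints would work. The bi-encoder compares independently computed embeddings, while the cross-encoder scores each query-document pair in a single joint forward pass.

```python
# Minimal sketch: bi-encoder vs cross-encoder scoring with sentence-transformers.
# Model names are illustrative; any bi-encoder / cross-encoder checkpoints work.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "How do I reset my password?"
docs = [
    "To change your password, open Settings and choose Security.",
    "Our billing team sends invoices on the first of each month.",
]

# Bi-encoder: embed query and docs separately, then compare the vectors.
bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
q_emb = bi_encoder.encode(query, convert_to_tensor=True)
d_emb = bi_encoder.encode(docs, convert_to_tensor=True)
bi_scores = util.cos_sim(q_emb, d_emb)[0]  # similarity of pre-computed embeddings

# Cross-encoder: each (query, doc) pair goes through the model together.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
cross_scores = cross_encoder.predict([(query, d) for d in docs])

print("bi-encoder:   ", [round(float(s), 3) for s in bi_scores])
print("cross-encoder:", [round(float(s), 3) for s in cross_scores])
```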
The Two-Stage Retrieval Pipeline
Reranking is not a replacement for vector search; it is a refinement layer on top. A minimal code sketch of the full pipeline follows the stage breakdown below.
Stage 1: Retrieval
- Use a bi-encoder (e.g., E5, BGE, OpenAI embeddings)
- Fetch the top 50-200 candidates from the vector DB
- Latency: ~10-50 ms for millions of docs
- Goal: high recall (don't miss relevant docs)

Stage 2: Reranking
- Use a cross-encoder on the retrieved candidates
- Score each query-doc pair independently
- Latency: ~20-100 ms for 50 docs
- Goal: high precision (rank the best docs first)

Stage 3: Output
- Take the top-k reranked results (k = 3-10)
- Pass to the LLM as context for RAG
- Or display directly to the user in search
- Total latency: ~50-200 ms end-to-end
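Here is a minimal sketch of the full retrieve-then-rerank pipeline, assuming the sentence-transformers library; the tiny in-memory corpus and model names are placeholders standing in for a real vector DB and your chosen checkpoints.

```python
# Sketch of a two-stage retrieve-then-rerank pipeline (sentence-transformers).
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Tiny in-memory "corpus" standing in for a real vector DB.
corpus = [
    "Reset your password from the Security tab in Settings.",
    "Invoices are emailed on the first of every month.",
    "Password reset links expire after 24 hours.",
]
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)  # done offline at index time

def search(query: str, retrieve_n: int = 100, top_k: int = 3) -> list[str]:
    # Stage 1: fast, high-recall candidate retrieval with the bi-encoder.
    q_emb = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=min(retrieve_n, len(corpus)))[0]
    candidates = [corpus[h["corpus_id"]] for h in hits]

    # Stage 2: precise rescoring of every (query, candidate) pair with the cross-encoder.
    scores = reranker.predict([(query, doc) for doc in candidates])

    # Stage 3: keep the top-k reranked results for the LLM prompt or the search page.
    ranked = sorted(zip(scores, candidates), reverse=True, key=lambda pair: pair[0])
    return [doc for _, doc in ranked[:top_k]]

print(search("How do I reset my password?"))
```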
Score Calibration: Making Scores Meaningful
Raw model scores are not probabilities. Calibration transforms them into interpretable relevance scores.
Different models use different score scales. A score of 0.7 might be excellent for one model but mediocre for another. Cross-encoders often output logits that need sigmoid normalization.
Common calibration formulas:
- Sigmoid: score = 1 / (1 + exp(-logit))
- Temperature-scaled softmax: score = softmax(logits / temperature)
- Min-max normalization: score = (x - min) / (max - min)

After calibration, you can set meaningful thresholds. A common approach is to evaluate on a labeled dataset and find the threshold that maximizes F1 score or achieves the desired precision/recall.
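As an illustration, here is a small sketch of sigmoid calibration followed by a brute-force threshold sweep that maximizes F1; the logits and labels are made-up example values, not benchmark data.

```python
# Sketch: calibrate raw reranker logits with a sigmoid, then pick a threshold
# on a labeled validation set by maximizing F1. Data below is invented.
import numpy as np

def sigmoid(logits: np.ndarray) -> np.ndarray:
    # score = 1 / (1 + exp(-logit))
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical validation set: raw logits plus 0/1 relevance labels.
logits = np.array([4.2, 1.1, -0.3, -2.8, 3.0, 0.2])
labels = np.array([1, 1, 0, 0, 1, 0])

scores = sigmoid(logits)  # now comparable in [0, 1]

def f1_at(threshold: float) -> float:
    preds = scores >= threshold
    tp = np.sum(preds & (labels == 1))
    precision = tp / max(preds.sum(), 1)
    recall = tp / max((labels == 1).sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)

best = max(np.linspace(0.05, 0.95, 91), key=f1_at)
print(f"best threshold ~ {best:.2f}, F1 = {f1_at(best):.2f}")
```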
Reranking Methods Compared
From managed APIs to self-hosted models, choose based on your latency, accuracy, and cost requirements.
Common Pitfalls
The Complete Picture
Text reranking solves the precision problem in retrieval systems. Bi-encoders are fast but approximate. Cross-encoders are slow but precise. By using both in a two-stage pipeline, you get the best of both worlds: sub-second latency with state-of-the-art relevance ranking. Whether you use Cohere's API, BGE Reranker, or a fine-tuned cross-encoder, the key insight remains the same: joint attention between query and document captures semantic relationships that embedding comparison cannot.
Use Cases
- ✓ RAG retrieval quality
- ✓ E-commerce search
- ✓ Legal/medical search
- ✓ Recommendations
Architectural Patterns
Bi-encoder + Cross-encoder
Retrieve many with dense vectors, rerank top-k with cross-encoder.
Late Interaction
Efficient token-level scoring (ColBERT-style).
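As a rough illustration of the late-interaction idea, the sketch below computes a ColBERT-style MaxSim score from per-token embeddings; the random vectors stand in for real encoder output, which is out of scope here.

```python
# Sketch of ColBERT-style late interaction (MaxSim) over token embeddings.
import numpy as np

def maxsim_score(query_tok_emb: np.ndarray, doc_tok_emb: np.ndarray) -> float:
    """query_tok_emb: (q_len, dim); doc_tok_emb: (d_len, dim); rows L2-normalized."""
    sim = query_tok_emb @ doc_tok_emb.T   # (q_len, d_len) token-to-token similarities
    return float(sim.max(axis=1).sum())   # each query token keeps its best document match

# Toy example: random vectors stand in for real per-token encoder output.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(12, 8))
d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim_score(q, d))
```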
Implementations
API Services
Cohere Rerank
Cohere: state-of-the-art reranker for English and multilingual text.
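Below is a hedged sketch of calling a managed reranking API via Cohere's Python SDK; the API key is a placeholder and the model name is an assumption, so check Cohere's documentation for current model names and parameters.

```python
# Sketch: managed reranking via Cohere's rerank endpoint (Python SDK).
# The API key is a placeholder; the model name is an assumption.
import cohere

co = cohere.Client("YOUR_API_KEY")

docs = [
    "Reset your password from the Security tab.",
    "Invoices are emailed monthly.",
    "Password resets expire after 24 hours.",
]

response = co.rerank(
    model="rerank-english-v3.0",
    query="How do I reset my password?",
    documents=docs,
    top_n=2,
)

for result in response.results:
    print(result.index, round(result.relevance_score, 3), docs[result.index])
```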
Open Source
Benchmarks
Quick Facts
- Input: Text
- Output: Structured Data
- Implementations: 2 open source, 1 API
- Patterns: 2 approaches