Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed first by ColBERT (2020), which introduced late interaction, and then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general-purpose LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer: can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.
Text ranking orders documents by relevance to a query, underpinning search engines, RAG pipelines, and recommendation systems. Cross-encoders (such as models fine-tuned on MS MARCO) provide the best accuracy, while bi-encoders and late-interaction models (ColBERT) enable scalable retrieval. The modern stack is bi-encoder retrieval + cross-encoder reranking.
History
Learning to Rank (LTR) with LambdaMART and gradient-boosted trees becomes the industry standard for web search
MS MARCO passage ranking dataset released — 8.8M passages, 1M queries; becomes THE ranking benchmark
BERT for passage re-ranking (Nogueira & Cho) shows transformers dramatically outperform traditional LTR features
ColBERT (Khattab & Zaharia) introduces late interaction — fast retrieval with token-level matching
DPR (Karpukhin et al.) shows dense bi-encoders can replace BM25 for first-stage retrieval
E5 and Contriever demonstrate that contrastive pretraining produces strong general-purpose retrievers
RankGPT shows LLMs can rerank listwise via prompting, comparing and ordering passages directly; RankLLaMA fine-tunes LLaMA as a pointwise reranker
ColBERT-v2, SPLADE++, and hybrid sparse-dense retrieval mature for production; BGE-reranker-v2 tops reranking benchmarks
BEIR benchmark drives focus on zero-shot ranking across diverse domains; NV-RerankQA and Jina-reranker-v2 push accuracy further
How Text Ranking Works
Query encoding
The user query is encoded into a dense vector (bi-encoder) or processed jointly with each candidate (cross-encoder)
First-stage retrieval
BM25 (sparse) or a dense bi-encoder retrieves top 100-1000 candidate passages from the full corpus via ANN search
Reranking
A cross-encoder scores each (query, passage) pair jointly, attending to fine-grained interactions between tokens
Score fusion (optional)
Hybrid search combines the BM25 and dense rankings using reciprocal rank fusion (RRF) for better recall
Result ordering
Final passages are ordered by reranker score and returned with relevance scores for downstream consumption
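The steps above can be sketched end to end. This is a toy, self-contained illustration: the "bi-encoder" is a deterministic hashed bag-of-words embedding and the "reranker" is a token-overlap stand-in, not real models; in production the stages would be a trained bi-encoder (e.g. E5) behind an ANN index and a trained cross-encoder.

```python
import zlib
import numpy as np

def embed(text, dim=64):
    """Stand-in bi-encoder: mean of per-token hashed random vectors, unit-normalized."""
    vecs = []
    for tok in text.lower().split():
        rng = np.random.default_rng(zlib.crc32(tok.encode()))  # deterministic per token
        vecs.append(rng.standard_normal(dim))
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)

def retrieve(query, corpus, k=3):
    """First stage: cosine similarity between query and document embeddings."""
    q = embed(query)
    doc_vecs = np.stack([embed(d) for d in corpus])
    scores = doc_vecs @ q                      # cosine, since rows are unit-norm
    order = np.argsort(-scores)[:k]
    return [(corpus[i], float(scores[i])) for i in order]

def rerank(query, candidates):
    """Second stage stand-in: score by token overlap. A real cross-encoder
    would jointly attend over each (query, passage) pair instead."""
    q_toks = set(query.lower().split())
    scored = [(doc, len(q_toks & set(doc.lower().split()))) for doc, _ in candidates]
    return sorted(scored, key=lambda x: -x[1])

corpus = [
    "bm25 is a sparse lexical retrieval model",
    "cross encoders jointly encode query and passage",
    "dense bi encoders map text to a single vector",
]
query = "how do cross encoders score a query and passage"
top = retrieve(query, corpus, k=3)   # stage 1: candidate set
ranked = rerank(query, top)          # stage 2: final ordering
print(ranked[0][0])
```

The shape is what matters: stage 1 touches the whole corpus with a cheap score, stage 2 spends expensive per-pair computation only on the shortlist.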
Current Landscape
Text ranking in 2025 follows a clear two-stage paradigm: fast retrieval (BM25, dense bi-encoder, or hybrid) followed by accurate reranking (cross-encoder or LLM). MS MARCO remains the training data backbone, but BEIR has revealed that domain-specific evaluation matters — models that excel on web search may fail on scientific or legal ranking. The rise of RAG has made ranking quality directly visible to end users: bad retrieval = bad LLM answers. ColBERT-style late interaction is emerging as the practical middle ground between speed and quality.
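That "middle ground" of late interaction comes down to the MaxSim operator: documents are stored as per-token vectors, and a query scores a document by summing, over its own tokens, the best cosine match among the document's tokens. A minimal sketch with random stand-in token embeddings (real ColBERT uses contextualized BERT token vectors):

```python
import numpy as np

def maxsim_score(q_tokens, d_tokens):
    """ColBERT-style MaxSim: for each query token vector, take its best
    cosine similarity against all document token vectors, then sum.
    Both inputs are (num_tokens, dim) arrays with unit-norm rows."""
    sim = q_tokens @ d_tokens.T          # token-by-token similarity matrix
    return float(sim.max(axis=1).sum())  # best doc-token match per query token

def unit_rows(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
q = unit_rows(rng.standard_normal((4, 8)))                     # 4 query tokens
doc_hit = unit_rows(np.vstack([q, rng.standard_normal((6, 8))]))  # contains the query tokens
doc_miss = unit_rows(rng.standard_normal((10, 8)))                # unrelated tokens

print(round(maxsim_score(q, doc_hit), 2))   # 4.0: each query token matches its copy exactly
print(maxsim_score(q, doc_hit) > maxsim_score(q, doc_miss))
```

Because document token vectors are precomputed offline, query time needs only this cheap matrix product per candidate, which is why late interaction sits between bi-encoder speed and cross-encoder accuracy.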
Key Challenges
Cross-encoder reranking is expensive — scoring 1,000 passages per query with a 400M-parameter model takes roughly a second on a GPU
Zero-shot domain transfer: models trained on MS MARCO (web search) degrade on biomedical, legal, and scientific queries
Evaluation beyond MS MARCO: BEIR's diverse domains reveal that no single model dominates across all verticals
Query-document length mismatch: short queries vs. long documents require asymmetric encoding strategies
Freshness: keeping dense indices updated as documents change is operationally complex
Quick Recommendations
Best reranking accuracy
BGE-reranker-v2-m3 or Jina-reranker-v2-base-multilingual
Top cross-encoder accuracy on MS MARCO and BEIR; multilingual support
Scalable first-stage retrieval
E5-large-v2 or Nomic-embed-text-v1.5
Strong bi-encoder with fast ANN lookup; pairs well with any reranker
Best of both worlds
ColBERT-v2 (late interaction)
Token-level matching with precomputed document embeddings; faster than cross-encoder, more accurate than bi-encoder
LLM-based reranking
GPT-4o or Claude 3.5 Sonnet (listwise)
Listwise reranking in a single prompt; best for complex or nuanced relevance judgments
Hybrid sparse-dense
BM25 + E5 with RRF fusion
BM25 catches exact keyword matches that dense models miss; fusion consistently improves recall
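RRF itself is only a few lines: each system's ranked list contributes 1/(k + rank) per document, with k = 60 the commonly used constant from the original RRF formulation. A minimal sketch (the doc ids and the two input rankings below are made up for illustration):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: rankings is a list of ranked doc-id lists
    (best first); returns a single fused ranking. Uses only ranks, so
    BM25 and dense scores never need to be calibrated against each other."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7", "d2"]   # lexical ranking
dense_top = ["d1", "d5", "d3", "d9"]  # dense bi-encoder ranking
fused = rrf_fuse([bm25_top, dense_top])
print(fused)
```

Note that d1 wins even though neither system ranked it uniformly first: agreement across systems is exactly what RRF rewards, which is why fusion tends to improve recall over either ranking alone.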
What's Next
The frontier is learned sparse retrieval (SPLADE) that combines the interpretability of keyword search with neural relevance, multi-vector representations that go beyond single-vector bi-encoders, and end-to-end ranking models that jointly optimize retrieval and generation for RAG. Expect ranking to merge with LLM inference — models that retrieve, reason, and generate in a single forward pass.
Benchmarks & SOTA
Related Tasks
Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Reading Comprehension
Understanding and answering questions about passages.
Table Question Answering
Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.
Question Answering
Extractive and abstractive question answering is one of the oldest NLP benchmarks, from the original SQuAD (2016) to the adversarial complexity of Natural Questions and TriviaQA. Human parity on SQuAD 2.0 was claimed by ALBERT in 2020, effectively saturating the benchmark — but real-world QA over noisy documents, multi-hop reasoning (HotpotQA, MuSiQue), and long-context grounding remain far from solved. The paradigm has shifted from standalone QA models to retrieval-augmented generation (RAG), where the bottleneck moved from answer extraction to retrieval quality. Modern systems like Perplexity and Google's AI Overviews show that production QA is now an end-to-end pipeline problem, not a single-model benchmark.