
Text Ranking

Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed first by ColBERT (2020), which introduced late interaction, and then by instruction-tuned embedding models such as E5-Mistral and GTE-Qwen that turned general-purpose LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer: can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.


Text ranking orders documents by relevance to a query, underpinning search engines, RAG pipelines, and recommendation systems. Cross-encoders (like ms-marco models) provide the best accuracy, while bi-encoders and late-interaction models (ColBERT) enable scalable retrieval. The modern stack is bi-encoder retrieval + cross-encoder reranking.

History

2009

Learning to Rank (LTR) with LambdaMART and gradient-boosted trees becomes the industry standard for web search

2016

MS MARCO passage ranking dataset released — 8.8M passages, 1M queries; becomes the de facto ranking benchmark

2019

BERT for passage re-ranking (Nogueira & Cho) shows transformers dramatically outperform traditional LTR features

2020

ColBERT (Khattab & Zaharia) introduces late interaction — fast retrieval with token-level matching

2020

DPR (Karpukhin et al.) shows dense bi-encoders can replace BM25 for first-stage retrieval

2022

E5 and Contriever demonstrate that contrastive pretraining produces strong general-purpose retrievers

2023

RankGPT and RankLLaMA use LLMs for listwise reranking — LLMs compare and order passages directly

2024

ColBERT-v2, SPLADE++, and hybrid sparse-dense retrieval mature for production; BGE-reranker-v2 tops reranking benchmarks

2025

BEIR benchmark drives focus on zero-shot ranking across diverse domains; NV-RerankQA and Jina-reranker-v2 push accuracy further

How Text Ranking Works

Text Ranking Pipeline

1. Query encoding: the user query is encoded into a dense vector (bi-encoder) or processed jointly with each candidate (cross-encoder).

2. First-stage retrieval: BM25 (sparse) or a dense bi-encoder retrieves the top 100-1000 candidate passages from the full corpus (inverted index for BM25, ANN search for dense vectors).

3. Reranking: a cross-encoder scores each (query, passage) pair jointly, attending to fine-grained interactions between tokens.

4. Score fusion (optional): hybrid search combines BM25 and dense scores using reciprocal rank fusion (RRF) for better recall.

5. Result ordering: final passages are ordered by reranker score and returned with relevance scores for downstream consumption.
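The pipeline above can be sketched end-to-end as a toy two-stage system. This is a minimal illustration, not a production setup: stage 1 is a from-scratch Okapi BM25 over a three-document toy corpus, and stage 2 uses a trivial term-overlap function as a stand-in where a real system would call a neural cross-encoder (e.g. a BGE or ms-marco reranker).

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenized document for a tokenized query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in set(query_terms):
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue  # term appears nowhere in the corpus
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        freq = tf[term]
        norm = freq + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * freq * (k1 + 1) / norm
    return score

def rerank(query_terms, candidates):
    """Stand-in 'cross-encoder': scores each (query, passage) pair by
    query-term coverage. A real system would run a neural model here."""
    def score(doc):
        return sum(t in doc for t in query_terms) / len(query_terms)
    return sorted(candidates, key=score, reverse=True)

corpus = [
    "colbert uses late interaction for ranking".split(),
    "bm25 is a sparse retrieval baseline".split(),
    "dense bi encoders retrieve with ann search".split(),
]
query = "late interaction ranking".split()

# Stage 1: retrieve the top-2 candidates by BM25.
top2 = sorted(corpus, key=lambda d: bm25_score(query, d, corpus), reverse=True)[:2]
# Stage 2: rerank only those candidates, then return the ordering.
ranked = rerank(query, top2)
```

The key design point is the asymmetry: the cheap first stage touches the whole corpus, while the expensive scorer only ever sees the short candidate list.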

Current Landscape

Text ranking in 2025 follows a clear two-stage paradigm: fast retrieval (BM25, dense bi-encoder, or hybrid) followed by accurate reranking (cross-encoder or LLM). MS MARCO remains the training data backbone, but BEIR has revealed that domain-specific evaluation matters — models that excel on web search may fail on scientific or legal ranking. The rise of RAG has made ranking quality directly visible to end users: bad retrieval = bad LLM answers. ColBERT-style late interaction is emerging as the practical middle ground between speed and quality.
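The late-interaction scoring that makes ColBERT a middle ground reduces to a MaxSim operation: each query token embedding takes its maximum similarity over the (precomputed) document token embeddings, and the per-token maxima are summed. A minimal NumPy sketch, with random vectors standing in for real token embeddings:

```python
import numpy as np

def maxsim(query_vecs, doc_vecs):
    """ColBERT-style late interaction: for each query token embedding,
    take its max cosine similarity over all document token embeddings,
    then sum over query tokens."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T                    # (num_query_tokens, num_doc_tokens)
    return sim.max(axis=1).sum()     # best match per query token, summed

rng = np.random.default_rng(0)
query = rng.normal(size=(4, 8))      # 4 query tokens, embedding dim 8
# doc_a contains near-copies of the query tokens plus unrelated tokens;
# doc_b is entirely unrelated.
doc_a = np.vstack([query + 0.01 * rng.normal(size=(4, 8)),
                   rng.normal(size=(6, 8))])
doc_b = rng.normal(size=(10, 8))

score_match = maxsim(query, doc_a)
score_random = maxsim(query, doc_b)
```

Because document token embeddings can be computed and indexed offline, only the small query-side matrix product runs at query time, which is what makes this faster than a cross-encoder yet more precise than a single-vector bi-encoder.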

Key Challenges

Cross-encoder reranking is expensive — scoring 1000 passages per query with a 400M param model takes ~1 second on GPU

Zero-shot domain transfer: models trained on MS MARCO (web search) degrade on biomedical, legal, and scientific queries

Evaluation beyond MS MARCO: BEIR's diverse domains reveal that no single model dominates across all verticals

Query-document length mismatch: short queries vs. long documents require asymmetric encoding strategies

Freshness: keeping dense indices updated as documents change is operationally complex

Quick Recommendations

Best reranking accuracy

BGE-reranker-v2-m3 or Jina-reranker-v2-base-multilingual

Top cross-encoder accuracy on MS MARCO and BEIR; multilingual support

Scalable first-stage retrieval

E5-large-v2 or Nomic-embed-text-v1.5

Strong bi-encoder with fast ANN lookup; pairs well with any reranker

Best of both worlds

ColBERT-v2 (late interaction)

Token-level matching with precomputed document embeddings; faster than cross-encoder, more accurate than bi-encoder

LLM-based reranking

GPT-4o or Claude 3.5 Sonnet (listwise)

Listwise reranking in a single prompt; best for complex or nuanced relevance judgments

Hybrid sparse-dense

BM25 + E5 with RRF fusion

BM25 catches exact keyword matches that dense models miss; fusion consistently improves recall
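Reciprocal rank fusion itself is only a few lines: each ranked list contributes 1/(k + rank) per document, so documents that appear high in both the sparse and dense lists rise to the top. A sketch over two hypothetical ranked lists of document IDs (k=60 is the commonly used constant):

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: sum 1/(k + rank) across input rankings."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d7", "d2"]   # ordered by BM25 score
dense_ranking = ["d1", "d5", "d3", "d9"]  # ordered by bi-encoder similarity
fused = rrf([bm25_ranking, dense_ranking])
# d1 and d3 appear in both lists, so they lead the fused ranking
```

Note that RRF only uses rank positions, never raw scores, so it sidesteps the problem of calibrating BM25 scores against cosine similarities.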

What's Next

The frontier is learned sparse retrieval (SPLADE) that combines the interpretability of keyword search with neural relevance, multi-vector representations that go beyond single-vector bi-encoders, and end-to-end ranking models that jointly optimize retrieval and generation for RAG. Expect ranking to merge with LLM inference — models that retrieve, reason, and generate in a single forward pass.


Related Tasks

Natural Language Inference

Determining entailment relationships between sentences (SNLI, MNLI).

Reading Comprehension

Understanding and answering questions about passages.

Table Question Answering

Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.

Question Answering

Extractive and abstractive question answering is one of the oldest NLP benchmarks, from the original SQuAD (2016) to the adversarial complexity of Natural Questions and TriviaQA. Human parity on SQuAD 2.0 was claimed by ALBERT in 2020, effectively saturating the benchmark — but real-world QA over noisy documents, multi-hop reasoning (HotpotQA, MuSiQue), and long-context grounding remain far from solved. The paradigm has shifted from standalone QA models to retrieval-augmented generation (RAG), where the bottleneck moved from answer extraction to retrieval quality. Modern systems like Perplexity and Google's AI Overviews show that production QA is now an end-to-end pipeline problem, not a single-model benchmark.
