Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed first by ColBERT (2020), which introduced late interaction, and then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general-purpose LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer: can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.
Text ranking orders documents by relevance to a query, underpinning search engines, RAG pipelines, and recommendation systems. Cross-encoders (such as models fine-tuned on MS MARCO) provide the best accuracy, while bi-encoders and late-interaction models (ColBERT) enable scalable retrieval. The modern stack is bi-encoder retrieval + cross-encoder reranking.
History
Learning to Rank (LTR) with LambdaMART and gradient-boosted trees becomes the industry standard for web search
MS MARCO passage ranking dataset released — 8.8M passages, 1M queries; becomes THE ranking benchmark
BERT for passage re-ranking (Nogueira & Cho) shows transformers dramatically outperform traditional LTR features
ColBERT (Khattab & Zaharia) introduces late interaction — fast retrieval with token-level matching
DPR (Karpukhin et al.) shows dense bi-encoders can replace BM25 for first-stage retrieval
E5 and Contriever demonstrate that contrastive pretraining produces strong general-purpose retrievers
RankGPT shows LLMs can rerank listwise via prompting, comparing and ordering passages directly; RankLLaMA fine-tunes LLaMA as a pointwise reranker
ColBERT-v2, SPLADE++, and hybrid sparse-dense retrieval mature for production; BGE-reranker-v2 tops reranking benchmarks
BEIR benchmark drives focus on zero-shot ranking across diverse domains; NV-RerankQA and Jina-reranker-v2 push accuracy further
How Text Ranking Works
Query encoding
The user query is encoded into a dense vector (bi-encoder) or processed jointly with each candidate (cross-encoder)
First-stage retrieval
BM25 (sparse) or a dense bi-encoder retrieves top 100-1000 candidate passages from the full corpus via ANN search
Reranking
A cross-encoder scores each (query, passage) pair jointly, attending to fine-grained interactions between tokens
Score fusion (optional)
Hybrid search combines the BM25 and dense rankings using reciprocal rank fusion (RRF) for better recall
Result ordering
Final passages are ordered by reranker score and returned with relevance scores for downstream consumption
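The steps above can be sketched end to end. This is a toy, self-contained illustration: the "bi-encoder" is a deterministic hashed bag-of-words embedding and the "reranker" is a token-overlap stand-in, not real models; in production the stages would be a trained bi-encoder (e.g. E5) behind an ANN index and a trained cross-encoder.

```python
import zlib
import numpy as np

def embed(text, dim=64):
    """Stand-in bi-encoder: mean of per-token hashed random vectors, unit-normalized."""
    vecs = []
    for tok in text.lower().split():
        rng = np.random.default_rng(zlib.crc32(tok.encode()))  # deterministic per token
        vecs.append(rng.standard_normal(dim))
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)

def retrieve(query, corpus, k=3):
    """First stage: cosine similarity between query and document embeddings."""
    q = embed(query)
    doc_vecs = np.stack([embed(d) for d in corpus])
    scores = doc_vecs @ q                      # cosine, since rows are unit-norm
    order = np.argsort(-scores)[:k]
    return [(corpus[i], float(scores[i])) for i in order]

def rerank(query, candidates):
    """Second stage stand-in: score by token overlap. A real cross-encoder
    would jointly attend over each (query, passage) pair instead."""
    q_toks = set(query.lower().split())
    scored = [(doc, len(q_toks & set(doc.lower().split()))) for doc, _ in candidates]
    return sorted(scored, key=lambda x: -x[1])

corpus = [
    "bm25 is a sparse lexical retrieval model",
    "cross encoders jointly encode query and passage",
    "dense bi encoders map text to a single vector",
]
query = "how do cross encoders score a query and passage"
top = retrieve(query, corpus, k=3)   # stage 1: candidate set
ranked = rerank(query, top)          # stage 2: final ordering
print(ranked[0][0])
```

The shape is what matters: stage 1 touches the whole corpus with a cheap score, stage 2 spends expensive per-pair computation only on the shortlist.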
Current Landscape
Text ranking in 2025 follows a clear two-stage paradigm: fast retrieval (BM25, dense bi-encoder, or hybrid) followed by accurate reranking (cross-encoder or LLM). MS MARCO remains the training data backbone, but BEIR has revealed that domain-specific evaluation matters — models that excel on web search may fail on scientific or legal ranking. The rise of RAG has made ranking quality directly visible to end users: bad retrieval = bad LLM answers. ColBERT-style late interaction is emerging as the practical middle ground between speed and quality.
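That "middle ground" of late interaction comes down to the MaxSim operator: documents are stored as per-token vectors, and a query scores a document by summing, over its own tokens, the best cosine match among the document's tokens. A minimal sketch with random stand-in token embeddings (real ColBERT uses contextualized BERT token vectors):

```python
import numpy as np

def maxsim_score(q_tokens, d_tokens):
    """ColBERT-style MaxSim: for each query token vector, take its best
    cosine similarity against all document token vectors, then sum.
    Both inputs are (num_tokens, dim) arrays with unit-norm rows."""
    sim = q_tokens @ d_tokens.T          # token-by-token similarity matrix
    return float(sim.max(axis=1).sum())  # best doc-token match per query token

def unit_rows(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
q = unit_rows(rng.standard_normal((4, 8)))                     # 4 query tokens
doc_hit = unit_rows(np.vstack([q, rng.standard_normal((6, 8))]))  # contains the query tokens
doc_miss = unit_rows(rng.standard_normal((10, 8)))                # unrelated tokens

print(round(maxsim_score(q, doc_hit), 2))   # 4.0: each query token matches its copy exactly
print(maxsim_score(q, doc_hit) > maxsim_score(q, doc_miss))
```

Because document token vectors are precomputed offline, query time needs only this cheap matrix product per candidate, which is why late interaction sits between bi-encoder speed and cross-encoder accuracy.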
Key Challenges
Cross-encoder reranking is expensive — scoring 1,000 passages per query with a 400M-parameter model takes roughly a second on a GPU
Zero-shot domain transfer: models trained on MS MARCO (web search) degrade on biomedical, legal, and scientific queries
Evaluation beyond MS MARCO: BEIR's diverse domains reveal that no single model dominates across all verticals
Query-document length mismatch: short queries vs. long documents require asymmetric encoding strategies
Freshness: keeping dense indices updated as documents change is operationally complex
Quick Recommendations
Best reranking accuracy
BGE-reranker-v2-m3 or Jina-reranker-v2-base-multilingual
Top cross-encoder accuracy on MS MARCO and BEIR; multilingual support
Scalable first-stage retrieval
E5-large-v2 or Nomic-embed-text-v1.5
Strong bi-encoder with fast ANN lookup; pairs well with any reranker
Best of both worlds
ColBERT-v2 (late interaction)
Token-level matching with precomputed document embeddings; faster than cross-encoder, more accurate than bi-encoder
LLM-based reranking
GPT-4o or Claude 3.5 Sonnet (listwise)
Listwise reranking in a single prompt; best for complex or nuanced relevance judgments
Hybrid sparse-dense
BM25 + E5 with RRF fusion
BM25 catches exact keyword matches that dense models miss; fusion consistently improves recall
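RRF itself is only a few lines: each system's ranked list contributes 1/(k + rank) per document, with k = 60 the commonly used constant from the original RRF formulation. A minimal sketch (the doc ids and the two input rankings below are made up for illustration):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: rankings is a list of ranked doc-id lists
    (best first); returns a single fused ranking. Uses only ranks, so
    BM25 and dense scores never need to be calibrated against each other."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7", "d2"]   # lexical ranking
dense_top = ["d1", "d5", "d3", "d9"]  # dense bi-encoder ranking
fused = rrf_fuse([bm25_top, dense_top])
print(fused)
```

Note that d1 wins even though neither system ranked it uniformly first: agreement across systems is exactly what RRF rewards, which is why fusion tends to improve recall over either ranking alone.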
What's Next
The frontier is learned sparse retrieval (SPLADE) that combines the interpretability of keyword search with neural relevance, multi-vector representations that go beyond single-vector bi-encoders, and end-to-end ranking models that jointly optimize retrieval and generation for RAG. Expect ranking to merge with LLM inference — models that retrieve, reason, and generate in a single forward pass.
Benchmarks & SOTA
Related Tasks
Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Reading Comprehension
Understanding and answering questions about passages.
Table Question Answering
Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.
Question Answering
Extractive and abstractive question answering is one of the oldest NLP benchmarks, from the original SQuAD (2016) to the adversarial complexity of Natural Questions and TriviaQA. Human parity on SQuAD 2.0 was claimed by ALBERT in 2020, effectively saturating the benchmark — but real-world QA over noisy documents, multi-hop reasoning (HotpotQA, MuSiQue), and long-context grounding remain far from solved. The paradigm has shifted from standalone QA models to retrieval-augmented generation (RAG), where the bottleneck moved from answer extraction to retrieval quality. Modern systems like Perplexity and Google's AI Overviews show that production QA is now an end-to-end pipeline problem, not a single-model benchmark.