Text Summarization

Text summarization compresses documents while preserving key information — a task that LLMs made dramatically better but also harder to evaluate. PEGASUS (2020) and BART set the encoder-decoder baseline, but GPT-4 and Claude produce summaries that human evaluators often prefer over the reference summaries themselves, which breaks ROUGE as a meaningful metric. CNN/DailyMail and XSum remain the standard benchmarks, but the field is moving toward long-document summarization (books, legal filings, earnings calls), where 100K+ token context windows are finally making single-pass summarization feasible. The core unsolved problem is faithfulness: even frontier models hallucinate facts in roughly 5-15% of summaries, making factual consistency the metric that separates production-ready from demo-ready.
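To make the ROUGE discussion concrete, here is a minimal sketch of ROUGE-1 F1 — unigram overlap between a candidate and a reference summary. It assumes whitespace tokenization and lowercasing only; the official ROUGE implementation adds Porter stemming and bootstrap confidence intervals, so treat this as illustrative rather than score-compatible.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall.

    Simplified: lowercase + whitespace split; no stemming or stopword
    handling, unlike the official ROUGE toolkit.
    """
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# One shared word swapped out of six: P = R = 5/6, so F1 = 5/6.
print(round(rouge1_f1("the cat sat on the mat",
                      "the cat lay on the mat"), 3))  # → 0.833
```

The clipped-count intersection (`ref & cand`) is what keeps a candidate from gaming recall by repeating a reference word; it is also exactly why ROUGE rewards lexical overlap rather than meaning, and why abstractive LLM summaries can score poorly while reading better.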

1 dataset · 0 results · Canonical metric: ROUGE-1
Canonical Benchmark

CNN/DailyMail

300K news articles with multi-sentence summaries. Standard benchmark for abstractive summarization.

Primary metric: ROUGE-1

Top 10

Leading models on CNN/DailyMail.

No results yet. Be the first to contribute.

What were you looking for on Text Summarization?

Didn't find the model, metric, or dataset you needed? Tell us in one line. We read every message and reply within 48 hours.

All datasets

1 dataset tracked for this task.

Related tasks

Other tasks in Natural Language Processing.
