Text summarization compresses documents while preserving key information — a task at which LLMs became dramatically more capable, and simultaneously harder to evaluate. PEGASUS (2020) and BART set the encoder-decoder baseline, but GPT-4 and Claude produce summaries that human evaluators often prefer over the reference summaries themselves, which breaks ROUGE as a meaningful metric: ROUGE rewards n-gram overlap with a reference, so a fluent paraphrase scores poorly even when it is accurate. CNN/DailyMail and XSum remain the standard benchmarks, but the field is moving toward long-document summarization (books, legal filings, earnings calls), where 100K+ token context windows are finally making single-pass summarization feasible. The core unsolved problem is faithfulness — even frontier models hallucinate facts in roughly 5-15% of summaries, making factual consistency the metric that separates production-ready from demo-ready.
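To see why ROUGE breaks down for abstractive models, here is a minimal sketch of ROUGE-N recall (the example sentences and the from-scratch implementation are illustrative, not any benchmark's official scorer): a summary that copies the reference's wording scores high, while a faithful paraphrase scores low.

```python
from collections import Counter

def rouge_n(reference: str, candidate: str, n: int = 1) -> float:
    """ROUGE-N recall: fraction of reference n-grams also present in the candidate."""
    def ngrams(text: str) -> Counter:
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    ref, cand = ngrams(reference), ngrams(candidate)
    if not ref:
        return 0.0
    # Clipped overlap: each reference n-gram can be matched at most as often
    # as it appears in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items() if gram in ref)
    return overlap / sum(ref.values())

reference = "the court rejected the appeal on tuesday"
extractive = "the court rejected the appeal"              # reuses reference wording
abstractive = "judges turned down the challenge tuesday"  # same meaning, new words

print(rouge_n(reference, extractive))   # ~0.71 — high overlap
print(rouge_n(reference, abstractive))  # ~0.29 — penalized despite being faithful
```

The gap between the two scores is the failure mode: ROUGE cannot distinguish a faithful paraphrase from a hallucination, which is why human preference and factual-consistency checks have displaced it for LLM-era summaries.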
~300K CNN and Daily Mail news articles paired with multi-sentence highlight summaries. The standard benchmark for abstractive summarization.
Leading models on CNN/DailyMail.
No results yet. Be the first to contribute.
Didn't find the model, metric, or dataset you needed? Tell us in one line. We read every message and reply within 48 hours.
1 dataset tracked for this task.
Other tasks in Natural Language Processing.
Still looking for something on Text Summarization? A missing model, a stale score, a benchmark we should cover — drop it here and we'll take care of it.
Real humans read every message. We track what people are asking for and prioritize accordingly.