Text summarization compresses documents while preserving key information — a task at which LLMs became dramatically more capable, and simultaneously harder to evaluate. PEGASUS (2020) and BART set the encoder-decoder baseline, but GPT-4 and Claude produce summaries that human evaluators often prefer over the reference summaries themselves, which breaks ROUGE as a meaningful metric: ROUGE rewards n-gram overlap with a reference, so a fluent paraphrase scores poorly even when it is accurate. CNN/DailyMail and XSum remain the standard benchmarks, but the field is moving toward long-document summarization (books, legal filings, earnings calls), where 100K+ token context windows are finally making single-pass summarization feasible. The core unsolved problem is faithfulness — even frontier models hallucinate facts in roughly 5-15% of summaries, making factual consistency the metric that separates production-ready from demo-ready.
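To see why ROUGE breaks down for abstractive models, here is a minimal sketch of ROUGE-N recall (the example sentences and the from-scratch implementation are illustrative, not any benchmark's official scorer): a summary that copies the reference's wording scores high, while a faithful paraphrase scores low.

```python
from collections import Counter

def rouge_n(reference: str, candidate: str, n: int = 1) -> float:
    """ROUGE-N recall: fraction of reference n-grams also present in the candidate."""
    def ngrams(text: str) -> Counter:
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    ref, cand = ngrams(reference), ngrams(candidate)
    if not ref:
        return 0.0
    # Clipped overlap: each reference n-gram can be matched at most as often
    # as it appears in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items() if gram in ref)
    return overlap / sum(ref.values())

reference = "the court rejected the appeal on tuesday"
extractive = "the court rejected the appeal"              # reuses reference wording
abstractive = "judges turned down the challenge tuesday"  # same meaning, new words

print(rouge_n(reference, extractive))   # ~0.71 — high overlap
print(rouge_n(reference, abstractive))  # ~0.29 — penalized despite being faithful
```

The gap between the two scores is the failure mode: ROUGE cannot distinguish a faithful paraphrase from a hallucination, which is why human preference and factual-consistency checks have displaced it for LLM-era summaries.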
~300K CNN and Daily Mail news articles paired with multi-sentence highlight summaries. The standard benchmark for abstractive summarization.
Leading models on CNN/DailyMail.
No results yet. Be the first to contribute.
Didn't find the model, metric, or dataset you needed? Tell us in one line. We read every message and reply within 48 hours.
1 dataset tracked for this task.
Other tasks in Natural Language Processing.
Still looking for something on Text Summarization? A missing model, a stale score, a benchmark we should cover — drop it here and we'll take care of it.
Real humans read every message. We track what people are asking for and prioritize accordingly.