Text Summarization
Text summarization compresses documents while preserving key information, a task where LLMs brought a dramatic jump in quality but also made evaluation harder. PEGASUS (2020) and BART set the encoder-decoder baseline, but GPT-4 and Claude now produce summaries that human evaluators often prefer over the reference summaries themselves, which breaks ROUGE as a meaningful metric. CNN/DailyMail and XSum remain standard benchmarks, but the field is moving toward long-document summarization (books, legal filings, earnings calls), where 100K+ token context windows are finally making single-pass summarization feasible. The core unsolved problem is faithfulness: even frontier models hallucinate facts in roughly 5-15% of summaries, making factual consistency the critical metric that separates production-ready systems from demos.
Text summarization condenses documents into shorter versions while preserving key information. LLMs have largely replaced fine-tuned models for abstractive summarization, with Claude and GPT-4 producing human-preferred summaries. The remaining challenges are faithfulness (no hallucinated facts), long-document handling, and evaluation metrics that actually correlate with quality.
History
2004: ROUGE (Lin) becomes the standard automatic evaluation metric for summarization, despite known limitations
2015: Rush et al. introduce neural abstractive summarization with attention-based seq2seq models
2017: See et al.'s Pointer-Generator network addresses OOV words and repetition in abstractive summarization
2019: BART (Lewis et al.) and T5 (Raffel et al.) achieve SOTA on CNN/DailyMail through denoising pretraining
2020: PEGASUS (Zhang et al., Google) introduces gap-sentence generation pretraining designed specifically for summarization
2020: Longformer and its encoder-decoder variant LED extend transformer context to 16K tokens for long-document summarization
2022: InstructGPT and ChatGPT produce summaries preferred by humans over those of fine-tuned models, challenging ROUGE-based evaluation
2023: Claude 100K and GPT-4 32K enable summarization of entire books and reports in a single pass
2024: Gemini 1.5 (1M tokens) and Claude 3.5 handle document collections; faithfulness evaluation (FactScore, AlignScore) matures
How Text Summarization Works
Document encoding
The full document is encoded by the transformer; long documents may be chunked or use sparse attention patterns
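A minimal sketch of the chunking step in Python, using whitespace tokens as a stand-in for the model's real subword tokenizer; the window and overlap sizes are illustrative:

```python
def chunk_document(text: str, max_tokens: int = 1024, overlap: int = 128) -> list[str]:
    """Split a long document into overlapping windows that fit the encoder.

    Whitespace tokens stand in for real subword tokens; production systems
    use the model's tokenizer and often prefer section or paragraph boundaries.
    """
    tokens = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, max(len(tokens), 1), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break
    return chunks
```

A common pattern is to summarize each chunk and then summarize the partial summaries, although models with very long context windows can skip chunking entirely.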
Content selection
Extractive approaches select key sentences; abstractive models learn implicitly which content to include through attention
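A toy illustration of extractive content selection (the frequency scoring here is only a sketch; LexRank uses graph centrality over sentence similarities and BERTSum a learned sentence encoder):

```python
import re
from collections import Counter

def extract_top_sentences(text: str, k: int = 3) -> str:
    """Score each sentence by the document-level frequency of its words and
    return the top-k sentences in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"\w+", sentence.lower())
        return sum(freq[w] for w in words) / (len(words) or 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:k]))
```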
Abstractive generation
The decoder generates a summary token by token, paraphrasing and compressing the source material
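To make the token-by-token loop concrete, here is greedy decoding spelled out with a BART checkpoint; this is only a sketch, since real systems call model.generate() with beam search or sampling rather than looping manually:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn").eval()

def greedy_summary(document: str, max_new_tokens: int = 64) -> str:
    inputs = tokenizer(document, truncation=True, max_length=1024, return_tensors="pt")
    # The decoder starts from its start token and appends one argmax token per step.
    decoded = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(**inputs, decoder_input_ids=decoded).logits
            next_token = logits[0, -1].argmax().view(1, 1)
            decoded = torch.cat([decoded, next_token], dim=1)
            if next_token.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(decoded[0], skip_special_tokens=True)
```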
Faithfulness control
Advanced systems add post-hoc fact verification or constrained decoding to prevent hallucinated facts
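A sketch of post-hoc verification with an off-the-shelf NLI model, in the spirit of SummaC-style consistency checks; the checkpoint and threshold are illustrative, and long sources would need to be chunked to fit the NLI model's input limit:

```python
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def unsupported_sentences(source: str, summary_sentences: list[str],
                          threshold: float = 0.5) -> list[str]:
    """Flag summary sentences whose entailment probability against the source
    falls below the threshold; flagged sentences can be revised or dropped."""
    flagged = []
    for sentence in summary_sentences:
        scores = nli({"text": source, "text_pair": sentence}, top_k=None)
        entailment = next(s["score"] for s in scores if s["label"].lower() == "entailment")
        if entailment < threshold:
            flagged.append(sentence)
    return flagged
```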
Length control
Summary length is controlled via max tokens, length penalties in beam search, or explicit instructions in LLM prompts
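A sketch of the fine-tuned-model levers, with illustrative values; for an instruction-tuned LLM the equivalent control is a prompt constraint such as "Summarize in at most three sentences":

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = "..."  # source document text
summary = summarizer(
    article,
    min_length=20,       # lower bound on generated tokens
    max_length=60,       # hard cap on generated tokens
    num_beams=4,
    length_penalty=0.8,  # lower values make beam search favor shorter outputs
    truncation=True,     # truncate inputs that exceed the encoder limit
)[0]["summary_text"]
```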
Current Landscape
Summarization in 2025 has been transformed by LLMs. Fine-tuned models like BART and PEGASUS are now legacy options for high-quality summarization: GPT-4 and Claude produce summaries that humans prefer over fine-tuned baselines 70%+ of the time. The real innovation is in evaluation, where ROUGE is being supplemented by LLM-as-judge protocols and factual-consistency metrics (FactScore, SummaC). Production systems increasingly use LLMs for quality, with extractive methods as a fallback when verbatim faithfulness must be guaranteed.
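As an illustration of the LLM-as-judge pattern, a sketch using the OpenAI Python client; the judge model, rubric, and prompt wording are assumptions rather than a standardized metric:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """Rate the summary against the source on two 1-5 scales:
faithfulness (no facts absent from the source) and coverage (key points kept).
Respond as JSON: {{"faithfulness": int, "coverage": int, "issues": [str]}}.

SOURCE:
{source}

SUMMARY:
{summary}"""

def judge_summary(source: str, summary: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(source=source, summary=summary)}],
        temperature=0,
    )
    return response.choices[0].message.content
```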
Key Challenges
Faithfulness: fine-tuned abstractive models hallucinate facts not present in the source document in roughly 30% of summaries on standard benchmarks, and frontier LLMs reduce but do not eliminate the problem
ROUGE is a poor proxy for summary quality — it measures n-gram overlap, not informativeness or coherence
Long-document summarization (books, legal filings, reports) still suffers from information loss in the middle of context windows
Multi-document summarization with conflicting information requires reconciliation and source attribution
Domain-specific summarization (medical, legal) needs terminology precision that general models lack
Quick Recommendations
Best quality (general)
Claude 3.5 Sonnet or GPT-4o
Human-preferred summaries with strong faithfulness; handles 100K+ token documents (see the API sketch after this list)
Scientific paper summarization
Claude 3.5 with structured prompting
Can summarize full papers including methods and results with technical accuracy
Production (self-hosted)
BART-large-CNN or PEGASUS-large fine-tuned
400M params, fast inference, good quality for news-style summarization
Extractive summarization
BERTSum or LexRank
No hallucination risk; selects verbatim sentences from the source
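A minimal sketch of the first recommendation using the Anthropic Python SDK; the model string, prompt wording, and token budget are illustrative:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

def summarize(document: str, style: str = "an executive brief of five bullet points") -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # illustrative model string
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Summarize the following document as {style}. "
                       f"Use only facts stated in the document.\n\n{document}",
        }],
    )
    return response.content[0].text
```

The same call covers the scientific-paper recommendation by changing the style instruction, for example asking for methods, results, and limitations as separate sections.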
What's Next
The frontier is interactive summarization (users refine and drill into summaries conversationally), multi-modal summarization (combining text, figures, and tables), and guaranteed-faithful generation with formal verification of factual claims against source documents. Expect specialized summarization agents that produce different summary types (executive brief, technical detail, key findings) from a single document.
Benchmarks & SOTA
Related Tasks
Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Reading Comprehension
Understanding and answering questions about passages.
Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.
Table Question Answering
Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.