
Text Summarization

Text summarization compresses documents while preserving key information, a task where LLMs delivered a dramatic jump in quality while making evaluation harder. PEGASUS (2020) and BART set the encoder-decoder baseline, but GPT-4 and Claude now produce summaries that human evaluators often prefer over the reference summaries themselves, which undermines ROUGE as a meaningful metric. CNN/DailyMail and XSum remain the standard benchmarks, but the field is moving toward long-document summarization (books, legal filings, earnings calls), where 100K+ token context windows finally make single-pass summarization feasible. The core unsolved problem is faithfulness: even frontier models hallucinate facts in roughly 5-15% of summaries, making factual consistency the metric that separates production-ready from demo-ready systems.


Text summarization condenses documents into shorter versions while preserving key information. LLMs have largely replaced fine-tuned models for abstractive summarization, with Claude and GPT-4 producing human-preferred summaries. The remaining challenges are faithfulness (no hallucinated facts), long-document handling, and evaluation metrics that actually correlate with quality.

History

2004

ROUGE metric (Lin) becomes the standard automatic evaluation for summarization, despite known limitations

2015

Rush et al. introduce neural abstractive summarization with attention-based seq2seq models

2017

See et al.'s Pointer-Generator network addresses OOV words and repetition in abstractive summarization

2019

BART (Lewis et al.) and T5 (Raffel et al.) achieve SOTA on CNN/DailyMail through denoising pretraining

2020

PEGASUS (Zhang et al., Google) introduces gap-sentence pretraining specifically designed for summarization

2021

Longformer's encoder-decoder variant (LED) gains adoption, extending transformer context to 16K tokens for long-document summarization

2022

InstructGPT and ChatGPT produce summaries preferred by humans over fine-tuned models, challenging ROUGE-based evaluation

2023

Claude 100K and GPT-4 32K enable summarization of entire books and reports in a single pass

2024

Gemini 1.5 (1M tokens) and Claude 3.5 handle document collections; faithfulness evaluation (FactScore, AlignScore) matures

How Text Summarization Works

Text Summarization Pipeline
1

Document encoding

The full document is encoded by the transformer; long documents may be chunked or use sparse attention patterns

2

Content selection

Extractive approaches select key sentences; abstractive models learn implicitly which content to include through attention

3

Abstractive generation

The decoder generates a summary token by token, paraphrasing and compressing the source material

4

Faithfulness control

Advanced systems add post-hoc fact verification or constrained decoding to prevent hallucinated facts

5

Length control

Summary length is controlled via max tokens, length penalties in beam search, or explicit instructions in LLM prompts
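The pipeline above, applied to documents longer than the model's context window, is often implemented as a map-reduce loop: chunk the document (step 1), summarize each chunk, then summarize the concatenation of chunk summaries. A minimal sketch, with any model call abstracted behind a `summarize` callable (the `first_sentence` stub below is purely illustrative, not a real summarizer):

```python
from typing import Callable, List

def chunk_text(text: str, max_words: int = 512, overlap: int = 32) -> List[str]:
    """Split a long document into overlapping word-window chunks (step 1)."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
        start += max_words - overlap  # overlap preserves cross-chunk context
    return chunks

def map_reduce_summarize(text: str,
                         summarize: Callable[[str], str],
                         max_words: int = 512) -> str:
    """Summarize each chunk, then summarize the joined chunk summaries."""
    partials = [summarize(c) for c in chunk_text(text, max_words)]  # map
    if len(partials) == 1:
        return partials[0]
    return summarize(" ".join(partials))                            # reduce

# Illustrative stub only: "summarize" by keeping the first sentence.
def first_sentence(text: str) -> str:
    return text.split(". ")[0].strip()
```

In practice `summarize` would wrap an LLM or fine-tuned model call, and length control (step 5) would be enforced inside it via max tokens or an explicit length instruction in the prompt.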

Current Landscape

Summarization in 2025 has been transformed by LLMs. Fine-tuned models like BART and PEGASUS are now legacy options for high-quality abstractive summarization; in head-to-head human evaluations, GPT-4 and Claude summaries are preferred well over half the time, often 70%+. The real innovation is in evaluation: ROUGE is being supplemented by LLM-as-judge protocols and factual consistency metrics (FactScore, SummaC). Production systems increasingly use LLMs for quality, with extractive methods as a fallback when verbatim faithfulness must be guaranteed.
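Real factual-consistency metrics such as FactScore and SummaC rely on LLM or NLI scoring, but the underlying idea can be illustrated with a much cruder source-coverage check: flag summary content words that never appear in the source as potential hallucinations. A toy sketch (the stopword list and example strings are assumptions for illustration, and this is no substitute for NLI-based metrics, which catch paraphrased contradictions this check misses):

```python
import re

STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "was", "by"}

def unsupported_terms(source: str, summary: str) -> set:
    """Content words in the summary that never appear in the source --
    a cheap hallucination flag for production fallback logic."""
    tok = lambda t: set(re.findall(r"[a-z0-9]+", t.lower())) - STOPWORDS
    return tok(summary) - tok(source)

source = "Acme reported revenue of 4 billion dollars in Q3."
summary = "Acme reported 5 billion dollars revenue in Q3."
print(unsupported_terms(source, summary))  # {'5'} -- the unsupported figure
```

A system might route summaries with a nonempty unsupported-term set to a verification step or fall back to an extractive summary.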

Key Challenges

Faithfulness: fine-tuned abstractive models hallucinate facts not present in the source document in up to ~30% of summaries on standard benchmarks, and even frontier LLMs still do so in roughly 5-15%

ROUGE is a poor proxy for summary quality — it measures n-gram overlap, not informativeness or coherence

Long-document summarization (books, legal filings, reports) still suffers from information loss in the middle of context windows

Multi-document summarization with conflicting information requires reconciliation and source attribution

Domain-specific summarization (medical, legal) needs terminology precision that general models lack
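The ROUGE weakness above is easy to demonstrate. ROUGE-N recall is just the fraction of reference n-grams that the candidate covers, so a summary that inverts a fact by swapping a single word scores almost as high as a perfect one. A minimal sketch (the example sentences are invented for illustration; real ROUGE implementations also apply stemming and report precision/F1):

```python
from collections import Counter

def rouge_n_recall(reference: str, candidate: str, n: int = 1) -> float:
    """ROUGE-N recall: fraction of reference n-grams covered by the candidate."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    ref, cand = ngrams(reference), ngrams(candidate)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

ref  = "profits rose 10 percent in the third quarter"
good = "profits rose 10 percent in the third quarter"
bad  = "profits fell 10 percent in the third quarter"  # factually inverted
print(rouge_n_recall(ref, good))  # 1.0
print(rouge_n_recall(ref, bad))   # 0.875 -- one swapped word barely hurts
```

The 0.875 vs 1.0 gap is why n-gram overlap cannot stand in for factual consistency, and why metrics like FactScore and SummaC exist.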

Quick Recommendations

Best quality (general)

Claude 3.5 Sonnet or GPT-4o

Human-preferred summaries with strong faithfulness; handles documents up to 100K+ tokens

Scientific paper summarization

Claude 3.5 with structured prompting

Can summarize full papers including methods and results with technical accuracy

Production (self-hosted)

BART-large-CNN or PEGASUS-large fine-tuned

~400M (BART-large) to ~570M (PEGASUS-large) params; fast inference and good quality for news-style summarization

Extractive summarization

BERTSum or LexRank

No hallucination risk; selects verbatim sentences from the source
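The extractive route can be sketched with simple frequency-based sentence scoring, a much-simplified baseline in the spirit of LexRank (not BERTSum or LexRank itself): score each sentence by the average corpus frequency of its words and return the top-k in source order. Because output sentences are copied verbatim, nothing can be hallucinated.

```python
import re
from collections import Counter

def extractive_summary(text: str, k: int = 2) -> str:
    """Score sentences by average word frequency; return top-k in source order.
    Sentences are copied verbatim, so no facts can be hallucinated."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))
    def score(sentence):
        toks = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[t] for t in toks) / max(len(toks), 1)
    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]),
                 reverse=True)[:k]
    return " ".join(sentences[i] for i in sorted(top))
```

Real extractive systems replace the frequency score with graph centrality (LexRank) or learned sentence representations (BERTSum), but the select-verbatim-sentences contract is the same.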

What's Next

The frontier is interactive summarization (users refine and drill into summaries conversationally), multi-modal summarization (combining text, figures, and tables), and guaranteed-faithful generation with formal verification of factual claims against source documents. Expect specialized summarization agents that produce different summary types (executive brief, technical detail, key findings) from a single document.

Benchmarks & SOTA

Related Tasks

Natural Language Inference

Determining entailment relationships between sentences (SNLI, MNLI).

Reading Comprehension

Understanding and answering questions about passages.

Text Ranking

Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.

Table Question Answering

Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.
