Text Summarization
Text summarization compresses documents while preserving key information, a task where LLMs brought a dramatic jump in quality but also made evaluation harder. PEGASUS (2020) and BART set the encoder-decoder baseline, but GPT-4 and Claude now produce summaries that human evaluators often prefer over the reference summaries themselves, which breaks ROUGE as a meaningful metric. CNN/DailyMail and XSum remain standard benchmarks, but the field is moving toward long-document summarization (books, legal filings, earnings calls), where 100K+ token context windows are finally making single-pass summarization feasible. The core unsolved problem is faithfulness: even frontier models hallucinate facts in roughly 5-15% of summaries, making factual consistency the critical metric that separates production-ready systems from demos.
Text summarization condenses documents into shorter versions while preserving key information. LLMs have largely replaced fine-tuned models for abstractive summarization, with Claude and GPT-4 producing human-preferred summaries. The remaining challenges are faithfulness (no hallucinated facts), long-document handling, and evaluation metrics that actually correlate with quality.
History
2004: ROUGE (Lin) becomes the standard automatic evaluation metric for summarization, despite known limitations
2015: Rush et al. introduce neural abstractive summarization with attention-based seq2seq models
2017: See et al.'s Pointer-Generator network addresses OOV words and repetition in abstractive summarization
2019: BART (Lewis et al.) and T5 (Raffel et al.) achieve SOTA on CNN/DailyMail through denoising pretraining
2020: PEGASUS (Zhang et al., Google) introduces gap-sentence generation pretraining designed specifically for summarization
2020: Longformer and its encoder-decoder variant LED extend transformer context to 16K tokens for long-document summarization
2022: InstructGPT and ChatGPT produce summaries preferred by humans over those of fine-tuned models, challenging ROUGE-based evaluation
2023: Claude 100K and GPT-4 32K enable summarization of entire books and reports in a single pass
2024: Gemini 1.5 (1M tokens) and Claude 3.5 handle document collections; faithfulness evaluation (FactScore, AlignScore) matures
How Text Summarization Works
Document encoding
The full document is encoded by the transformer; long documents may be chunked or use sparse attention patterns
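A minimal sketch of the chunking step in Python, using whitespace tokens as a stand-in for the model's real subword tokenizer; the window and overlap sizes are illustrative:

```python
def chunk_document(text: str, max_tokens: int = 1024, overlap: int = 128) -> list[str]:
    """Split a long document into overlapping windows that fit the encoder.

    Whitespace tokens stand in for real subword tokens; production systems
    use the model's tokenizer and often prefer section or paragraph boundaries.
    """
    tokens = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, max(len(tokens), 1), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break
    return chunks
```

A common pattern is to summarize each chunk and then summarize the partial summaries, although models with very long context windows can skip chunking entirely.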
Content selection
Extractive approaches select key sentences; abstractive models learn implicitly which content to include through attention
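A toy illustration of extractive content selection (the frequency scoring here is only a sketch; LexRank uses graph centrality over sentence similarities and BERTSum a learned sentence encoder):

```python
import re
from collections import Counter

def extract_top_sentences(text: str, k: int = 3) -> str:
    """Score each sentence by the document-level frequency of its words and
    return the top-k sentences in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"\w+", sentence.lower())
        return sum(freq[w] for w in words) / (len(words) or 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:k]))
```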
Abstractive generation
The decoder generates a summary token by token, paraphrasing and compressing the source material
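To make the token-by-token loop concrete, here is greedy decoding spelled out with a BART checkpoint; this is only a sketch, since real systems call model.generate() with beam search or sampling rather than looping manually:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn").eval()

def greedy_summary(document: str, max_new_tokens: int = 64) -> str:
    inputs = tokenizer(document, truncation=True, max_length=1024, return_tensors="pt")
    # The decoder starts from its start token and appends one argmax token per step.
    decoded = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(**inputs, decoder_input_ids=decoded).logits
            next_token = logits[0, -1].argmax().view(1, 1)
            decoded = torch.cat([decoded, next_token], dim=1)
            if next_token.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(decoded[0], skip_special_tokens=True)
```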
Faithfulness control
Advanced systems add post-hoc fact verification or constrained decoding to prevent hallucinated facts
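A sketch of post-hoc verification with an off-the-shelf NLI model, in the spirit of SummaC-style consistency checks; the checkpoint and threshold are illustrative, and long sources would need to be chunked to fit the NLI model's input limit:

```python
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def unsupported_sentences(source: str, summary_sentences: list[str],
                          threshold: float = 0.5) -> list[str]:
    """Flag summary sentences whose entailment probability against the source
    falls below the threshold; flagged sentences can be revised or dropped."""
    flagged = []
    for sentence in summary_sentences:
        scores = nli({"text": source, "text_pair": sentence}, top_k=None)
        entailment = next(s["score"] for s in scores if s["label"].lower() == "entailment")
        if entailment < threshold:
            flagged.append(sentence)
    return flagged
```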
Length control
Summary length is controlled via max tokens, length penalties in beam search, or explicit instructions in LLM prompts
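A sketch of the fine-tuned-model levers, with illustrative values; for an instruction-tuned LLM the equivalent control is a prompt constraint such as "Summarize in at most three sentences":

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = "..."  # source document text
summary = summarizer(
    article,
    min_length=20,       # lower bound on generated tokens
    max_length=60,       # hard cap on generated tokens
    num_beams=4,
    length_penalty=0.8,  # lower values make beam search favor shorter outputs
    truncation=True,     # truncate inputs that exceed the encoder limit
)[0]["summary_text"]
```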
Current Landscape
Summarization in 2025 has been transformed by LLMs. Fine-tuned models like BART and PEGASUS are now legacy options for high-quality summarization: GPT-4 and Claude produce summaries that humans prefer over fine-tuned baselines 70%+ of the time. The real innovation is in evaluation, where ROUGE is being supplemented by LLM-as-judge protocols and factual-consistency metrics (FactScore, SummaC). Production systems increasingly use LLMs for quality, with extractive methods as a fallback when verbatim faithfulness must be guaranteed.
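As an illustration of the LLM-as-judge pattern, a sketch using the OpenAI Python client; the judge model, rubric, and prompt wording are assumptions rather than a standardized metric:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """Rate the summary against the source on two 1-5 scales:
faithfulness (no facts absent from the source) and coverage (key points kept).
Respond as JSON: {{"faithfulness": int, "coverage": int, "issues": [str]}}.

SOURCE:
{source}

SUMMARY:
{summary}"""

def judge_summary(source: str, summary: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(source=source, summary=summary)}],
        temperature=0,
    )
    return response.choices[0].message.content
```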
Key Challenges
Faithfulness: fine-tuned abstractive models hallucinate facts not present in the source document in roughly 30% of summaries on standard benchmarks, and frontier LLMs reduce but do not eliminate the problem
ROUGE is a poor proxy for summary quality — it measures n-gram overlap, not informativeness or coherence
Long-document summarization (books, legal filings, reports) still suffers from information loss in the middle of context windows
Multi-document summarization with conflicting information requires reconciliation and source attribution
Domain-specific summarization (medical, legal) needs terminology precision that general models lack
Quick Recommendations
Best quality (general)
Claude 3.5 Sonnet or GPT-4o
Human-preferred summaries with strong faithfulness; handles 100K+ token documents (see the API sketch after this list)
Scientific paper summarization
Claude 3.5 with structured prompting
Can summarize full papers including methods and results with technical accuracy
Production (self-hosted)
BART-large-CNN or PEGASUS-large fine-tuned
400M params, fast inference, good quality for news-style summarization
Extractive summarization
BERTSum or LexRank
No hallucination risk; selects verbatim sentences from the source
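A minimal sketch of the first recommendation using the Anthropic Python SDK; the model string, prompt wording, and token budget are illustrative:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

def summarize(document: str, style: str = "an executive brief of five bullet points") -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # illustrative model string
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Summarize the following document as {style}. "
                       f"Use only facts stated in the document.\n\n{document}",
        }],
    )
    return response.content[0].text
```

The same call covers the scientific-paper recommendation by changing the style instruction, for example asking for methods, results, and limitations as separate sections.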
What's Next
The frontier is interactive summarization (users refine and drill into summaries conversationally), multi-modal summarization (combining text, figures, and tables), and guaranteed-faithful generation with formal verification of factual claims against source documents. Expect specialized summarization agents that produce different summary types (executive brief, technical detail, key findings) from a single document.
Benchmarks & SOTA
Related Tasks
Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Reading Comprehension
Understanding and answering questions about passages.
Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.
Table Question Answering
Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.