Long-Context Summarization

Summarize 100K+ token inputs like transcripts, hearings, or books with structured outputs.

How Long-Context Summarization Works

When documents exceed your model's context window, you need a strategy: anything from chunking approaches to modern long-context models that handle 200K+ tokens in one pass.

1. The Problem: Documents Are Longer Than Context Windows

You have a 50-page report. You want a summary. But your model can only see 4,096 tokens at once. What do you do?

Your Document: 9,350 tokens
  • Executive Summary: 450
  • Market Analysis: 1,200
  • Product Development: 1,800
  • Financial Performance: 2,100
  • Customer Insights: 900
  • Strategic Initiatives: 1,500
  • Risk Assessment: 800
  • Q4 Outlook: 600

Model Context Windows
  • GPT-3.5 (4K): too small
  • GPT-4 (128K): fits easily
  • Claude (200K): fits with room

The Old Problem

With 4K-16K context windows, you had no choice but to chunk documents, summarize pieces separately, and somehow combine them. Information was inevitably lost.

The New Reality

Claude, GPT-4, and Gemini handle 128K-1M tokens. Most documents fit in one pass. Chunking strategies are now for edge cases, not the default.
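
Before choosing an approach, it helps to measure the document. A minimal sketch, assuming the tiktoken package as a rough proxy (Claude and Gemini tokenize differently, so treat the count as an estimate):

import tiktoken

def estimate_tokens(text: str) -> int:
    # cl100k_base is the GPT-4/GPT-3.5 encoding; counts for other models differ slightly
    return len(tiktoken.get_encoding("cl100k_base").encode(text))

with open("long_report.txt") as f:
    doc = f.read()

n = estimate_tokens(doc)
print(f"~{n} tokens, ~{n / 500:.0f} pages")  # ~500 tokens per page (see the estimates below)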

2. Summarization Strategies

Four approaches to summarizing long documents, from chunking-based methods to direct long-context processing.

Map-Reduce

Summarize each chunk, then summarize the summaries


Split the document into chunks that fit your model. Summarize each chunk independently (map phase). Then combine all chunk summaries and summarize again (reduce phase).

Flow
  1. Split: Doc -> [C1, C2, C3, C4]
  2. Map: [C1, C2, C3, C4] -> [S1, S2, S3, S4]
  3. Combine: [S1, S2, S3, S4] -> Combined
  4. Reduce: Combined -> Final Summary
Pros
  • + Parallelizable
  • + Works with any model
  • + Handles arbitrarily long docs
Cons
  • - Loses cross-chunk context
  • - Multiple API calls
  • - Can miss connections between sections
Best For

Very long documents (100K+ tokens) with independent sections
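
As a concrete sketch of that flow, here is a minimal Map-Reduce implementation using the same Anthropic client as the code example later on this page. The chunking helper and prompts are illustrative assumptions, not a library API; the map step could be parallelized with a thread pool.

from anthropic import Anthropic

client = Anthropic()
MODEL = "claude-sonnet-4-20250514"

def ask(text: str, instruction: str) -> str:
    response = client.messages.create(
        model=MODEL,
        max_tokens=1000,
        messages=[{"role": "user", "content": f"{instruction}\n\n{text}"}],
    )
    return response.content[0].text

def chunk_text(text: str, max_chars: int = 16000) -> list[str]:
    # Naive fixed-size split; see the chunking strategies section for better options
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def map_reduce_summarize(document: str) -> str:
    chunks = chunk_text(document)                                     # Split
    partials = [ask(c, "Summarize this section in about 100 words.")  # Map
                for c in chunks]
    combined = "\n\n".join(partials)                                  # Combine
    return ask(combined,
               "Merge these section summaries into a single 300-word summary.")  # Reduce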

Which Strategy Should You Use?

  • Default Choice: Long-Context Direct. If the doc fits in context, use this.
  • Very Long Docs: Map-Reduce. 100K+ tokens, parallel processing.
  • Narratives: Refine. Order matters, context flows.
  • Structured Docs: Hierarchical. Papers and reports with sections.
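
The decision in the table above can be encoded as a simple router. A sketch only; the thresholds and strategy names here are assumptions for illustration, not taken from any library:

import tiktoken

def pick_strategy(document: str, context_limit: int = 200_000,
                  narrative: bool = False, structured: bool = False) -> str:
    tokens = len(tiktoken.get_encoding("cl100k_base").encode(document))
    if tokens <= context_limit:
        return "long-context-direct"   # default: one pass, no chunking
    if narrative:
        return "refine"                # order matters, carry context forward
    if structured:
        return "hierarchical"          # summarize per section, then per document
    return "map-reduce"                # very long, independent sections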
3. Interactive: Watch Map-Reduce in Action

See how a long document is chunked, each chunk summarized, and summaries combined.

Document Chunks (Map Phase): Executive Summary (450 tokens), Market Analysis (1,200), Product Development (1,800), Financial Performance (2,100), Customer Insights (900), Strategic Initiatives (1,500), Risk Assessment (800), Q4 Outlook (600)

Phases: Split into Chunks -> Map: Summarize Each -> Combine Summaries -> Reduce: Final Summary -> Complete
The Key Insight
Map-Reduce trades completeness for scalability. Each chunk is summarized in isolation, so cross-chunk references can be lost. Use this when documents are truly too long for direct processing, or when you need parallel execution for speed.
4. Chunking Strategies: How to Split Documents

When you must chunk, how you split matters. The wrong boundaries can cut ideas in half.

Fixed Size

Split every N tokens, with optional overlap

chunks = TokenTextSplitter(chunk_size=4000, chunk_overlap=200).split_text(doc)
+Simple, predictable
-Cuts mid-sentence/thought
Recursive Character

Try to split at paragraphs, then sentences, then words

RecursiveCharacterTextSplitter(separators=["\n\n", "\n", ". ", " "])
+Preserves natural breaks
-Uneven chunk sizes
Semantic

Use embeddings to find natural topic boundaries

SemanticChunker(embeddings=OpenAIEmbeddings(), breakpoint_threshold_type="percentile")
+Coherent topics per chunk
-Slower, needs embedding model
Document Structure

Split by headers, sections, or document structure

MarkdownHeaderTextSplitter(headers_to_split_on=[("#", "H1"), ("##", "H2")])
+Preserves document hierarchy
-Needs structured input

Chunk Size Recommendations

  • Summarization: 2,000-4,000 tokens (enough context for a meaningful summary)
  • RAG Retrieval: 256-1,024 tokens (smaller for precise matching)
  • Overlap: 10-20% of chunk size (preserves context at boundaries)
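
Putting those recommendations together, a sketch using LangChain's recursive splitter. This assumes the langchain-text-splitters package; the from_tiktoken_encoder constructor measures chunk size in tokens rather than characters, matching the guidelines above.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=3000,     # inside the 2,000-4,000 token summarization range
    chunk_overlap=300,   # ~10% overlap to preserve context at boundaries
)

with open("long_report.txt") as f:
    chunks = splitter.split_text(f.read())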
5. Model Context Windows (2024-2025)

Know your model's limits. Modern long-context models have fundamentally changed the game.

  • GPT-3.5 Turbo (OpenAI): 16K tokens (~33 pages)
  • GPT-4 Turbo (OpenAI): 128K tokens (~256 pages)
  • GPT-4o (OpenAI): 128K tokens (~256 pages)
  • Claude 3 Sonnet (Anthropic): 200K tokens (~400 pages)
  • Claude 3.5 Sonnet (Anthropic): 200K tokens (~400 pages)
  • Gemini 1.5 Pro (Google): 1M tokens (~2,000 pages)
  • Gemini 1.5 Flash (Google): 1M tokens (~2,000 pages)
  • Llama 3.1 405B (Meta): 128K tokens (~256 pages)

Rule of thumb: ~500 tokens per page, so 128K tokens is roughly 250 pages (GPT-4 Turbo), 200K is roughly 400 pages (Claude 3.5), and 1M is roughly 2,000 pages (Gemini 1.5 Pro).

The Attention Dilution Problem

Just because a model can accept 200K tokens doesn't mean it attends equally to all of them. Research shows LLMs often "lose" information in the middle of long contexts. For critical summarization, consider placing the most important content at the beginning or end.
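
One cheap mitigation is to structure the prompt so the instruction and the most critical material sit at the edges of the context. A minimal sketch; key_sections and body are hypothetical variables for illustration:

def build_summary_prompt(key_sections: str, body: str, task: str) -> str:
    # Instruction and critical content up front, bulk in the middle,
    # and the instruction repeated at the end where attention is strongest
    return (
        f"{task}\n\n"
        f"Most important sections:\n{key_sections}\n\n"
        f"Full document:\n{body}\n\n"
        f"Reminder: {task}"
    )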

6. Code Examples

From LangChain's built-in chains to direct long-context API calls.

Long-Context (Claude)
Recommended
from anthropic import Anthropic

client = Anthropic()

def summarize_long_document(document: str, max_summary_words: int = 500) -> str:
    """
    Summarize a long document using Claude's 200K context window.
    No chunking needed for most documents.
    """
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # or claude-3-5-sonnet
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": f"""Summarize the following document in approximately
{max_summary_words} words. Focus on:
1. Key findings and conclusions
2. Important data points and metrics
3. Action items or recommendations

Document:
{document}

Summary:"""
        }]
    )
    return response.content[0].text

# For very long documents (100K+ tokens), still simple:
with open("long_report.txt") as f:
    full_document = f.read()

summary = summarize_long_document(full_document)
Recommended: Start Here

Use direct long-context calls with Claude or GPT-4 Turbo for most documents. Only reach for chunking strategies when documents exceed 200K tokens. (Example: Long-Context (Claude), above.)

For Very Long Documents

When you have documents longer than any model's context, use Map-Reduce for parallel processing or Hierarchical for structured content. (Patterns: Map-Reduce, Hierarchical.)

Quick Reference

Modern Approach
  • Use long-context models by default
  • Claude 200K / GPT-4 128K fits most docs
  • Gemini 1M for truly massive content
  • No chunking = no information loss
When You Must Chunk
  • Map-Reduce for parallel processing
  • Refine for sequential narratives
  • Hierarchical for structured docs
  • Use semantic chunking when possible
Best Practices
  • Check doc length before choosing strategy
  • Place key content at start/end
  • Use overlap to preserve context
  • Consider cost: long context = more tokens

Use Cases

  • Hour-long meetings
  • Earnings calls
  • Legal discovery
  • Book/episode recaps

Architectural Patterns

Sliding-Window + Merge

Chunk the document, then merge summaries hierarchically (see the sketch after this list).

Native Long-Context LLM

Directly ingest long sequences (1M+ tokens).
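
A sketch of the Sliding-Window + Merge pattern, reusing the Anthropic call style from the code examples above. Window size, overlap, and prompts are illustrative assumptions, not a fixed recipe.

from anthropic import Anthropic

client = Anthropic()

def ask(text: str, instruction: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1000,
        messages=[{"role": "user", "content": f"{instruction}\n\n{text}"}],
    )
    return response.content[0].text

def sliding_window_summarize(document: str, window: int = 16000, overlap: int = 1600) -> str:
    # Overlapping character windows so ideas that span a boundary appear in both chunks
    step = window - overlap
    summaries = [ask(document[i:i + window], "Summarize this passage in about 100 words.")
                 for i in range(0, len(document), step)]
    # Merge neighbouring summaries in rounds until a single summary remains
    while len(summaries) > 1:
        summaries = [ask("\n\n".join(summaries[i:i + 2]),
                         "Merge these summaries into one, keeping every key point.")
                     for i in range(0, len(summaries), 2)]
    return summaries[0]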

Implementations

API Services
  • Gemini 1.5 Pro (Google, API): 1M context for very long inputs.
  • Claude 3.5 Sonnet 200K (Anthropic, API): High-quality long-context summaries.

Open Source
  • Llama 3.1 70B 128K (Llama 3.1 Community, Open Source): Open long-context option.

Benchmarks

Quick Facts
  • Input: Text
  • Output: Text
  • Implementations: 1 open source, 2 API
  • Patterns: 2 approaches

Have benchmark data?

Help us track the state of the art for long-context summarization.

Submit Results