Long-Context Summarization

Summarize 100K+ token inputs like transcripts, hearings, or books with structured outputs.

How Long-Context Summarization Works

When documents exceed your model's context window, you need a strategy: anything from chunking approaches to modern long-context models that handle 200K+ tokens in one pass.

1. The Problem: Documents Are Longer Than Context Windows

You have a 50-page report. You want a summary. But your model can only see 4,096 tokens at once. What do you do?

Your Document: 9,350 tokens
  • Executive Summary: 450
  • Market Analysis: 1,200
  • Product Development: 1,800
  • Financial Performance: 2,100
  • Customer Insights: 900
  • Strategic Initiatives: 1,500
  • Risk Assessment: 800
  • Q4 Outlook: 600

Model Context Windows
  • GPT-3.5 (4K): too small
  • GPT-4 (128K): fits easily
  • Claude (200K): fits with room

The Old Problem

With 4K-16K context windows, you had no choice but to chunk documents, summarize pieces separately, and somehow combine them. Information was inevitably lost.

The New Reality

Claude, GPT-4, and Gemini handle 128K-1M tokens. Most documents fit in one pass. Chunking strategies are now for edge cases, not the default.
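
Before choosing an approach, it helps to measure the document. A minimal sketch, assuming the tiktoken package as a rough proxy (Claude and Gemini tokenize differently, so treat the count as an estimate):

import tiktoken

def estimate_tokens(text: str) -> int:
    # cl100k_base is the GPT-4/GPT-3.5 encoding; counts for other models differ slightly
    return len(tiktoken.get_encoding("cl100k_base").encode(text))

with open("long_report.txt") as f:
    doc = f.read()

n = estimate_tokens(doc)
print(f"~{n} tokens, ~{n / 500:.0f} pages")  # ~500 tokens per page (see the estimates below)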

2. Summarization Strategies

Four approaches to summarizing long documents, from chunking-based methods to direct long-context processing.

Map-Reduce

Summarize each chunk, then summarize the summaries


Split the document into chunks that fit your model. Summarize each chunk independently (map phase). Then combine all chunk summaries and summarize again (reduce phase).

Flow
  1. Split: Doc -> [C1, C2, C3, C4]
  2. Map: [C1, C2, C3, C4] -> [S1, S2, S3, S4]
  3. Combine: [S1, S2, S3, S4] -> Combined
  4. Reduce: Combined -> Final Summary
Pros
  • + Parallelizable
  • + Works with any model
  • + Handles arbitrarily long docs
Cons
  • - Loses cross-chunk context
  • - Multiple API calls
  • - Can miss connections between sections
Best For

Very long documents (100K+ tokens) with independent sections
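
As a concrete sketch of that flow, here is a minimal Map-Reduce implementation using the same Anthropic client as the code example later on this page. The chunking helper and prompts are illustrative assumptions, not a library API; the map step could be parallelized with a thread pool.

from anthropic import Anthropic

client = Anthropic()
MODEL = "claude-sonnet-4-20250514"

def ask(text: str, instruction: str) -> str:
    response = client.messages.create(
        model=MODEL,
        max_tokens=1000,
        messages=[{"role": "user", "content": f"{instruction}\n\n{text}"}],
    )
    return response.content[0].text

def chunk_text(text: str, max_chars: int = 16000) -> list[str]:
    # Naive fixed-size split; see the chunking strategies section for better options
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def map_reduce_summarize(document: str) -> str:
    chunks = chunk_text(document)                                     # Split
    partials = [ask(c, "Summarize this section in about 100 words.")  # Map
                for c in chunks]
    combined = "\n\n".join(partials)                                  # Combine
    return ask(combined,
               "Merge these section summaries into a single 300-word summary.")  # Reduce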

Which Strategy Should You Use?

  • Default Choice: Long-Context Direct. If the doc fits in context, use this.
  • Very Long Docs: Map-Reduce. 100K+ tokens, parallel processing.
  • Narratives: Refine. Order matters, context flows.
  • Structured Docs: Hierarchical. Papers and reports with sections.
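
The decision in the table above can be encoded as a simple router. A sketch only; the thresholds and strategy names here are assumptions for illustration, not taken from any library:

import tiktoken

def pick_strategy(document: str, context_limit: int = 200_000,
                  narrative: bool = False, structured: bool = False) -> str:
    tokens = len(tiktoken.get_encoding("cl100k_base").encode(document))
    if tokens <= context_limit:
        return "long-context-direct"   # default: one pass, no chunking
    if narrative:
        return "refine"                # order matters, carry context forward
    if structured:
        return "hierarchical"          # summarize per section, then per document
    return "map-reduce"                # very long, independent sections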
3. Interactive: Watch Map-Reduce in Action

See how a long document is chunked, each chunk summarized, and summaries combined.

Document Chunks (Map Phase): Executive Summary (450 tokens), Market Analysis (1,200), Product Development (1,800), Financial Performance (2,100), Customer Insights (900), Strategic Initiatives (1,500), Risk Assessment (800), Q4 Outlook (600)

Phases: Split into Chunks -> Map: Summarize Each -> Combine Summaries -> Reduce: Final Summary -> Complete
The Key Insight
Map-Reduce trades completeness for scalability. Each chunk is summarized in isolation, so cross-chunk references can be lost. Use this when documents are truly too long for direct processing, or when you need parallel execution for speed.
4. Chunking Strategies: How to Split Documents

When you must chunk, how you split matters. The wrong boundaries can cut ideas in half.

Fixed Size

Split every N tokens, with optional overlap

chunks = TokenTextSplitter(chunk_size=4000, chunk_overlap=200).split_text(doc)
+Simple, predictable
-Cuts mid-sentence/thought
Recursive Character

Try to split at paragraphs, then sentences, then words

RecursiveCharacterTextSplitter(separators=["\n\n", "\n", ". ", " "])
+Preserves natural breaks
-Uneven chunk sizes
Semantic

Use embeddings to find natural topic boundaries

SemanticChunker(embeddings=OpenAIEmbeddings(), breakpoint_threshold_type="percentile")
+Coherent topics per chunk
-Slower, needs embedding model
Document Structure

Split by headers, sections, or document structure

MarkdownHeaderTextSplitter(headers_to_split_on=[("#", "H1"), ("##", "H2")])
+Preserves document hierarchy
-Needs structured input

Chunk Size Recommendations

  • Summarization: 2,000-4,000 tokens (enough context for a meaningful summary)
  • RAG Retrieval: 256-1,024 tokens (smaller for precise matching)
  • Overlap: 10-20% of chunk size (preserves context at boundaries)
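
Putting those recommendations together, a sketch using LangChain's recursive splitter. This assumes the langchain-text-splitters package; the from_tiktoken_encoder constructor measures chunk size in tokens rather than characters, matching the guidelines above.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=3000,     # inside the 2,000-4,000 token summarization range
    chunk_overlap=300,   # ~10% overlap to preserve context at boundaries
)

with open("long_report.txt") as f:
    chunks = splitter.split_text(f.read())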
5. Model Context Windows (2024-2025)

Know your model's limits. Modern long-context models have fundamentally changed the game.

  • GPT-3.5 Turbo (OpenAI): 16K tokens (~33 pages)
  • GPT-4 Turbo (OpenAI): 128K tokens (~256 pages)
  • GPT-4o (OpenAI): 128K tokens (~256 pages)
  • Claude 3 Sonnet (Anthropic): 200K tokens (~400 pages)
  • Claude 3.5 Sonnet (Anthropic): 200K tokens (~400 pages)
  • Gemini 1.5 Pro (Google): 1M tokens (~2,000 pages)
  • Gemini 1.5 Flash (Google): 1M tokens (~2,000 pages)
  • Llama 3.1 405B (Meta): 128K tokens (~256 pages)

Rule of thumb: ~500 tokens per page, so 128K tokens is roughly 250 pages (GPT-4 Turbo), 200K is roughly 400 pages (Claude 3.5), and 1M is roughly 2,000 pages (Gemini 1.5 Pro).

The Attention Dilution Problem

Just because a model can accept 200K tokens doesn't mean it attends equally to all of them. Research shows LLMs often "lose" information in the middle of long contexts. For critical summarization, consider placing the most important content at the beginning or end.
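
One cheap mitigation is to structure the prompt so the instruction and the most critical material sit at the edges of the context. A minimal sketch; key_sections and body are hypothetical variables for illustration:

def build_summary_prompt(key_sections: str, body: str, task: str) -> str:
    # Instruction and critical content up front, bulk in the middle,
    # and the instruction repeated at the end where attention is strongest
    return (
        f"{task}\n\n"
        f"Most important sections:\n{key_sections}\n\n"
        f"Full document:\n{body}\n\n"
        f"Reminder: {task}"
    )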

6. Code Examples

From LangChain's built-in chains to direct long-context API calls.

Long-Context (Claude)
Recommended
from anthropic import Anthropic

client = Anthropic()

def summarize_long_document(document: str, max_summary_words: int = 500) -> str:
    """
    Summarize a long document using Claude's 200K context window.
    No chunking needed for most documents.
    """
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # or claude-3-5-sonnet
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": f"""Summarize the following document in approximately
{max_summary_words} words. Focus on:
1. Key findings and conclusions
2. Important data points and metrics
3. Action items or recommendations

Document:
{document}

Summary:"""
        }]
    )
    return response.content[0].text

# For very long documents (100K+ tokens), still simple:
with open("long_report.txt") as f:
    full_document = f.read()

summary = summarize_long_document(full_document)
Recommended: Start Here

Use direct long-context calls with Claude or GPT-4 Turbo for most documents. Only reach for chunking strategies when documents exceed 200K tokens. (Example: Long-Context (Claude), above.)

For Very Long Documents

When you have documents longer than any model's context, use Map-Reduce for parallel processing or Hierarchical for structured content. (Patterns: Map-Reduce, Hierarchical.)

Quick Reference

Modern Approach
  • Use long-context models by default
  • Claude 200K / GPT-4 128K fits most docs
  • Gemini 1M for truly massive content
  • No chunking = no information loss
When You Must Chunk
  • Map-Reduce for parallel processing
  • Refine for sequential narratives
  • Hierarchical for structured docs
  • Use semantic chunking when possible
Best Practices
  • Check doc length before choosing strategy
  • Place key content at start/end
  • Use overlap to preserve context
  • Consider cost: long context = more tokens

Use Cases

  • Hour-long meetings
  • Earnings calls
  • Legal discovery
  • Book/episode recaps

Architectural Patterns

Sliding-Window + Merge

Chunk the document, then merge summaries hierarchically (see the sketch after this list).

Native Long-Context LLM

Directly ingest long sequences (1M+ tokens).
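
A sketch of the Sliding-Window + Merge pattern, reusing the Anthropic call style from the code examples above. Window size, overlap, and prompts are illustrative assumptions, not a fixed recipe.

from anthropic import Anthropic

client = Anthropic()

def ask(text: str, instruction: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1000,
        messages=[{"role": "user", "content": f"{instruction}\n\n{text}"}],
    )
    return response.content[0].text

def sliding_window_summarize(document: str, window: int = 16000, overlap: int = 1600) -> str:
    # Overlapping character windows so ideas that span a boundary appear in both chunks
    step = window - overlap
    summaries = [ask(document[i:i + window], "Summarize this passage in about 100 words.")
                 for i in range(0, len(document), step)]
    # Merge neighbouring summaries in rounds until a single summary remains
    while len(summaries) > 1:
        summaries = [ask("\n\n".join(summaries[i:i + 2]),
                         "Merge these summaries into one, keeping every key point.")
                     for i in range(0, len(summaries), 2)]
    return summaries[0]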

Implementations

API Services
  • Gemini 1.5 Pro (Google, API): 1M context for very long inputs.
  • Claude 3.5 Sonnet 200K (Anthropic, API): High-quality long-context summaries.

Open Source
  • Llama 3.1 70B 128K (Llama 3.1 Community, Open Source): Open long-context option.

Benchmarks

Quick Facts
  • Input: Text
  • Output: Text
  • Implementations: 1 open source, 2 API
  • Patterns: 2 approaches

Have benchmark data?

Help us track the state of the art for long-context summarization.

Submit Results