Tokens & Context Windows
How LLMs read text. Not characters, not words — tokens. Understanding this changes how you prompt, how you pay, and what you can build.
50 Years of Teaching Machines to Read
Before an LLM can reason about your text, it must first break it into pieces it can process. This problem — how to segment language into computable units — is older than deep learning itself. Every generation of NLP has reinvented the answer, each time getting closer to a universal segmentation that works across every language, every script, every domain.
Understanding this history isn't academic trivia. It explains why "ChatGPT can't count letters," why Japanese costs more tokens than English, why code-generation models need special vocabularies, and why a 128K context window doesn't mean 128K words.
Character-Level Processing
The earliest NLP systems processed text character by character. This is the simplest possible approach: ASCII gives you a fixed vocabulary of 128 symbols (or 256 for extended ASCII). Every character maps to a number. No ambiguity, no preprocessing.
# Character-level: simple but information-poor
"cat" → [99, 97, 116]    # ASCII codes
"dog" → [100, 111, 103]  # No semantic relationship preserved
# Problem: 99 ≠ 100 tells us nothing about cat vs dog
# Problem: "c" has no meaning without "a" and "t" after it
The fatal problem: characters carry almost no semantic information individually. The letter "c" means nothing by itself — its meaning depends entirely on the characters around it. This forces the model to do enormous work reconstructing word boundaries and meanings from sequences of arbitrary symbols. Character-level models need far more layers and training data to match word-level performance.
Word-Level Tokenization
The next approach: split on whitespace and punctuation. Each word gets its own ID. Gerard Salton's SMART system at Cornell (1971) and the entire tf-idf tradition treated words as atomic units. Every statistical NLP system from the 1970s through the mid-2010s — n-gram models, HMMs, CRFs, even early neural networks — used word-level vocabularies.
# Word-level: semantically rich but combinatorially explosive
"the cat sat" → [1, 42, 891]    # Each word = one ID
"unbelievable" → [28493]        # One token, but...
"unbelievably" → [39201]        # Different token! No shared info
"Transformerization" → [UNK]    # Not in vocabulary → unknown token
# Vocabulary must contain every word you'll ever see
Word-level tokenization has three devastating problems. First, vocabulary explosion: English has ~170,000 words in current use, but with inflections, compound words, proper nouns, and technical jargon, a large corpus easily produces 1M+ unique tokens. Each requires its own embedding vector — a 1M x 768 embedding table has 768 million parameters just for the first layer. Second, the OOV problem: any word not in the vocabulary becomes <UNK> (unknown), destroying information. Third, morphological blindness: "run," "runs," "running," and "runner" are treated as four completely unrelated symbols.
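The parameter arithmetic above is easy to verify directly. A quick sanity check (the vocabulary size and embedding dimension are illustrative, not from any specific model):

```python
# Embedding table size = vocab_size x embedding_dim
vocab_size = 1_000_000   # unique words in a large corpus
embed_dim = 768          # a common embedding width

word_level_params = vocab_size * embed_dim
print(f"{word_level_params:,}")   # 768,000,000 parameters in the first layer alone

# A 50K subword vocabulary shrinks this dramatically:
subword_params = 50_000 * embed_dim
print(f"{subword_params:,}")      # 38,400,000 -- a 20x reduction
```

This is one of the practical pressures that pushed the field toward subword vocabularies.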
Byte Pair Encoding (BPE)
In 1994, Philip Gage published a short article in C Users Journal describing a data compression algorithm that would, two decades later, become the foundation of how every major LLM reads text. The algorithm is elegant in its simplicity:
- Start with a vocabulary of individual bytes (256 symbols)
- Count all adjacent pairs of symbols in the training corpus
- Merge the most frequent pair into a new symbol
- Repeat until vocabulary reaches desired size (e.g., 50,000)
# BPE in action: learning merges from corpus
# Starting vocabulary: all individual characters
# Corpus: "low lower lowest low low lower"
# Iteration 1: most frequent pair is ('l', 'o') → merge into 'lo'
# Iteration 2: most frequent pair is ('lo', 'w') → merge into 'low'
# Iteration 3: most frequent pair is ('e', 'r') → merge into 'er'
# Iteration 4: most frequent pair is ('low', 'er') → merge into 'lower'
# ...continue until vocab_size reached
# Result: "unbelievable" → ["un", "believ", "able"]
# All three subwords reusable across the vocabulary!

— Gage, P. (1994). A New Algorithm for Data Compression. C Users Journal, 12(2), 23–38.
Gage's algorithm was designed for file compression, not NLP. It would take 21 years before anyone thought to apply it to machine translation.
Sennrich: BPE for Neural Machine Translation
Rico Sennrich, Barry Haddow, and Alexandra Birch at the University of Edinburgh had a breakthrough insight: BPE's compression algorithm could solve the open-vocabulary problem in neural machine translation. Instead of compressing bytes, apply it to characters in a text corpus. The resulting subword vocabulary sits in the sweet spot between characters and words — small enough to be tractable (32K–100K tokens), rich enough to preserve meaning, and capable of representing any string, even words never seen in training.
"The main motivation for subword units is the translation of rare and unseen words... BPE allows for the representation of an open vocabulary through a fixed-size vocabulary of variable-length character sequences, making it a very compact representation."
— Sennrich, R. et al. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL. 10,000+ citations.
This single paper changed everything. Within three years, every major NLP system adopted subword tokenization. GPT-1, GPT-2, GPT-3, RoBERTa — all use BPE. The idea that LLMs process "subword tokens" rather than words or characters traces directly to Sennrich et al.'s paper (circulated as a preprint in 2015, published at ACL in 2016).
SentencePiece: Language-Agnostic Tokenization
In 2018, Taku Kudo and John Richardson at Google released SentencePiece, solving a critical limitation of BPE: it assumed pre-tokenized (whitespace-split) input, which doesn't work for languages like Chinese, Japanese, and Thai that don't use spaces between words.
SentencePiece treats the input as a raw byte stream — no language-specific pre-processing required. It implements both BPE and a newer algorithm called Unigram Language Model (Kudo, 2018), which starts with a large vocabulary and prunes it down using a likelihood-based criterion rather than greedily merging pairs.
# SentencePiece: works on raw text, any language
import sentencepiece as spm

# Train a tokenizer from a raw corpus
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='my_tokenizer',
    vocab_size=32000,
    model_type='unigram'  # or 'bpe'
)

sp = spm.SentencePieceProcessor(model_file='my_tokenizer.model')

# Works on any language without pre-tokenization
sp.encode("Hello world", out_type=str)    # ['▁Hello', '▁world']
sp.encode("東京は晴れです", out_type=str)   # ['▁東京', 'は', '晴れ', 'です']
sp.encode("Привет мир", out_type=str)     # ['▁При', 'вет', '▁мир']
# The ▁ (U+2581) marks word boundaries

— Kudo, T. & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. EMNLP.
— Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. ACL.
SentencePiece became the tokenizer of choice for multilingual models. T5, LLaMA, Gemma, and Mistral all use SentencePiece. BERT used a related algorithm called WordPiece (Schuster & Nakajima, 2012), which is similar to BPE but selects merges by maximizing likelihood rather than frequency.
GPT-2 & Byte-Level BPE
OpenAI's GPT-2 (2019) introduced byte-level BPE: instead of starting from characters (which vary by encoding), start from the 256 raw bytes of UTF-8. This guarantees that any byte sequence can be tokenized — no <UNK> token needed, ever. Every OpenAI model since (GPT-3, GPT-4, GPT-5) uses this approach, implemented in the open-source tiktoken library.
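The guarantee is easy to see with plain Python. UTF-8 decomposes any string into bytes in the range 0–255, so a byte-level base vocabulary can always represent it — a sketch of the principle using only the standard library, not OpenAI's tokenizer itself:

```python
# Byte-level base alphabet: every string decomposes into bytes 0-255,
# so no input is ever "out of vocabulary" for a byte-level tokenizer.
for text in ["hello", "東京", "Привет", "🦙"]:
    raw = text.encode("utf-8")                # the 256-symbol base alphabet
    assert all(0 <= b <= 255 for b in raw)    # always representable
    assert raw.decode("utf-8") == text        # lossless round-trip
    print(f"{text!r} → {list(raw)}")
```

BPE merges learned on top of these bytes only make sequences shorter; they never lose the ability to fall back to raw bytes for unseen input.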
Every Major Model, Different Tokenizer
There is no universal tokenizer. Each model family trains its own vocabulary on its own data distribution, which means the same text produces different token counts across models. This has direct cost and context-window implications.
GPT-4 / GPT-5 (cl100k / o200k)
Byte-level BPE. 100K–200K vocab. OpenAI's tiktoken library.
Claude (Anthropic)
Custom BPE tokenizer. Not publicly documented in detail. ~100K vocab.
LLaMA / Mistral
SentencePiece (BPE mode). 32K vocab. Expanded to 128K in Llama 3+.
Gemini (Google)
SentencePiece (Unigram). 256K vocab. Optimized for 100+ languages.
The Context Window Explosion
The original Transformer (Vaswani et al., 2017) had a context window of 512 tokens. Self-attention scales as O(n²) with sequence length, making longer contexts prohibitively expensive. A decade of architectural innovations has expanded this by nearly four orders of magnitude:
Key innovations that enabled this growth: FlashAttention (Dao et al., 2022) reduced the memory footprint of attention from O(n²) to O(n) by restructuring the computation to be IO-aware. Rotary Position Embeddings (RoPE) (Su et al., 2021) allowed position information to extrapolate beyond training length. Ring Attention (Liu et al., 2023) distributed attention across multiple devices. Each breakthrough unlocked the next doubling of context size.
— Dao, T. et al. (2022). FlashAttention. NeurIPS.
— Su, J. et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv.
— Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS.
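The O(n²) cost is concrete: materializing a single attention score matrix grows quadratically with sequence length. A rough back-of-envelope sketch (fp16, one head, one layer; FlashAttention exists precisely to avoid materializing this matrix):

```python
def attn_matrix_bytes(seq_len: int, bytes_per_elem: int = 2) -> int:
    """Memory to materialize one n x n attention score matrix in fp16."""
    return seq_len * seq_len * bytes_per_elem

for n in [512, 4_096, 128_000]:
    gb = attn_matrix_bytes(n) / 1e9
    print(f"n={n:>7,}: {gb:,.4f} GB per head per layer")
# 512 tokens needs ~0.0005 GB; 128K tokens needs ~32.8 GB --
# which is why naive attention cannot scale to long contexts
```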
The throughline: 1960s → 2026
Five decades. One problem — how to segment text — refined relentlessly.
What is a Token?
When you send text to an LLM, it does not see characters or words. It sees tokens — subword units that were learned by running BPE (or a variant) on the model's training corpus. A token might be:
- A whole common word: "the" (merged early because it's frequent)
- Part of a word: "un" + "believ" + "able" (each piece is its own token)
- A single character: "X" (too rare to have been merged further)
- A number or punctuation mark: "123" or "!"
Each subword piece maps to a unique integer ID in the vocabulary. The model never sees the string "unbelievable"; it sees the sequence [359, 49146, 481].
Rule of Thumb
In English, 1 token is approximately 4 characters or 0.75 words. But this varies significantly by language and content type — a consequence of training data distribution, not a universal constant.
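That rule of thumb is often encoded as a quick pre-check before reaching for an exact tokenizer — a heuristic that only holds for English prose:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English text: ~4 characters per token.
    Use a real tokenizer (e.g., tiktoken) for billing-accurate counts."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("The quick brown fox jumps over the lazy dog."))  # ~11
```

For Japanese, code, or numbers-heavy text, this heuristic can be off by 2-3x, for exactly the reasons discussed below.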
Why this matters for cost
API providers charge per token, not per character or word. If your application serves Japanese users, the same semantic content costs 2-3x more than English because it requires 2-3x more tokens. Korean, Thai, and Arabic also suffer from this "tokenization tax." This is an active area of research — newer tokenizers with larger vocabularies (200K+) are narrowing the gap.
Tokenization in Action
Different tokenizers split text differently. Here is how GPT-4's tokenizer (cl100k_base) handles common inputs:
- " world" includes the leading space — BPE treats spaces as part of the token, not as separators.
- "Tokenization" splits into "Token" + "ization" — the BPE merge rules learned that "ization" is a common suffix worth its own token.
- Code uses more tokens per character than English prose — brackets, quotes, and operators are each 1 token despite being 1 character.
- Japanese costs more: the equivalent English ("Tokyo Tower is 333 meters tall") would be ~7 tokens, while the Japanese takes ~1.7x more tokens for the same information.
Same text, different tokenizers
"Artificial intelligence is transforming healthcare systems worldwide."
Each tokenizer learns different merge rules from different training data. Larger vocabularies (200K vs 100K) tend to produce fewer tokens for the same text, which means lower cost per API call.
How BPE Builds a Vocabulary
Understanding BPE mechanically — not just conceptually — explains most tokenization surprises. Here is the algorithm running on a tiny corpus:
import re, collections

def get_stats(vocab):
    """Count frequency of adjacent symbol pairs in the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge all occurrences of the most frequent pair."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    new_vocab = {}
    for word in vocab:
        new_word = pattern.sub(''.join(pair), word)
        new_vocab[new_word] = vocab[word]
    return new_vocab

# Training corpus (word frequencies)
vocab = {
    'l o w </w>': 5,        # "low" appears 5 times
    'l o w e r </w>': 2,    # "lower" appears 2 times
    'n e w e s t </w>': 6,  # "newest" appears 6 times
    'w i d e s t </w>': 3,  # "widest" appears 3 times
}

num_merges = 10
for i in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(f"Merge #{i+1}: {best[0]} + {best[1]} → {''.join(best)} (freq: {pairs[best]})")
# Output:
# Merge #1: e + s → es (freq: 9)
# Merge #2: es + t → est (freq: 9)
# Merge #3: est + </w> → est</w> (freq: 9)
# Merge #4: l + o → lo (freq: 7)
# Merge #5: lo + w → low (freq: 7)
# Merge #6: n + e → ne (freq: 6)
# Merge #7: ne + w → new (freq: 6)
# Merge #8: new + est</w> → newest</w> (freq: 6)
# ...
# Result: "newest" is a SINGLE token; with more merges, "lower"
# would split as "low" + "er" + "</w>"

Why LLMs struggle with character-level tasks
BPE explains why ChatGPT cannot reliably count the number of letters in a word or reverse a string. The model never sees individual characters — it sees tokens like "straw" and "berry", not s-t-r-a-w-b-e-r-r-y. To count letters, it would need to decompose tokens back into characters, which is not a natural operation in token-space. This is not a bug in the model; it is a fundamental consequence of the tokenization design choice.
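A toy illustration of this opacity (the vocabulary and token IDs here are hypothetical, chosen just to make the point):

```python
# The model receives opaque integer IDs, not characters.
vocab = {"straw": 4851, "berry": 15717}          # hypothetical subword vocab
ids = [vocab[t] for t in ("straw", "berry")]     # "strawberry" as the model sees it
print(ids)  # [4851, 15717] -- nothing here reveals how many "r"s the word has

# Counting letters requires detokenizing back to a string first:
id_to_token = {v: k for k, v in vocab.items()}
text = "".join(id_to_token[i] for i in ids)
print(text.count("r"))  # 3 -- trivial once you have characters, unnatural in token-space
```

Python can detokenize in one line; a transformer operating on embedding vectors of those IDs has no such direct path back to characters.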
Counting Tokens in Python
Use tiktoken for OpenAI models or the HuggingFace transformers tokenizer for open-source models. Each model family uses a specific encoding.
import tiktoken

# Get the encoding used by GPT-4 (cl100k_base)
enc = tiktoken.encoding_for_model("gpt-4")

# Encode text to tokens
text = "Hello, world!"
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Token count: {len(tokens)}")  # Output: 4
print(f"Token IDs: {tokens}")         # Output: [9906, 11, 1917, 0]

# Decode individual tokens to see the subwords
for tid in tokens:
    print(f"  Token {tid} → '{enc.decode([tid])}'")

# Decode tokens back to text (lossless round-trip)
decoded = enc.decode(tokens)
assert decoded == text  # Always true — BPE is lossless

# Compare encodings across model families
for model in ["gpt-4", "gpt-3.5-turbo"]:
    enc = tiktoken.encoding_for_model(model)
    count = len(enc.encode("Machine learning is transforming industries."))
    print(f"{model}: {count} tokens")

Common Encodings
| Encoding | Models | Vocab Size | Algorithm |
|---|---|---|---|
| cl100k_base | GPT-4, GPT-3.5-turbo, text-embedding-3-* | 100,256 | Byte-level BPE |
| o200k_base | GPT-5, o1, o3 | 200,000 | Byte-level BPE |
| p50k_base | Codex models, text-davinci-003 | 50,281 | Byte-level BPE |
The jump from 100K to 200K vocabulary in o200k_base was specifically designed to improve tokenization efficiency for non-English languages, reducing the "tokenization tax" for multilingual applications.
Understanding Context Windows
The context window is the maximum number of tokens a model can process in a single request. This includes both your input (prompt) AND the model's output (completion). It is the fundamental constraint that shapes every LLM application.
If your prompt is 3,000 tokens and the model has a 4K context window, you only have ~1,000 tokens left for the response. Exceed the window, and the model either truncates your input or refuses the request.
Context Window = Input + Output
In a long conversation, the history grows until it crowds out space for the response. Chat applications must truncate or summarize older messages to stay within the window.
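A minimal version of that truncation logic, using a crude 4-characters-per-token heuristic as a stand-in for a real tokenizer (swap in tiktoken for production):

```python
def truncate_history(messages: list[dict], budget_tokens: int) -> list[dict]:
    """Keep the system message plus the most recent messages that fit the budget."""
    est = lambda m: max(1, len(m["content"]) // 4)   # crude token estimate
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    kept, used = [], sum(est(m) for m in system)
    for msg in reversed(rest):            # walk newest-first
        cost = est(msg)
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    return system + list(reversed(kept))  # restore chronological order
```

Real chat applications often add a rolling summary of the dropped messages instead of discarding them outright.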
| Model | Context Window | Approx. Pages | Use Case |
|---|---|---|---|
| GPT-5 | 256K tokens | ~384 pages | Flagship, long documents |
| GPT-4o-mini | 128K tokens | ~192 pages | Cost-effective general use |
| Claude Opus 4.6 | 200K / 1M tokens | ~300 / ~1500 pages | Coding, analysis, long context |
| Claude Sonnet 4.6 | 200K tokens | ~300 pages | Best value for most tasks |
| Gemini 3 Pro | 1M / 2M tokens | ~1500 / ~3000 pages | Massive context tasks |
| Llama 4 (17B/405B) | 128K tokens | ~192 pages | Self-hosted, private |
Why Context Size Matters
Context length determines what problems your LLM can solve. But bigger is not always better — there are real trade-offs that practitioners must navigate.
More Context = More Information
With 128K tokens, you can include entire codebases, long documents, or extensive conversation history. The model sees everything at once — no chunking, no retrieval pipeline, no information loss.
RAG vs Long Context
Long context can replace RAG for some use cases. Instead of building a retrieval pipeline to find relevant chunks, just put the whole document in the prompt. Simpler architecture, fewer failure modes.
Attention Degradation
Models may lose focus in very long contexts. The "lost in the middle" problem: information at the start and end is recalled better than information in the middle.
Cost Scales Linearly
More tokens = higher cost. A 100K token request costs 10x more than a 10K token request. And latency scales too — first-token time increases with context length because the model must process all input tokens before generating the first output token.
Token Pricing (2026)
API providers charge per token, typically quoted per 1 million tokens. Output tokens are always more expensive than input — typically 2-5x more — because generation is sequential (autoregressive) while input processing is parallelizable.
| Model | Provider | Input ($/1M) | Output ($/1M) | Context | Output Ratio |
|---|---|---|---|---|---|
| GPT-5 | OpenAI | $2.00 | $8.00 | 256K | 4x |
| GPT-4o-mini | OpenAI | $0.15 | $0.60 | 128K | 4x |
| o3 | OpenAI | $10.00 | $40.00 | 200K | 4x |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | 200K | 5x |
| Claude Haiku 4.5 | Anthropic | $0.80 | $4.00 | 200K | 5x |
| Gemini 3 Pro | Google | $1.25 | $5.00 | 2M | 4x |
| Gemini 3 Flash | Google | $0.075 | $0.30 | 1M | 4x |
| Llama 4 405B | Meta / Together | $0.80 | $0.80 | 128K | 1x |
| DeepSeek V3 | DeepSeek | $0.27 | $1.10 | 128K | 4x |
Prices as of early 2026. For open-weight models (Llama, DeepSeek), prices are for hosted providers such as Together AI and vary by host. Self-hosting can be cheaper at scale.
Pareto Frontier: Quality vs Cost
Not every expensive model is worth the price. The Pareto frontier shows models where no other model is both cheaper and better. Everything on or near this line is an efficient choice — everything far below it is overpriced for its quality.
Reading the Frontier
The dashed line connects models on the Pareto frontier — the optimal set where no other model offers both lower cost and higher quality. Hover over any dot to see details.
Context Window vs Cost
Larger context windows let you process more information in one call — but does bigger always mean more expensive? The relationship is not linear. Some providers offer enormous context windows at surprisingly low prices.
Gemini 3 Pro: the largest context window available. At $1.25/1M tokens, you can process an entire book for under $3.
Gemini 3 Flash: 1M context at $0.075/1M tokens. Process a 750K-token document for just $0.056. The best context-per-dollar ratio.
Claude Haiku 4.5: 200K context at $0.80/1M. The sweet spot for complex reasoning tasks that need substantial context.
The Real Cost Equation
Many developers only look at input price. That is a mistake. The actual formula has two parts, and the output side often dominates.
Why output tokens cost more
Reading input is parallelizable — the model processes all tokens at once via matrix multiplication. Generating output is sequential — each token depends on the previous one (autoregressive decoding). This makes output 2-5x more compute-intensive, and pricing reflects that.
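The two-part formula is simple enough to encode directly. Prices per 1M tokens below are taken from the table above; treat them as illustrative:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in dollars; prices are quoted per 1M tokens."""
    return (input_tokens / 1e6) * input_price + (output_tokens / 1e6) * output_price

# GPT-5 at $2.00 in / $8.00 out: 10K tokens in, 1K tokens out
print(round(request_cost(10_000, 1_000, 2.00, 8.00), 4))   # 0.028
# Same request on GPT-4o-mini at $0.15 / $0.60
print(round(request_cost(10_000, 1_000, 0.15, 0.60), 4))   # 0.0021
```

Run it with your own input/output ratios before choosing a model; the ranking can flip between summarization-shaped and generation-shaped workloads.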
Concrete Examples
| Model | Input Cost | Output Cost | Total |
|---|---|---|---|
| GPT-5 | $0.094 | $0.005 | $0.099 |
| GPT-4o-mini | $0.006 | $0.0003 | $0.006 |
| Claude Sonnet 4.6 | $0.113 | $0.008 | $0.120 |
| Gemini 3 Flash | $0.003 | $0.0002 | $0.003 |
For summarization (high input, low output), input price dominates. GPT-4o-mini is 16x cheaper than GPT-5.
| Model | Input Cost | Output Cost | Total |
|---|---|---|---|
| GPT-5 | $0.0005 | $0.027 | $0.028 |
| GPT-4o-mini | $0.00003 | $0.002 | $0.002 |
| Claude Sonnet 4.6 | $0.0006 | $0.041 | $0.041 |
For generation (low input, high output), output price dominates. Claude Sonnet 4.6 costs 1.5x more than GPT-5 here because of its 5x output multiplier.
When to use expensive vs cheap models
Use cheap models for:

- Text classification and sentiment
- Data extraction from structured docs
- Simple summarization
- Embedding generation
- First-pass filtering in pipelines

Use expensive models for:

- Complex multi-step reasoning
- Code generation and debugging
- Nuanced writing and editing
- Math and logic problems
- Tasks where errors are costly
Cost Optimization Strategies
Beyond picking the right model, there are structural techniques that can cut your LLM costs by 50-90%. The best production systems use several of these together.
Prompt Caching
Anthropic, Google, and OpenAI all offer prompt caching: repeated prefixes of your prompt are cached and billed at a steep discount (cache reads cost 10% of the normal input price on Anthropic; discounts vary by provider). If your system prompt is 4,000 tokens and you send 1,000 queries, you pay full price once and up to 90% less for the other 999.
Available on Claude Sonnet 4.6 and Haiku 4.5. Google also offers context caching on Gemini. OpenAI supports it on GPT-5 and o3.
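The 4,000-token system prompt example works out like this, assuming cache reads at 10% of the input price, a hit on every call after the first, and ignoring any one-time cache-write surcharge (the $3.00/1M input price is illustrative):

```python
prompt_tokens = 4_000
calls = 1_000
input_price = 3.00 / 1e6   # dollars per token at $3.00/1M input

uncached = calls * prompt_tokens * input_price
cached = (prompt_tokens * input_price                       # first call, full price
          + (calls - 1) * prompt_tokens * input_price * 0.10)  # cache reads at 10%

print(f"uncached: ${uncached:.2f}")              # $12.00
print(f"cached:   ${cached:.2f}")                # $1.21
print(f"savings:  {1 - cached / uncached:.0%}")  # 90%
```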
Batch API
OpenAI's Batch API processes requests asynchronously within a 24-hour window at 50% off. Perfect for non-real-time tasks like content moderation, data labeling, or nightly report generation.
Model Routing
Route easy queries to cheap models and hard queries to expensive ones. A simple classifier (or even the cheap model itself) decides the complexity. In practice, 70-80% of queries can be handled by the small model.
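A routing layer can be as simple as a heuristic gate in front of two clients. A sketch — the classification rule and model names here are placeholders, and production routers typically use a small trained classifier instead:

```python
def looks_hard(query: str) -> bool:
    """Crude complexity gate: long queries or reasoning keywords go upmarket."""
    signals = ["prove", "debug", "refactor", "step by step", "why"]
    return len(query) > 500 or any(s in query.lower() for s in signals)

def route(query: str) -> str:
    return "expensive-model" if looks_hard(query) else "cheap-model"

print(route("What's the capital of France?"))           # cheap-model
print(route("Debug this race condition step by step"))  # expensive-model
```

A common refinement is to let the cheap model answer first and escalate only when it reports low confidence.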
Context Compression
Reduce input tokens without losing information. Techniques include:

- Summarize-then-query: use a cheap model to summarize long documents, then query the summary with an expensive model. Cuts input by 80-90%.
- Sliding window: for conversations, keep only the last N messages plus a rolling summary. Prevents context from growing unbounded.
- Retrieval: retrieve only relevant chunks instead of dumping entire documents into context. 10 relevant paragraphs beat 100 pages.
- Markup stripping: remove HTML, markdown, and whitespace from documents before sending. Can reduce token count by 20-40% on web content.
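The markup-stripping step is a few lines with the standard library — a rough cleaner; real pipelines often use a proper HTML parser (e.g., BeautifulSoup) instead of regexes:

```python
import re

def compress_text(html: str) -> str:
    """Strip tags and collapse whitespace to cut token count on web content."""
    text = re.sub(r"<[^>]+>", " ", html)   # drop HTML tags
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text.strip()

page = "<div>\n  <h1>Title</h1>\n  <p>Some   body   text.</p>\n</div>"
print(compress_text(page))  # Title Some body text.
```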
Stacking these strategies
Combine routing (70% savings) + caching (90% savings on cached portion) + batch API (50% savings on async tasks) and you can realistically reduce costs by 80-95% compared to sending everything to the most expensive model in real-time.
Practical Tips
1. Count Before You Send
Always count tokens before making API calls, especially with user-provided content. A single function can save you from truncated responses and unexpected bills.
import tiktoken

def check_token_limit(text: str, model: str = "gpt-4", max_tokens: int = 8000) -> dict:
    enc = tiktoken.encoding_for_model(model)
    token_count = len(enc.encode(text))
    return {
        "count": token_count,
        "within_limit": token_count <= max_tokens,
        "remaining": max_tokens - token_count,
    }

2. Set max_tokens for Outputs
Always set max_tokens in your API calls to prevent runaway costs and ensure you stay within context limits. Without it, the model may generate thousands of tokens when you only need a short answer — and you pay for every one.
3. Truncate Strategically
When content exceeds limits, truncate intelligently — keep the most relevant parts. For conversations, keep the system message + first message (task definition) + recent messages. For documents, keep the first and last sections (where key information tends to cluster).
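The first-and-last-sections heuristic for documents, sketched with a character budget (substitute real token counts for precision; the marker string is an arbitrary choice):

```python
def keep_head_tail(text: str, max_chars: int,
                   marker: str = "\n[...truncated...]\n") -> str:
    """Keep the start and end of a document, dropping the middle."""
    if len(text) <= max_chars:
        return text
    half = (max_chars - len(marker)) // 2
    return text[:half] + marker + text[-half:]

doc = "A" * 500 + "B" * 500 + "C" * 500
out = keep_head_tail(doc, 200)
print(len(out))             # fits within the 200-char budget
print(out[:5], out[-5:])    # AAAAA CCCCC
```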
4. Watch for Tokenization Surprises
Common pitfalls that waste tokens or produce unexpected behavior:
- Numbers: "123456789" may become 3+ tokens. Format large numbers carefully.
- Whitespace: extra newlines and indentation in prompts all cost tokens. Minify verbose prompts.
- Non-English text: CJK characters take 2-3x more tokens; Arabic and Cyrillic ~1.5x. Budget accordingly.
- Code: brackets, operators, and indentation mean code uses 2-3x more tokens per semantic unit than prose.
Tools and Resources
OpenAI Tokenizer
Official web tool to visualize how text gets tokenized. Paste any text and see the exact token boundaries. Essential for debugging.
platform.openai.com/tokenizer

tiktoken Library
Fast Python library for tokenizing text for OpenAI models. Production-grade, written in Rust for speed.
pip install tiktoken

Claude Tokenization
Anthropic's documentation on Claude tokenization. Different tokenizer than OpenAI — use their API to count tokens accurately.
docs.anthropic.com

HuggingFace Tokenizers
Deep dive into tokenizer algorithms: BPE, WordPiece, Unigram, and SentencePiece. The definitive reference.
huggingface.co/docs

Key Takeaways
1. Tokens are subwords, not characters or words — LLMs see text as token sequences learned via BPE. Common words are one token; rare words split into pieces. This is why LLMs struggle with letter counting and string reversal.
2. Context window = input + output — Budget your tokens. A 128K context does not mean 128K input if you need a long response.
3. Output tokens cost 2-5x more than input — For generation-heavy tasks, the output price dominates. Factor this into model selection.
4. Use the Pareto frontier to choose models — GPT-4o-mini, Claude Haiku 4.5, and Gemini Flash sit on the efficiency frontier. Models below the line are overpriced.
5. Stack optimization strategies — Caching + routing + batching can reduce costs 80-95%. The best systems never send everything to the most expensive model.