Tokens & Context Windows
How LLMs read text. Not characters, not words — tokens. Understanding this changes how you prompt, how you pay, and what you can build.
50 Years of Teaching Machines to Read
Before an LLM can reason about your text, it must first break it into pieces it can process. This problem — how to segment language into computable units — is older than deep learning itself. Every generation of NLP has reinvented the answer, each time getting closer to a universal segmentation that works across every language, every script, every domain.
Understanding this history isn't academic trivia. It explains why "ChatGPT can't count letters," why Japanese costs more tokens than English, why code-generation models need special vocabularies, and why a 128K context window doesn't mean 128K words.
Character-Level Processing
The earliest NLP systems processed text character by character. This is the simplest possible approach: ASCII gives you a fixed vocabulary of 128 symbols (or 256 for extended ASCII). Every character maps to a number. No ambiguity, no preprocessing.
# Character-level: simple but information-poor
"cat" → [99, 97, 116]    # ASCII codes
"dog" → [100, 111, 103]  # No semantic relationship preserved
# Problem: 99 ≠ 100 tells us nothing about cat vs dog
# Problem: "c" has no meaning without "a" and "t" after it
The fatal problem: characters carry almost no semantic information individually. The letter "c" means nothing by itself — its meaning depends entirely on the characters around it. This forces the model to do enormous work reconstructing word boundaries and meanings from sequences of arbitrary symbols. Character-level models need far more layers and training data to match word-level performance.
Word-Level Tokenization
The next approach: split on whitespace and punctuation. Each word gets its own ID. Gerard Salton's SMART system at Cornell (1971) and the entire tf-idf tradition treated words as atomic units. Every statistical NLP system from the 1970s through the mid-2010s — n-gram models, HMMs, CRFs, even early neural networks — used word-level vocabularies.
# Word-level: semantically rich but combinatorially explosive
"the cat sat" → [1, 42, 891]    # Each word = one ID
"unbelievable" → [28493]        # One token, but...
"unbelievably" → [39201]        # Different token! No shared info
"Transformerization" → [UNK]    # Not in vocabulary → unknown token
# Vocabulary must contain every word you'll ever see
Word-level tokenization has three devastating problems. First, vocabulary explosion: English has ~170,000 words in current use, but with inflections, compound words, proper nouns, and technical jargon, a large corpus easily produces 1M+ unique tokens. Each requires its own embedding vector — a 1M x 768 embedding table has 768 million parameters just for the first layer. Second, the OOV problem: any word not in the vocabulary becomes <UNK> (unknown), destroying information. Third, morphological blindness: "run," "runs," "running," and "runner" are treated as four completely unrelated symbols.
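The parameter arithmetic above is easy to verify directly. A quick sanity check (the vocabulary size and embedding dimension are illustrative, not from any specific model):

```python
# Embedding table size = vocab_size x embedding_dim
vocab_size = 1_000_000   # unique words in a large corpus
embed_dim = 768          # a common embedding width

word_level_params = vocab_size * embed_dim
print(f"{word_level_params:,}")   # 768,000,000 parameters in the first layer alone

# A 50K subword vocabulary shrinks this dramatically:
subword_params = 50_000 * embed_dim
print(f"{subword_params:,}")      # 38,400,000 -- a 20x reduction
```

This is one of the practical pressures that pushed the field toward subword vocabularies.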
Byte Pair Encoding (BPE)
In 1994, Philip Gage published a short article in C Users Journal describing a data compression algorithm that would, two decades later, become the foundation of how every major LLM reads text. The algorithm is elegant in its simplicity:
- Start with a vocabulary of individual bytes (256 symbols)
- Count all adjacent pairs of symbols in the training corpus
- Merge the most frequent pair into a new symbol
- Repeat until vocabulary reaches desired size (e.g., 50,000)
# BPE in action: learning merges from corpus
# Starting vocabulary: all individual characters
# Corpus: "low lower lowest low low lower"
# Iteration 1: most frequent pair is ('l', 'o') → merge into 'lo'
# Iteration 2: most frequent pair is ('lo', 'w') → merge into 'low'
# Iteration 3: most frequent pair is ('e', 'r') → merge into 'er'
# Iteration 4: most frequent pair is ('low', 'er') → merge into 'lower'
# ...continue until vocab_size reached
# Result: "unbelievable" → ["un", "believ", "able"]
# All three subwords reusable across the vocabulary!

— Gage, P. (1994). A New Algorithm for Data Compression. C Users Journal, 12(2), 23–38.
Gage's algorithm was designed for file compression, not NLP. It would take 21 years before anyone thought to apply it to machine translation.
Sennrich: BPE for Neural Machine Translation
Rico Sennrich, Barry Haddow, and Alexandra Birch at the University of Edinburgh had a breakthrough insight: BPE's compression algorithm could solve the open-vocabulary problem in neural machine translation. Instead of compressing bytes, apply it to characters in a text corpus. The resulting subword vocabulary sits in the sweet spot between characters and words — small enough to be tractable (32K–100K tokens), rich enough to preserve meaning, and capable of representing any string, even words never seen in training.
"The main motivation for subword units is the translation of rare and unseen words... BPE allows for the representation of an open vocabulary through a fixed-size vocabulary of variable-length character sequences, making it a very compact representation."
— Sennrich, R. et al. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL. 10,000+ citations.
This single paper changed everything. Within three years, every major NLP system adopted subword tokenization. GPT-1, GPT-2, GPT-3, RoBERTa — all use BPE. The idea that LLMs process "subword tokens" rather than words or characters traces directly to Sennrich et al.'s paper (circulated as a preprint in 2015, published at ACL in 2016).
SentencePiece: Language-Agnostic Tokenization
In 2018, Taku Kudo and John Richardson at Google released SentencePiece, solving a critical limitation of BPE: it assumed pre-tokenized (whitespace-split) input, which doesn't work for languages like Chinese, Japanese, and Thai that don't use spaces between words.
SentencePiece treats the input as a raw byte stream — no language-specific pre-processing required. It implements both BPE and a newer algorithm called Unigram Language Model (Kudo, 2018), which starts with a large vocabulary and prunes it down using a likelihood-based criterion rather than greedily merging pairs.
# SentencePiece: works on raw text, any language
import sentencepiece as spm

# Train a tokenizer from a raw corpus
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='my_tokenizer',
    vocab_size=32000,
    model_type='unigram'  # or 'bpe'
)

sp = spm.SentencePieceProcessor(model_file='my_tokenizer.model')

# Works on any language without pre-tokenization
sp.encode("Hello world", out_type=str)    # ['▁Hello', '▁world']
sp.encode("東京は晴れです", out_type=str)   # ['▁東京', 'は', '晴れ', 'です']
sp.encode("Привет мир", out_type=str)     # ['▁При', 'вет', '▁мир']
# The ▁ (U+2581) marks word boundaries

— Kudo, T. & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. EMNLP.
— Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. ACL.
SentencePiece became the tokenizer of choice for multilingual models. T5, LLaMA, Gemma, and Mistral all use SentencePiece. BERT used a related algorithm called WordPiece (Schuster & Nakajima, 2012), which is similar to BPE but selects merges by maximizing likelihood rather than frequency.
GPT-2 & Byte-Level BPE
OpenAI's GPT-2 (2019) introduced byte-level BPE: instead of starting from characters (which vary by encoding), start from the 256 raw bytes of UTF-8. This guarantees that any byte sequence can be tokenized — no <UNK> token needed, ever. Every OpenAI model since (GPT-3, GPT-4, GPT-5) uses this approach, implemented in the open-source tiktoken library.
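The guarantee is easy to see with plain Python. UTF-8 decomposes any string into bytes in the range 0–255, so a byte-level base vocabulary can always represent it — a sketch of the principle using only the standard library, not OpenAI's tokenizer itself:

```python
# Byte-level base alphabet: every string decomposes into bytes 0-255,
# so no input is ever "out of vocabulary" for a byte-level tokenizer.
for text in ["hello", "東京", "Привет", "🦙"]:
    raw = text.encode("utf-8")                # the 256-symbol base alphabet
    assert all(0 <= b <= 255 for b in raw)    # always representable
    assert raw.decode("utf-8") == text        # lossless round-trip
    print(f"{text!r} → {list(raw)}")
```

BPE merges learned on top of these bytes only make sequences shorter; they never lose the ability to fall back to raw bytes for unseen input.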
Every Major Model, Different Tokenizer
There is no universal tokenizer. Each model family trains its own vocabulary on its own data distribution, which means the same text produces different token counts across models. This has direct cost and context-window implications.
GPT-4 / GPT-5 (cl100k / o200k)
Byte-level BPE. 100K–200K vocab. OpenAI's tiktoken library.
Claude (Anthropic)
Custom BPE tokenizer. Not publicly documented in detail. ~100K vocab.
LLaMA / Mistral
SentencePiece (BPE mode). 32K vocab. Expanded to 128K in Llama 3+.
Gemini (Google)
SentencePiece (Unigram). 256K vocab. Optimized for 100+ languages.
The Context Window Explosion
The original Transformer (Vaswani et al., 2017) had a context window of 512 tokens. Self-attention scales as O(n²) with sequence length, making longer contexts prohibitively expensive. A decade of architectural innovations has expanded this by nearly four orders of magnitude:
Key innovations that enabled this growth: FlashAttention (Dao et al., 2022) reduced the memory footprint of attention from O(n²) to O(n) by restructuring the computation to be IO-aware. Rotary Position Embeddings (RoPE) (Su et al., 2021) allowed position information to extrapolate beyond training length. Ring Attention (Liu et al., 2023) distributed attention across multiple devices. Each breakthrough unlocked the next doubling of context size.
— Dao, T. et al. (2022). FlashAttention. NeurIPS.
— Su, J. et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv.
— Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS.
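The O(n²) cost is concrete: materializing a single attention score matrix grows quadratically with sequence length. A rough back-of-envelope sketch (fp16, one head, one layer; FlashAttention exists precisely to avoid materializing this matrix):

```python
def attn_matrix_bytes(seq_len: int, bytes_per_elem: int = 2) -> int:
    """Memory to materialize one n x n attention score matrix in fp16."""
    return seq_len * seq_len * bytes_per_elem

for n in [512, 4_096, 128_000]:
    gb = attn_matrix_bytes(n) / 1e9
    print(f"n={n:>7,}: {gb:,.4f} GB per head per layer")
# 512 tokens needs ~0.0005 GB; 128K tokens needs ~32.8 GB --
# which is why naive attention cannot scale to long contexts
```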
The throughline: 1960s → 2026
Five decades. One problem — how to segment text — refined relentlessly.
What is a Token?
When you send text to an LLM, it does not see characters or words. It sees tokens — subword units that were learned by running BPE (or a variant) on the model's training corpus. A token might be:
- A whole common word: "the" (merged early because it's frequent)
- Part of a word: "un" + "believ" + "able" (each piece is its own token)
- A single character: "X" (too rare to have been merged further)
- A number or punctuation mark: "123" or "!"
Each subword piece maps to a unique integer ID in the vocabulary. The model never sees the string "unbelievable"; it sees the sequence [359, 49146, 481].
Rule of Thumb
In English, 1 token is approximately 4 characters or 0.75 words. But this varies significantly by language and content type — a consequence of training data distribution, not a universal constant.
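That rule of thumb is often encoded as a quick pre-check before reaching for an exact tokenizer — a heuristic that only holds for English prose:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English text: ~4 characters per token.
    Use a real tokenizer (e.g., tiktoken) for billing-accurate counts."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("The quick brown fox jumps over the lazy dog."))  # ~11
```

For Japanese, code, or numbers-heavy text, this heuristic can be off by 2-3x, for exactly the reasons discussed below.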
Why this matters for cost
API providers charge per token, not per character or word. If your application serves Japanese users, the same semantic content costs 2-3x more than English because it requires 2-3x more tokens. Korean, Thai, and Arabic also suffer from this "tokenization tax." This is an active area of research — newer tokenizers with larger vocabularies (200K+) are narrowing the gap.
Tokenization in Action
Different tokenizers split text differently. Here is how GPT-4's tokenizer (cl100k_base) handles common inputs:
- " world" includes the leading space — BPE treats spaces as part of the token, not as separators.
- "Tokenization" splits into "Token" + "ization" — the BPE merge rules learned that "ization" is a common suffix worth its own token.
- Code uses more tokens per character than English prose — brackets, quotes, and operators are each 1 token despite being 1 character.
- Japanese costs more: the equivalent English ("Tokyo Tower is 333 meters tall") would be ~7 tokens, while the Japanese takes ~1.7x more tokens for the same information.
Same text, different tokenizers
"Artificial intelligence is transforming healthcare systems worldwide."
Each tokenizer learns different merge rules from different training data. Larger vocabularies (200K vs 100K) tend to produce fewer tokens for the same text, which means lower cost per API call.
How BPE Builds a Vocabulary
Understanding BPE mechanically — not just conceptually — explains most tokenization surprises. Here is the algorithm running on a tiny corpus:
import re, collections

def get_stats(vocab):
    """Count frequency of adjacent symbol pairs in the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge all occurrences of the most frequent pair."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    new_vocab = {}
    for word in vocab:
        new_word = pattern.sub(''.join(pair), word)
        new_vocab[new_word] = vocab[word]
    return new_vocab

# Training corpus (word frequencies)
vocab = {
    'l o w </w>': 5,        # "low" appears 5 times
    'l o w e r </w>': 2,    # "lower" appears 2 times
    'n e w e s t </w>': 6,  # "newest" appears 6 times
    'w i d e s t </w>': 3,  # "widest" appears 3 times
}

num_merges = 10
for i in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(f"Merge #{i+1}: {best[0]} + {best[1]} → {''.join(best)} (freq: {pairs[best]})")
# Output:
# Merge #1: e + s → es (freq: 9)
# Merge #2: es + t → est (freq: 9)
# Merge #3: est + </w> → est</w> (freq: 9)
# Merge #4: l + o → lo (freq: 7)
# Merge #5: lo + w → low (freq: 7)
# Merge #6: n + e → ne (freq: 6)
# Merge #7: ne + w → new (freq: 6)
# Merge #8: new + est</w> → newest</w> (freq: 6)
# ...
# Result: "newest" is a SINGLE token; with more merges, "lower"
# would split as "low" + "er" + "</w>"

Why LLMs struggle with character-level tasks
BPE explains why ChatGPT cannot reliably count the number of letters in a word or reverse a string. The model never sees individual characters — it sees tokens like "straw" and "berry", not s-t-r-a-w-b-e-r-r-y. To count letters, it would need to decompose tokens back into characters, which is not a natural operation in token-space. This is not a bug in the model; it is a fundamental consequence of the tokenization design choice.
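A toy illustration of this opacity (the vocabulary and token IDs here are hypothetical, chosen just to make the point):

```python
# The model receives opaque integer IDs, not characters.
vocab = {"straw": 4851, "berry": 15717}          # hypothetical subword vocab
ids = [vocab[t] for t in ("straw", "berry")]     # "strawberry" as the model sees it
print(ids)  # [4851, 15717] -- nothing here reveals how many "r"s the word has

# Counting letters requires detokenizing back to a string first:
id_to_token = {v: k for k, v in vocab.items()}
text = "".join(id_to_token[i] for i in ids)
print(text.count("r"))  # 3 -- trivial once you have characters, unnatural in token-space
```

Python can detokenize in one line; a transformer operating on embedding vectors of those IDs has no such direct path back to characters.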
Counting Tokens in Python
Use tiktoken for OpenAI models or the HuggingFace transformers tokenizer for open-source models. Each model family uses a specific encoding.
import tiktoken

# Get the encoding used by GPT-4 (cl100k_base)
enc = tiktoken.encoding_for_model("gpt-4")

# Encode text to tokens
text = "Hello, world!"
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Token count: {len(tokens)}")  # Output: 4
print(f"Token IDs: {tokens}")         # Output: [9906, 11, 1917, 0]

# Decode individual tokens to see the subwords
for tid in tokens:
    print(f"  Token {tid} → '{enc.decode([tid])}'")

# Decode tokens back to text (lossless round-trip)
decoded = enc.decode(tokens)
assert decoded == text  # Always true — BPE is lossless

# Compare encodings across model families
for model in ["gpt-4", "gpt-3.5-turbo"]:
    enc = tiktoken.encoding_for_model(model)
    count = len(enc.encode("Machine learning is transforming industries."))
    print(f"{model}: {count} tokens")

Common Encodings
| Encoding | Models | Vocab Size | Algorithm |
|---|---|---|---|
| cl100k_base | GPT-4, GPT-3.5-turbo, text-embedding-3-* | 100,256 | Byte-level BPE |
| o200k_base | GPT-5, o1, o3 | 200,000 | Byte-level BPE |
| p50k_base | Codex models, text-davinci-003 | 50,281 | Byte-level BPE |
The jump from 100K to 200K vocabulary in o200k_base was specifically designed to improve tokenization efficiency for non-English languages, reducing the "tokenization tax" for multilingual applications.
Understanding Context Windows
The context window is the maximum number of tokens a model can process in a single request. This includes both your input (prompt) AND the model's output (completion). It is the fundamental constraint that shapes every LLM application.
If your prompt is 3,000 tokens and the model has a 4K context window, you only have ~1,000 tokens left for the response. Exceed the window, and the model either truncates your input or refuses the request.
Context Window = Input + Output
In a long conversation, the history grows until it crowds out space for the response. Chat applications must truncate or summarize older messages to stay within the window.
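A minimal version of that truncation logic, using a crude 4-characters-per-token heuristic as a stand-in for a real tokenizer (swap in tiktoken for production):

```python
def truncate_history(messages: list[dict], budget_tokens: int) -> list[dict]:
    """Keep the system message plus the most recent messages that fit the budget."""
    est = lambda m: max(1, len(m["content"]) // 4)   # crude token estimate
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    kept, used = [], sum(est(m) for m in system)
    for msg in reversed(rest):            # walk newest-first
        cost = est(msg)
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    return system + list(reversed(kept))  # restore chronological order
```

Real chat applications often add a rolling summary of the dropped messages instead of discarding them outright.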
| Model | Context Window | Approx. Pages | Use Case |
|---|---|---|---|
| GPT-5 | 256K tokens | ~384 pages | Flagship, long documents |
| GPT-4o-mini | 128K tokens | ~192 pages | Cost-effective general use |
| Claude Opus 4.6 | 200K / 1M tokens | ~300 / ~1500 pages | Coding, analysis, long context |
| Claude Sonnet 4.6 | 200K tokens | ~300 pages | Best value for most tasks |
| Gemini 3 Pro | 1M / 2M tokens | ~1500 / ~3000 pages | Massive context tasks |
| Llama 4 (17B/405B) | 128K tokens | ~192 pages | Self-hosted, private |
Why Context Size Matters
Context length determines what problems your LLM can solve. But bigger is not always better — there are real trade-offs that practitioners must navigate.
More Context = More Information
With 128K tokens, you can include entire codebases, long documents, or extensive conversation history. The model sees everything at once — no chunking, no retrieval pipeline, no information loss.
RAG vs Long Context
Long context can replace RAG for some use cases. Instead of building a retrieval pipeline to find relevant chunks, just put the whole document in the prompt. Simpler architecture, fewer failure modes.
Attention Degradation
Models may lose focus in very long contexts. The "lost in the middle" problem: information at the start and end is recalled better than information in the middle.
Cost Scales Linearly
More tokens = higher cost. A 100K token request costs 10x more than a 10K token request. And latency scales too — first-token time increases with context length because the model must process all input tokens before generating the first output token.
Token Pricing (2026)
API providers charge per token, typically quoted per 1 million tokens. Output tokens are always more expensive than input — typically 2-5x more — because generation is sequential (autoregressive) while input processing is parallelizable.
| Model | Provider | Input ($/1M) | Output ($/1M) | Context | Output Ratio |
|---|---|---|---|---|---|
| GPT-5 | OpenAI | $2.00 | $8.00 | 256K | 4x |
| GPT-4o-mini | OpenAI | $0.15 | $0.60 | 128K | 4x |
| o3 | OpenAI | $10.00 | $40.00 | 200K | 4x |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | 200K | 5x |
| Claude Haiku 4.5 | Anthropic | $0.80 | $4.00 | 200K | 5x |
| Gemini 3 Pro | Google | $1.25 | $5.00 | 2M | 4x |
| Gemini 3 Flash | Google | $0.075 | $0.30 | 1M | 4x |
| Llama 4 405B | Meta / Together | $0.80 | $0.80 | 128K | 1x |
| DeepSeek V3 | DeepSeek | $0.27 | $1.10 | 128K | 4x |
Prices as of early 2026. For open-weight models (Llama, DeepSeek), prices are for hosted providers such as Together AI and vary by host. Self-hosting can be cheaper at scale.
Pareto Frontier: Quality vs Cost
Not every expensive model is worth the price. The Pareto frontier shows models where no other model is both cheaper and better. Everything on or near this line is an efficient choice — everything far below it is overpriced for its quality.
Reading the Frontier
The dashed line connects models on the Pareto frontier — the optimal set where no other model offers both lower cost and higher quality. Hover over any dot to see details.
Context Window vs Cost
Larger context windows let you process more information in one call — but does bigger always mean more expensive? The relationship is not linear. Some providers offer enormous context windows at surprisingly low prices.
Gemini 3 Pro: the largest context window available. At $1.25/1M tokens, you can process an entire book for under $3.
Gemini 3 Flash: 1M context at $0.075/1M tokens. Process a 750K-token document for just $0.056. The best context-per-dollar ratio.
Claude Haiku 4.5: 200K context at $0.80/1M. The sweet spot for complex reasoning tasks that need substantial context.
The Real Cost Equation
Many developers only look at input price. That is a mistake. The actual formula has two parts, and the output side often dominates.
Why output tokens cost more
Reading input is parallelizable — the model processes all tokens at once via matrix multiplication. Generating output is sequential — each token depends on the previous one (autoregressive decoding). This makes output 2-5x more compute-intensive, and pricing reflects that.
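The two-part formula is simple enough to encode directly. Prices per 1M tokens below are taken from the table above; treat them as illustrative:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in dollars; prices are quoted per 1M tokens."""
    return (input_tokens / 1e6) * input_price + (output_tokens / 1e6) * output_price

# GPT-5 at $2.00 in / $8.00 out: 10K tokens in, 1K tokens out
print(round(request_cost(10_000, 1_000, 2.00, 8.00), 4))   # 0.028
# Same request on GPT-4o-mini at $0.15 / $0.60
print(round(request_cost(10_000, 1_000, 0.15, 0.60), 4))   # 0.0021
```

Run it with your own input/output ratios before choosing a model; the ranking can flip between summarization-shaped and generation-shaped workloads.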
Concrete Examples
| Model | Input Cost | Output Cost | Total |
|---|---|---|---|
| GPT-5 | $0.094 | $0.005 | $0.099 |
| GPT-4o-mini | $0.006 | $0.0003 | $0.006 |
| Claude Sonnet 4.6 | $0.113 | $0.008 | $0.120 |
| Gemini 3 Flash | $0.003 | $0.0002 | $0.003 |
For summarization (high input, low output), input price dominates. GPT-4o-mini is 16x cheaper than GPT-5.
| Model | Input Cost | Output Cost | Total |
|---|---|---|---|
| GPT-5 | $0.0005 | $0.027 | $0.028 |
| GPT-4o-mini | $0.00003 | $0.002 | $0.002 |
| Claude Sonnet 4.6 | $0.0006 | $0.041 | $0.041 |
For generation (low input, high output), output price dominates. Claude Sonnet 4.6 costs 1.5x more than GPT-5 here because of its 5x output multiplier.
When to use expensive vs cheap models
Use cheap models for:

- Text classification and sentiment
- Data extraction from structured docs
- Simple summarization
- Embedding generation
- First-pass filtering in pipelines

Use expensive models for:

- Complex multi-step reasoning
- Code generation and debugging
- Nuanced writing and editing
- Math and logic problems
- Tasks where errors are costly
Cost Optimization Strategies
Beyond picking the right model, there are structural techniques that can cut your LLM costs by 50-90%. The best production systems use several of these together.
Prompt Caching
Anthropic, Google, and OpenAI all offer prompt caching: repeated prefixes of your prompt are cached and billed at a steep discount (cache reads cost 10% of the normal input price on Anthropic; discounts vary by provider). If your system prompt is 4,000 tokens and you send 1,000 queries, you pay full price once and up to 90% less for the other 999.
Available on Claude Sonnet 4.6 and Haiku 4.5. Google also offers context caching on Gemini. OpenAI supports it on GPT-5 and o3.
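The 4,000-token system prompt example works out like this, assuming cache reads at 10% of the input price, a hit on every call after the first, and ignoring any one-time cache-write surcharge (the $3.00/1M input price is illustrative):

```python
prompt_tokens = 4_000
calls = 1_000
input_price = 3.00 / 1e6   # dollars per token at $3.00/1M input

uncached = calls * prompt_tokens * input_price
cached = (prompt_tokens * input_price                       # first call, full price
          + (calls - 1) * prompt_tokens * input_price * 0.10)  # cache reads at 10%

print(f"uncached: ${uncached:.2f}")              # $12.00
print(f"cached:   ${cached:.2f}")                # $1.21
print(f"savings:  {1 - cached / uncached:.0%}")  # 90%
```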
Batch API
OpenAI's Batch API processes requests asynchronously within a 24-hour window at 50% off. Perfect for non-real-time tasks like content moderation, data labeling, or nightly report generation.
Model Routing
Route easy queries to cheap models and hard queries to expensive ones. A simple classifier (or even the cheap model itself) decides the complexity. In practice, 70-80% of queries can be handled by the small model.
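A routing layer can be as simple as a heuristic gate in front of two clients. A sketch — the classification rule and model names here are placeholders, and production routers typically use a small trained classifier instead:

```python
def looks_hard(query: str) -> bool:
    """Crude complexity gate: long queries or reasoning keywords go upmarket."""
    signals = ["prove", "debug", "refactor", "step by step", "why"]
    return len(query) > 500 or any(s in query.lower() for s in signals)

def route(query: str) -> str:
    return "expensive-model" if looks_hard(query) else "cheap-model"

print(route("What's the capital of France?"))           # cheap-model
print(route("Debug this race condition step by step"))  # expensive-model
```

A common refinement is to let the cheap model answer first and escalate only when it reports low confidence.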
Context Compression
Reduce input tokens without losing information. Techniques include:

- Summarize-then-query: use a cheap model to summarize long documents, then query the summary with an expensive model. Cuts input by 80-90%.
- Sliding window: for conversations, keep only the last N messages plus a rolling summary. Prevents context from growing unbounded.
- Retrieval: retrieve only relevant chunks instead of dumping entire documents into context. 10 relevant paragraphs beat 100 pages.
- Markup stripping: remove HTML, markdown, and whitespace from documents before sending. Can reduce token count by 20-40% on web content.
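The markup-stripping step is a few lines with the standard library — a rough cleaner; real pipelines often use a proper HTML parser (e.g., BeautifulSoup) instead of regexes:

```python
import re

def compress_text(html: str) -> str:
    """Strip tags and collapse whitespace to cut token count on web content."""
    text = re.sub(r"<[^>]+>", " ", html)   # drop HTML tags
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text.strip()

page = "<div>\n  <h1>Title</h1>\n  <p>Some   body   text.</p>\n</div>"
print(compress_text(page))  # Title Some body text.
```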
Stacking these strategies
Combine routing (70% savings) + caching (90% savings on cached portion) + batch API (50% savings on async tasks) and you can realistically reduce costs by 80-95% compared to sending everything to the most expensive model in real-time.
Practical Tips
1. Count Before You Send
Always count tokens before making API calls, especially with user-provided content. A single function can save you from truncated responses and unexpected bills.
import tiktoken

def check_token_limit(text: str, model: str = "gpt-4", max_tokens: int = 8000) -> dict:
    enc = tiktoken.encoding_for_model(model)
    token_count = len(enc.encode(text))
    return {
        "count": token_count,
        "within_limit": token_count <= max_tokens,
        "remaining": max_tokens - token_count,
    }

2. Set max_tokens for Outputs
Always set max_tokens in your API calls to prevent runaway costs and ensure you stay within context limits. Without it, the model may generate thousands of tokens when you only need a short answer — and you pay for every one.
3. Truncate Strategically
When content exceeds limits, truncate intelligently — keep the most relevant parts. For conversations, keep the system message + first message (task definition) + recent messages. For documents, keep the first and last sections (where key information tends to cluster).
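The first-and-last-sections heuristic for documents, sketched with a character budget (substitute real token counts for precision; the marker string is an arbitrary choice):

```python
def keep_head_tail(text: str, max_chars: int,
                   marker: str = "\n[...truncated...]\n") -> str:
    """Keep the start and end of a document, dropping the middle."""
    if len(text) <= max_chars:
        return text
    half = (max_chars - len(marker)) // 2
    return text[:half] + marker + text[-half:]

doc = "A" * 500 + "B" * 500 + "C" * 500
out = keep_head_tail(doc, 200)
print(len(out))             # fits within the 200-char budget
print(out[:5], out[-5:])    # AAAAA CCCCC
```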
4. Watch for Tokenization Surprises
Common pitfalls that waste tokens or produce unexpected behavior:
- Numbers: "123456789" may become 3+ tokens. Format large numbers carefully.
- Whitespace: extra newlines and indentation in prompts all cost tokens. Minify verbose prompts.
- Non-English text: CJK characters take 2-3x more tokens; Arabic and Cyrillic ~1.5x. Budget accordingly.
- Code: brackets, operators, and indentation mean code uses 2-3x more tokens per semantic unit than prose.
Tools and Resources
OpenAI Tokenizer
Official web tool to visualize how text gets tokenized. Paste any text and see the exact token boundaries. Essential for debugging.
platform.openai.com/tokenizer

tiktoken Library
Fast Python library for tokenizing text for OpenAI models. Production-grade, written in Rust for speed.
pip install tiktoken

Claude Tokenization
Anthropic's documentation on Claude tokenization. Different tokenizer than OpenAI — use their API to count tokens accurately.
docs.anthropic.com

HuggingFace Tokenizers
Deep dive into tokenizer algorithms: BPE, WordPiece, Unigram, and SentencePiece. The definitive reference.
huggingface.co/docs

Key Takeaways
1. Tokens are subwords, not characters or words — LLMs see text as token sequences learned via BPE. Common words are one token; rare words split into pieces. This is why LLMs struggle with letter counting and string reversal.
2. Context window = input + output — Budget your tokens. A 128K context does not mean 128K input if you need a long response.
3. Output tokens cost 2-5x more than input — For generation-heavy tasks, the output price dominates. Factor this into model selection.
4. Use the Pareto frontier to choose models — GPT-4o-mini, Claude Haiku 4.5, and Gemini Flash sit on the efficiency frontier. Models below the line are overpriced.
5. Stack optimization strategies — Caching + routing + batching can reduce costs 80-95%. The best systems never send everything to the most expensive model.