Live Competition · Ends Apr 30, 2026

OpenAI Parameter Golf

Train a language model that fits in 16MB and trains in 10 minutes on 8×H100s. Lowest bits-per-byte wins. OpenAI put up $1M in compute credits. 1,500+ submissions. What's emerging is a masterclass in how to fit maximum knowledge into minimum parameters.

1.1147

Confirmed SOTA (BPB)

0.8265

Pending review (open PR)

1,500+

Submissions

1.2244

Baseline BPB

Getting started guide → · Methods dictionary · Competition repo

What's happening

Since the launch on March 18, 2026, participants have cut the naive baseline from 1.2244 to 1.1147 BPB, a 9% improvement, through innovations in three categories: quantization (ternary weights, GPTQ, int5/6 mixed precision), architecture (cross-sparse attention, depth recurrence, parallel residuals), and training tricks (test-time training, the Muon optimizer, sliding window evaluation).
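The 9% figure is a relative reduction in BPB. A quick check using only the two scores quoted above (the helper name is illustrative, not part of the competition tooling):

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after` (0.10 means 10% lower)."""
    return (before - after) / before


baseline_bpb = 1.2244   # naive baseline quoted on this page
confirmed_bpb = 1.1147  # confirmed SOTA quoted on this page

confirmed_gain = relative_reduction(baseline_bpb, confirmed_bpb)
print(f"confirmed SOTA vs baseline: {confirmed_gain:.1%} lower")
```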

Open PRs currently claim scores as low as 0.83 BPB. If confirmed, the best of them would land roughly a quarter below the current record. The techniques emerging here are directly relevant to on-device AI and edge deployment.

The top ideas, broken down

One editorial per dominant technique. Each deep-dive explains what it is, why it works, and which submissions pushed it.

#1 (1.1147 BPB)

Quantization: from FP32 to ternary

How GPTQ-lite, int5/6 mixed precision, and {-1,0,1} ternary weights fit more knowledge into 16MB than anyone thought possible. The self-generated calibration trick that put abaybektursun at #1.

Self-Gen GPTQ · Int5/6 mixed · Ternary + LZMA · QAT
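To make the ternary idea concrete, here is a minimal sketch of mapping float weights to {-1, 0, 1}, packing five ternary digits per byte, and passing the result through LZMA. The threshold heuristic, the base-3 packing, and all function names are assumptions for illustration. The submissions' actual schemes (for example the bitmask layout mentioned in the leaderboard) may differ.

```python
import lzma
import random


def ternarize(weights, threshold_ratio=0.7):
    """Map floats to codes in {-1, 0, 1} plus one float scale.

    Assumption: weights whose magnitude is below threshold_ratio * mean(|w|)
    become 0; the scale is the mean magnitude of the surviving weights.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    threshold = threshold_ratio * mean_abs
    codes = [0 if abs(w) <= threshold else (1 if w > 0 else -1) for w in weights]
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return codes, scale


def pack_base3(codes):
    """Pack five ternary digits into each byte (3**5 = 243 fits in 256)."""
    out = bytearray()
    for i in range(0, len(codes), 5):
        value = 0
        for c in reversed(codes[i:i + 5]):
            value = value * 3 + (c + 1)  # shift {-1, 0, 1} to {0, 1, 2}
        out.append(value)
    return bytes(out)


def unpack_base3(data, n):
    """Inverse of pack_base3; n is the original number of codes."""
    codes = []
    for byte in data:
        for _ in range(5):
            codes.append(byte % 3 - 1)
            byte //= 3
    return codes[:n]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]  # stand-in weights
codes, scale = ternarize(weights)
packed = pack_base3(codes)
compressed = lzma.compress(packed)
print("fp32 bytes:", len(weights) * 4)
print("packed ternary bytes:", len(packed))
print("after LZMA:", len(compressed))
```

Dequantization is simply `scale * code` per weight. How much LZMA gains on top of the packing depends on how structured the codes are, so no particular ratio should be expected from random stand-in weights.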

#1, #5, #6

Cross-Sparse Attention (XSA)

Full self-attention scales quadratically with context length, but not every layer needs full context. XSA variants (all-layer, last-4, deepest-3) hold 3 of the top 6 leaderboard slots. What it is, and why it works for tiny models.

XSA-all · XSA4 · Partial XSA
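The three variants differ only in which layers receive XSA. A small sketch of that selection follows. It assumes that "last" and "deepest" both mean the layers closest to the output. The function name, variant strings, and the 11-layer example are hypothetical, and the attention mechanism itself is not modeled.

```python
def xsa_layer_indices(num_layers: int, variant: str) -> list[int]:
    """Return 0-based indices of the layers that would use XSA.

    Assumption: 'last-4' and 'deepest-3' count from the output side.
    """
    depth_by_variant = {"all": num_layers, "last-4": 4, "deepest-3": 3}
    if variant not in depth_by_variant:
        raise ValueError(f"unknown variant: {variant!r}")
    k = min(depth_by_variant[variant], num_layers)
    return list(range(num_layers - k, num_layers))


for name in ("all", "last-4", "deepest-3"):
    print(name, xsa_layer_indices(11, name))  # 11 layers is a made-up depth
```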

#2 (1.1194 BPB)

Test-Time Training + Muon

The model keeps learning as it evaluates — TTT turns inference into continued optimization. Paired with the Muon optimizer, it's the only non-quantization-dominated top-3 entry, and it powers most pending record PRs.

TTT · Muon optimizer · Pre-quant TTT · AdamW TTT
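The core loop of test-time training can be shown with a toy model. The sketch below uses a byte-level unigram softmax model trained with plain gradient descent instead of a transformer with Muon or AdamW. Each chunk is scored with the current parameters first and only then used for one gradient step. That ordering, the chunk size, and the learning rate are assumptions of this sketch, not the competition's stated protocol.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_then_adapt(data: bytes, chunk_size=64, lr=0.5, adapt=True):
    """Bits-per-byte of a byte-unigram softmax model over `data`.

    With adapt=True the model takes one gradient step on each chunk
    after that chunk has been scored, so it never sees bytes before
    predicting them.
    """
    logits = [0.0] * 256  # uniform start: exactly 8 bits per byte
    total_bits = 0.0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        probs = softmax(logits)
        total_bits += sum(-math.log2(probs[b]) for b in chunk)
        if adapt:
            counts = [0] * 256
            for b in chunk:
                counts[b] += 1
            # Gradient of mean cross-entropy w.r.t. logits is
            # (predicted probability - empirical frequency).
            for i in range(256):
                logits[i] -= lr * (probs[i] - counts[i] / len(chunk))
    return total_bits / len(data)


sample = b"the quick brown fox jumps over the lazy dog. " * 40
frozen_bpb = score_then_adapt(sample, adapt=False)
adapted_bpb = score_then_adapt(sample, adapt=True)
print(f"frozen: {frozen_bpb:.3f} BPB, with test-time updates: {adapted_bpb:.3f} BPB")
```

The frozen model stays at the uniform 8 bits per byte, while the adapting one drifts toward the byte statistics of the text it is being evaluated on.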

#4, #8

Architecture tricks: RoPE, residuals, activations

When you can't add parameters, you have to use the ones you have better. Partial rotary embeddings, parallel residuals, depth recurrence, SmearGate, LeakyReLU² — the micro-innovations compounding into big wins.

Partial RoPE · Parallel residuals · Depth recurrence · SmearGate
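Of the tricks listed, partial rotary embeddings are the easiest to show in isolation: only the first few dimensions of each query or key vector are rotated by a position-dependent angle, and the rest pass through unchanged. The sketch below uses the interleaved pairing and base of 10000 from the original RoPE formulation. The function name and the split point are illustrative, and the submissions may use a different layout.

```python
import math


def partial_rope(vec, position, rotary_dims, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; leave the tail as is.

    Consecutive pairs (0,1), (2,3), ... are rotated by
    position * base ** (-i / rotary_dims), where i is the pair's first index.
    """
    if rotary_dims % 2 or rotary_dims > len(vec):
        raise ValueError("rotary_dims must be even and at most len(vec)")
    out = list(vec)
    for i in range(0, rotary_dims, 2):
        theta = position * base ** (-i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


query = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
rotated = partial_rope(query, position=7, rotary_dims=4)
print(rotated[:4])  # position-dependent
print(rotated[4:])  # untouched tail
```

Because each pair is rotated, the dot product between a rotated query and a rotated key depends only on their position difference, while the unrotated tail contributes a position-independent term.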

Leaderboard (confirmed)

Top 10 as of April 10, 2026

Source on GitHub →

#	Submission	BPB	Author	Key Innovation
1	Self-Gen GPTQ + XSA-all	1.1147	abaybektursun	Self-generated GPTQ calibration + all-layer cross-sparse attention
2	LeakyReLU² + TTT + Muon	1.1194	abaybektursun	Test-time training with parallel Muon optimizer
3	EMA + GPTQ-lite	1.1228	signalrush	GPTQ-lite quantization with exponential moving average
4	Partial RoPE + LN Scale	1.1248	jfprincz	Partial rotary embeddings, layer-wise LN scaling
5	XSA4 + EMA + Int6	1.1271	jfprincz	Cross-sparse attention on last 4 layers, int6 quant
6	Efficient Partial XSA	1.1307	unnir	Partial XSA on 3 deepest layers
7	Int5-MLP + BigramHash	1.1428	thwu1	Mixed int5/int6 quantization, bigram hashing
8	SmearGate + BigramHash	1.1458	Raahil Shah	3x MLP expansion, SmearGate activation
9	MLP3x + Int6 QAT	1.1502	aruniyer	Quantization-aware training, sliding window eval
10	Ternary U-Net 73.7M	1.1570	CiprianFlorin-Ifrim	73.7M params quantized to {-1, 0, 1} with bitmask LZMA
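The #10 entry shows why ternary weights matter under a fixed size cap. A back-of-the-envelope check of how 73.7M parameters can fit, assuming "16MB" means 16,000,000 bytes and ignoring scales, any non-ternary tensors, and code that may also count toward the limit:

```python
import math

ARTIFACT_BYTES = 16 * 10**6  # assumption: decimal megabytes
params = 73.7e6              # parameter count from the leaderboard row

budget_bits_per_param = ARTIFACT_BYTES * 8 / params
ternary_bits_per_param = math.log2(3)  # information in one {-1, 0, 1} symbol
raw_ternary_mb = params * ternary_bits_per_param / 8 / 1e6

print(f"budget: {budget_bits_per_param:.3f} bits/param")
print(f"ternary needs: {ternary_bits_per_param:.3f} bits/param")
print(f"ideal ternary packing: {raw_ternary_mb:.2f} MB before compression")
```

Under these assumptions the budget is a little over 1.7 bits per parameter, so int5 or int6 weights at this parameter count would not fit, while ideally packed ternary weights would.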

Pending record claims

Open PRs with scores that would shatter the current SOTA if confirmed.

0.8265 BPB

SLOT-24 + Pre-Quant AdamW TTT

ndokutovich

1.0600 BPB

Recur345 + Par7 + Pre-Quant TTT

ndokutovich

1.0736 BPB

Pre-quant TTT + Parallel Residuals

joshkmartinez

The rules

Artifact size

16 MB

Final compressed model must fit

Training budget

10 minutes

On 8×H100 GPUs, end-to-end

Metric

Bits-per-byte

On held-out text. Lower is better.
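Bits-per-byte is commonly computed by summing the model's negative log-likelihood over the text, converting from nats to bits, and dividing by the number of UTF-8 bytes rather than tokens. A minimal sketch of that conversion follows. The function names and the example numbers are hypothetical, and the competition's own evaluation script is the authority on details.

```python
import math


def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte."""
    return total_nll_nats / (math.log(2) * num_bytes)


def bits_per_byte_from_token_loss(mean_loss_nats, num_tokens, num_bytes):
    """Same metric, starting from a mean per-token cross-entropy loss."""
    return bits_per_byte(mean_loss_nats * num_tokens, num_bytes)


# Made-up example: 1,000 tokens covering 4,200 bytes at a mean loss of 3.3 nats.
example_bpb = bits_per_byte_from_token_loss(3.3, 1_000, 4_200)
print(f"{example_bpb:.4f} BPB")
```

Dividing by bytes instead of tokens keeps the score comparable across tokenizers, since a tokenizer with longer tokens has a higher per-token loss but covers more bytes per token.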

Prize pool

$1M credits

Awarded by OpenAI

Competition repo →