OpenAI Parameter Golf
The challenge: build a language model that fits in 16MB and trains in 10 minutes on 8×H100s; lowest bits-per-byte (BPB) wins. OpenAI put up $1M in compute credits, and 1,500+ submissions have poured in. What's emerging is a masterclass in how to fit maximum knowledge into minimum parameters.
What's happening
Since launching on March 18, 2026, participants have compressed the naive baseline from 1.2244 to 1.1147 BPB — a 9% improvement — through innovations in three categories: quantization (ternary weights, GPTQ, int5/6 mixed precision), architecture (cross-sparse attention, depth recurrence, parallel residuals), and training tricks (test-time training, the Muon optimizer, sliding window evaluation).
Open PRs today claim scores as low as 0.83 BPB, suggesting the next wave of confirmed records will land well below the current marks. The techniques emerging here are directly relevant to on-device AI and edge deployment.
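For reference, bits-per-byte is the model's summed negative log-likelihood over the evaluation byte stream, converted from nats to bits and divided by the number of bytes. A minimal sketch (the function name is mine, not from the competition harness):

```python
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) over a
    byte stream into the bits-per-byte score."""
    return total_nll_nats / (math.log(2) * n_bytes)

# Sanity check: a model that is uniform over 256 byte values
# assigns each byte -ln(1/256) nats, i.e. exactly 8 bits per byte.
uniform_bpb = bits_per_byte(100 * math.log(256), 100)
```

The 1.1147 leaderboard figure means the winning model compresses the held-out text to roughly 1.11 bits per byte, versus 8 bits for raw storage.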
The top ideas, broken down
One editorial per dominant technique. Each deep-dive explains what it is, why it works, and which submissions pushed it.
Quantization: from FP32 to ternary
How GPTQ-lite, int5/6 mixed precision, and {-1,0,1} ternary weights fit more knowledge into 16MB than anyone thought possible. The self-generated calibration trick that put abaybektursun at #1.
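The exact recipes in these entries aren't public, but the ternary idea follows classic ternary-weight quantization: snap each weight to {-1, 0, +1} times a learned per-tensor scale. A minimal sketch; `threshold_ratio` is an illustrative hyperparameter, not a value from any submission:

```python
def ternary_quantize(weights, threshold_ratio=0.7):
    """Quantize a list of float weights to codes in {-1, 0, +1}
    plus a single float scale (TWN-style thresholding)."""
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    t = threshold_ratio * mean_abs  # weights below t round to zero
    codes = [0 if abs(w) < t else (1 if w > 0 else -1) for w in weights]
    # Scale = mean magnitude of the weights that survived thresholding,
    # so dequantized values (code * scale) match them on average.
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return codes, scale
```

At under 1.6 bits per weight (with entropy coding), ternary storage is what lets a ~70M-parameter model like the #10 entry squeeze into 16MB at all.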
Cross-Sparse Attention (XSA)
Full attention costs memory quadratic in context length, but not every layer needs full context. XSA variants — all-layer, last-4, deepest-3 — hold 3 of the top 6 leaderboard slots. What it is, and why it works for tiny models.
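XSA's internals aren't published, but the core move — restricting some layers to a local window while others keep full causal context — can be illustrated with a simple attention mask. A sketch under that assumption; the function name is mine:

```python
def sparse_attn_mask(seq_len, window, full=False):
    """Boolean causal attention mask. If full=False, each query
    position attends only to the last `window` positions (a local
    sliding window); if full=True, it sees the whole causal prefix."""
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q and (full or q - k < window)
            row.append(visible)
        mask.append(row)
    return mask
```

The "last-4" and "deepest-3" variants presumably choose which layers get `full=True`; local-window layers pay O(seq_len × window) memory instead of O(seq_len²).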
Test-Time Training + Muon
The model keeps learning as it evaluates — TTT turns inference into continued optimization. Paired with the Muon optimizer, it is the only top-3 entry not built primarily on quantization, and it powers most of the pending record PRs.
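The TTT loop itself is simple: score each evaluation example with the current weights, then take an optimizer step on that same example before moving on. A minimal SGD sketch (the real entries pair this with Muon; function names and signatures here are illustrative):

```python
def tt_train_eval(params, grad_fn, loss_fn, stream, lr=0.01):
    """Test-time training: accumulate the loss on each (x, y) pair
    *before* updating, then take one SGD step on that pair, so the
    model adapts to the evaluation stream as it is being scored."""
    total_loss = 0.0
    for x, y in stream:
        total_loss += loss_fn(params, x, y)  # scored pre-update
        g = grad_fn(params, x, y)
        params = [p - lr * gi for p, gi in zip(params, g)]
    return total_loss, params
```

Because each example is scored before the update, the protocol stays honest (no peeking at the label being predicted), yet later examples benefit from everything seen so far — which is exactly what a BPB metric over a long stream rewards.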
Architecture tricks: RoPE, residuals, activations
When you can't add parameters, you have to use the ones you have better. Partial rotary embeddings, parallel residuals, depth recurrence, SmearGate, LeakyReLU² — the micro-innovations compounding into big wins.
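Of these, parallel residuals are the easiest to show: instead of running attention and the MLP sequentially (each reading the other's output), both read the same layer input and their outputs are summed into one residual update, GPT-J style. A sketch with stand-in `attn` and `mlp` callables:

```python
def parallel_block(x, attn, mlp):
    """Parallel residual block: attention and MLP both read the same
    input x, and the residual adds both outputs at once.
    Sequential:  x -> x + attn(x) -> (that) + mlp(that)
    Parallel:    x -> x + attn(x) + mlp(x)"""
    return [xi + ai + mi for xi, ai, mi in zip(x, attn(x), mlp(x))]
```

The parallel form saves no parameters, but it shortens the critical path per layer and empirically loses little quality — a good trade when the budget forces few, narrow layers.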
Leaderboard (confirmed)
Top 10 as of April 10, 2026
| # | Submission | BPB |
|---|---|---|
| 1 | Self-Gen GPTQ + XSA-all | 1.1147 |
| 2 | LeakyReLU² + TTT + Muon | 1.1194 |
| 3 | EMA + GPTQ-lite | 1.1228 |
| 4 | Partial RoPE + LN Scale | 1.1248 |
| 5 | XSA4 + EMA + Int6 | 1.1271 |
| 6 | Efficient Partial XSA | 1.1307 |
| 7 | Int5-MLP + BigramHash | 1.1428 |
| 8 | SmearGate + BigramHash | 1.1458 |
| 9 | MLP3x + Int6 QAT | 1.1502 |
| 10 | Ternary U-Net 73.7M | 1.1570 |
Pending record claims
Open PRs claiming scores that would beat the current SOTA if confirmed.
- SLOT-24 + Pre-Quant AdamW TTT (ndokutovich)
- Recur345 + Par7 + Pre-Quant TTT (ndokutovich)
- Pre-quant TTT + Parallel Residuals (joshkmartinez)