Train a language model that fits in 16MB and trains in 10 minutes on 8×H100s. Lowest bits-per-byte wins. OpenAI put up $1M in compute credits. 1,500+ submissions. What's emerging is a masterclass in how to fit maximum knowledge into minimum parameters.
Since the launch on March 18, 2026, participants have cut the score from the naive baseline's 1.2244 BPB to 1.1147 BPB, a 9% improvement, through innovations in three categories: quantization (ternary weights, GPTQ, int5/6 mixed precision), architecture (cross-sparse attention, depth recurrence, parallel residuals), and training tricks (test-time training, the Muon optimizer, sliding window evaluation).
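For readers new to the metric, here is a minimal sketch of how bits-per-byte is usually computed and how the 9% figure falls out of the two scores. It assumes the common definition (summed cross-entropy in nats, converted to bits, divided by the byte length of the text); the challenge's exact evaluation harness is not described here, and the 3.0 nats/token and 4 bytes/token inputs are made-up illustration values.

```python
import math

def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy (in nats) over a text into bits per byte."""
    return total_nats / (math.log(2) * total_bytes)

def relative_improvement(baseline: float, new: float) -> float:
    """Fractional reduction of `new` relative to `baseline` (lower BPB is better)."""
    return (baseline - new) / baseline

# Scores quoted in the text.
baseline_bpb = 1.2244
record_bpb = 1.1147
gain = relative_improvement(baseline_bpb, record_bpb)
print(f"improvement over baseline: {gain:.1%}")

# Hypothetical example: 1,000 tokens at 3.0 nats each, 4 bytes per token on average.
example_bpb = bits_per_byte(total_nats=3.0 * 1000, total_bytes=4 * 1000)
print(f"example BPB: {example_bpb:.3f}")
```

Because the denominator is bytes rather than tokens, the metric stays comparable across submissions that use different tokenizers.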
Open PRs today claim scores as low as 0.83 BPB. If those results are confirmed, the next wave of records will land far below the current best of 1.1147. The techniques emerging here are directly relevant to on-device AI and edge deployment.
One editorial per dominant technique. Each deep-dive explains what it is, why it works, and which submissions pushed it.
How GPTQ-lite, int5/6 mixed precision, and {-1,0,1} ternary weights fit more knowledge into 16MB than anyone thought possible. The self-generated calibration trick that put abaybektursun at #1.
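To make the size pressure concrete, the sketch below estimates how many weights fit in 16MB at different bit widths and shows one simple way to map weights to {-1, 0, 1}. Several things are assumptions: 16MB is taken as 16,000,000 bytes, the whole budget is spent on weights (no scales, code, or entropy coding), and the absmean scaling rule is one common ternarization scheme, not necessarily what any submission uses. `max_params`, `ternarize`, and `dequantize` are illustrative names.

```python
import math

BUDGET_BYTES = 16_000_000  # assumption: decimal megabytes, all spent on weights

def max_params(bits_per_weight: float, budget_bytes: int = BUDGET_BYTES) -> int:
    """Upper bound on the number of weights that fit at a given bit width."""
    return int(budget_bytes * 8 / bits_per_weight)

def ternarize(weights):
    """Absmean ternary quantization: returns (codes in {-1, 0, 1}, scale)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    # Round to the nearest integer multiple of the scale, then clamp to [-1, 1].
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Reconstruct approximate weights from ternary codes and one shared scale."""
    return [c * scale for c in codes]

for name, bits in [("fp16", 16), ("int6", 6), ("int5", 5), ("ternary", math.log2(3))]:
    print(f"{name:8s} {max_params(bits) / 1e6:6.1f}M parameters")

codes, scale = ternarize([0.9, -0.05, -1.2, 0.4, 0.02, -0.6])
print(codes, round(scale, 4))
```

Under these idealized assumptions a ternary model has room for roughly ten times as many weights as an fp16 one, which is the basic trade the quantization entries are exploring: more, coarser parameters instead of fewer, finer ones.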
Attention costs quadratic memory, but not every layer needs full context. XSA variants — all-layer, last-4, deepest-3 — dominate 3 of the top 6 leaderboard slots. What it is, why it works for tiny models.
The model keeps learning as it evaluates: test-time training (TTT) turns inference into continued optimization. Paired with the Muon optimizer, it is the only top-3 entry not built primarily on quantization, and it powers most pending record PRs.
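The core loop of test-time training can be sketched with a deliberately tiny model. This is an illustration under stated assumptions, not any submission's procedure: the "model" is a single vector of logits over a 4-symbol vocabulary, the update is plain SGD rather than Muon or AdamW, and the result is in bits per token rather than BPB. The point it shows is the ordering: each chunk is scored with the current weights first, and only then used for a gradient step.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def chunk_bits(logits, chunk):
    """Total code length of a chunk, in bits, under the current model."""
    probs = softmax(logits)
    return -sum(math.log2(probs[t]) for t in chunk)

def sgd_step(logits, chunk, lr):
    """One gradient step on the chunk's mean cross-entropy."""
    probs = softmax(logits)
    counts = [0] * len(logits)
    for t in chunk:
        counts[t] += 1
    # d(mean CE)/d(logit_k) = p_k - empirical_frequency_k
    return [z - lr * (p - c / len(chunk)) for z, p, c in zip(logits, probs, counts)]

def evaluate(stream, vocab_size, chunk_size, lr=0.0):
    """Average bits per token; lr > 0 enables test-time adaptation."""
    logits = [0.0] * vocab_size  # stand-in for pretrained weights
    total_bits = 0.0
    for start in range(0, len(stream), chunk_size):
        chunk = stream[start:start + chunk_size]
        total_bits += chunk_bits(logits, chunk)   # score first
        if lr > 0.0:
            logits = sgd_step(logits, chunk, lr)  # then adapt on what was scored
    return total_bits / len(stream)

stream = [0, 0, 1, 0, 0, 0, 2, 0] * 8  # toy data skewed toward symbol 0
frozen = evaluate(stream, vocab_size=4, chunk_size=8)
adapted = evaluate(stream, vocab_size=4, chunk_size=8, lr=0.5)
print(f"frozen: {frozen:.3f} bits/token, adapted: {adapted:.3f} bits/token")
```

Scoring before updating means no token influences the weights used to predict it, so the adapted number is still an honest compression cost for the stream.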
When you can't add parameters, you have to use the ones you have better. Partial rotary embeddings, parallel residuals, depth recurrence, SmearGate, LeakyReLU² — the micro-innovations compounding into big wins.
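Of the ideas listed, partial rotary embeddings are the easiest to show in a few lines. The sketch below applies rotary position encoding to only a fraction of a vector's dimensions and leaves the rest untouched. The 50% fraction, the base of 10000, and the interleaved pairing of dimensions are assumptions chosen for illustration; the submissions may split or pair dimensions differently.

```python
import math

def partial_rope(vec, position, rotary_fraction=0.5, base=10000.0):
    """Rotate the first `rotary_fraction` of dims by position-dependent angles;
    pass the remaining dims through unchanged."""
    rot_dim = int(len(vec) * rotary_fraction)
    rot_dim -= rot_dim % 2  # rotations act on pairs of dims
    out = list(vec)
    for i in range(0, rot_dim, 2):
        theta = position * base ** (-i / rot_dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.0, 0.5, -0.5, 2.0, 3.0, -1.0, 4.0]
k = [0.2, 1.0, -0.3, 0.7, 1.0, -1.0, 0.5, 0.5]

# Same relative offset (2) at two different absolute positions.
score_near = dot(partial_rope(q, 5), partial_rope(k, 3))
score_far = dot(partial_rope(q, 12), partial_rope(k, 10))
print(round(score_near, 6), round(score_far, 6))
```

The two scores agree because the rotated half encodes only the relative offset between query and key, while the unrotated half contributes a position-independent term. That leaves part of each head free to match on content alone.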
Top 10 as of April 10, 2026
| # | Submission | BPB |
|---|---|---|
| 1 | Self-Gen GPTQ + XSA-all | 1.1147 |
| 2 | LeakyReLU² + TTT + Muon | 1.1194 |
| 3 | EMA + GPTQ-lite | 1.1228 |
| 4 | Partial RoPE + LN Scale | 1.1248 |
| 5 | XSA4 + EMA + Int6 | 1.1271 |
| 6 | Efficient Partial XSA | 1.1307 |
| 7 | Int5-MLP + BigramHash | 1.1428 |
| 8 | SmearGate + BigramHash | 1.1458 |
| 9 | MLP3x + Int6 QAT | 1.1502 |
| 10 | Ternary U-Net 73.7M | 1.1570 |
Open PRs with scores that would shatter the current SOTA if confirmed.
| Submission | Author |
|---|---|
| SLOT-24 + Pre-Quant AdamW TTT | ndokutovich |
| Recur345 + Par7 + Pre-Quant TTT | ndokutovich |
| Pre-quant TTT + Parallel Residuals | joshkmartinez |