Test-Time Training + Muon
Every other top submission is a static artifact: train once, freeze, evaluate. Test-time training breaks that contract — the model keeps updating on the evaluation stream itself, blurring the line between inference and learning.
The core idea
Test-time training (TTT) treats each evaluation sample — or more precisely, each token of the evaluation stream — as an additional piece of training data. The model takes a few gradient steps to adapt its weights before making each prediction, using a self-supervised objective (typically next-token prediction on recent tokens).
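To make the loop concrete, here is a toy numpy sketch of that idea on a bigram model: before each prediction, take a few gradient steps of next-token cross-entropy on a recent window of the stream. Real submissions adapt transformer weights, not a bigram table, and every name and hyperparameter below is illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ttt_predict(W, stream, t, k_steps=5, lr=0.5, window=8):
    """Take a few gradient steps on recent tokens of the eval stream,
    then return the predicted distribution for the token after position t."""
    W = W.copy()                     # adaptation is local to this prediction
    lo = max(0, t - window)
    for _ in range(k_steps):
        grad = np.zeros_like(W)
        for i in range(lo, t):       # self-supervised next-token objective
            x, y = stream[i], stream[i + 1]
            p = softmax(W[x])
            p[y] -= 1.0              # d(cross-entropy)/d(logits)
            grad[x] += p
        grad /= max(1, t - lo)
        W -= lr * grad
    return softmax(W[stream[t]])
```

On a stream with an alternating pattern, a model initialized to know nothing (all-zero weights) picks up the structure within a handful of steps, which is the whole point: the artifact ships small and the knowledge arrives at eval time.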
In most benchmarks this would be cheating, but Parameter Golf's rules allow it: the artifact is limited to 16MB and the training budget is 10 minutes on 8×H100s. Whatever happens during eval is free, as long as the shipped artifact stays under budget. Contestants figured this out fast.
Why it's a parameter-efficiency trick
A static model has to store, in its weights, every piece of knowledge it might need. TTT changes that: the model only needs to store enough structure to adapt quickly to whatever distribution it sees at eval time. This shifts the model from being a knowledge store to being a learning system — and learning systems can be dramatically smaller than knowledge stores.
In other words: TTT is a compression strategy. Instead of storing answers, you store the capacity to derive answers from a few tokens of context. That's a much tighter encoding.
The Muon optimizer
Submission #2 · 1.1194 BPB · abaybektursun
LeakyReLU² + TTT + Muon
Muon is a relatively new optimizer (released late 2024) that applies an orthogonalization step — specifically, a Newton-Schulz iteration — to the momentum buffer before taking an update. The effect is that updates have well-conditioned singular values, which matters enormously in the low-data, few-step regime of test-time training.
AdamW, the default modern optimizer, works beautifully when you have millions of gradient steps to smooth out bad early updates. In TTT you have maybe 5-10 steps per token. Every step needs to be a clean step. Muon's orthogonalization gives you that cleanliness.
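A minimal numpy sketch of that orthogonalization step, using the quintic Newton-Schulz iteration published with Muon (the coefficients 3.4445, -4.7750, 2.0315 are from that release; the surrounding update logic here is simplified for illustration):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Push G's singular values toward 1 without computing an SVD.

    Each iteration applies f(x) = a*x + b*x^3 + c*x^5 to the singular
    values, driving them into a band around 1 in a few steps."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius-normalize first
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # iterate on the short side
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

def muon_update(momentum, grad, beta=0.95):
    """One simplified Muon step: momentum accumulation, then an
    orthogonalized update direction (illustrative, not the full recipe)."""
    momentum = beta * momentum + grad
    return momentum, newton_schulz_orthogonalize(momentum)
```

Because the returned direction has near-uniform singular values, no single subspace of the weight matrix dominates the update, which is exactly the conditioning you want when each token only affords a handful of steps.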
Submission #2 pairs Muon with a LeakyReLU² activation — squared LeakyReLU, which gives higher-order nonlinearity without new parameters — and a conservative learning-rate schedule that lets the TTT updates stay small but directionally correct.
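One plausible reading of that activation, sketched below as a sign-preserving square (y·|y| with y = LeakyReLU(x)); the submission's exact definition may differ, but any variant shares the property the writeup names: extra curvature, zero extra parameters.

```python
import numpy as np

def leaky_relu_squared(x, alpha=0.01):
    """Sign-preserving squared LeakyReLU (assumed form: y * |y|).

    Squaring adds a higher-order term to the nonlinearity while the
    parameter count stays exactly the same as plain LeakyReLU."""
    y = np.where(x > 0, x, alpha * x)
    return y * np.abs(y)
```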
Pending records: TTT is the future
Every one of the three pending record PRs uses TTT as its primary lever:
- 0.8265 BPB — ndokutovich — SLOT-24 + Pre-Quant AdamW TTT
- 1.0600 BPB — ndokutovich — Recur345 + Par7 + Pre-Quant TTT
- 1.0736 BPB — joshkmartinez — Pre-quant TTT + Parallel Residuals
The phrase "pre-quant TTT" is interesting: it means the model runs TTT in its unquantized form and only quantizes the final weights at the very end of each adaptation pass. That preserves the gradient signal during learning while still shipping a tiny artifact.
If 0.8265 BPB gets confirmed, it would represent a ~26% improvement over the current #1 — and it would confirm that static training is no longer the dominant paradigm for sub-20MB models.
What this means for on-device AI
The on-device AI story has always been about running static quantized models on constrained hardware. TTT changes that premise: the model you deploy might be smaller than the model you actually run, with the rest of its capability emerging during use. For phones, watches, and edge devices — where compute is cheap but storage is precious — that trade-off is the right shape.