Editorial: #4 and #8 on the leaderboard

Architecture tricks

When you can't add parameters, you have to use the ones you have better. This is the category of unglamorous micro-innovations — rotary embeddings, residual patterns, activations — that compound into meaningful movement on the leaderboard.

Partial RoPE + LN scale

Submission #4 · 1.1248 BPB · jfprincz

Partial RoPE + LN Scale

Rotary position embeddings (RoPE) are the default modern approach to encoding sequence order in transformer attention. Standard RoPE rotates every dimension of the query and key vectors. Partial RoPE rotates only a subset — typically the first half — leaving the rest position-invariant.

The parameter savings are modest (RoPE itself is a fixed transform, not a learned one), but the training dynamics change: the non-rotated dimensions learn position-agnostic features, and the rotated ones learn position-sensitive features. That split specialization helps small models by making each head more interpretable and easier to train.
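In code, the split is just a slice before the rotation. A minimal NumPy sketch (function names and the half-and-half layout are illustrative, not taken from the submission):

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply rotary embedding to every dimension pair of x: (seq, dim)."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair frequencies
    angles = positions[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied to each (x1_i, x2_i) pair
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def partial_rope(x, positions, rotary_frac=0.5):
    """Rotate only the first rotary_frac of dimensions; leave the rest as-is."""
    rot_dim = int(x.shape[-1] * rotary_frac)
    rotated = rope_rotate(x[:, :rot_dim], positions)
    return np.concatenate([rotated, x[:, rot_dim:]], axis=-1)
```

With `rotary_frac=0.5`, the untouched half of every query and key is identical at all positions, which is exactly the position-agnostic pathway described above; the rotated half carries all the position signal.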

The second half of submission #4, layer-wise LayerNorm scaling, is even subtler. Instead of learning the usual d-dimensional gain and bias for each LayerNorm, the submission uses a single scalar multiplier per layer, saving roughly 2d parameters per LayerNorm without measurable quality loss.
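A sketch of the scalar variant next to what it replaces; the exact parameterization in the submission may differ (e.g. whether any bias is kept), so treat this as illustrative:

```python
import numpy as np

def layernorm_affine(x, gain, bias, eps=1e-5):
    """Standard LayerNorm: d-dimensional gain and bias (2d parameters)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mu) / np.sqrt(var + eps) + bias

def layernorm_scalar(x, scale, eps=1e-5):
    """Scaled LayerNorm: one learned scalar per layer instead of 2d affine parameters."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return scale * (x - mu) / np.sqrt(var + eps)
```

The scalar still lets each layer set the magnitude of its output, which is most of what the full affine does in practice; what it gives up is per-channel rescaling.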

SmearGate: activation as capacity

Submission #8 · 1.1458 BPB · Raahil Shah

SmearGate + BigramHash

SmearGate is a gated activation that replaces the standard SwiGLU in the MLP block. The key property: it expands the effective dimensionality of the MLP without increasing the parameter count, by sharing gating weights across adjacent channels.
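The submission's exact formulation isn't reproduced here, so the following is one plausible reading of "sharing gating weights across adjacent channels": the gate projection produces half as many channels as the value projection, and each gate value is smeared across a pair of neighbors. Everything below is an assumption, not the author's code:

```python
import numpy as np

def silu(x):
    """SiLU/Swish, the nonlinearity used in standard SwiGLU gates."""
    return x / (1.0 + np.exp(-x))

def smear_gate(value, gate):
    """Hypothetical SmearGate: `gate` has half the channels of `value`.
    Each gate entry is repeated ("smeared") over two adjacent value channels,
    so the gate projection needs half the weights of a standard SwiGLU gate
    while the gated output keeps the full width."""
    g = silu(gate)                      # (..., d/2)
    g_full = np.repeat(g, 2, axis=-1)   # (..., d): adjacent pairs share one gate
    return value * g_full
```

Under this reading, the parameters saved on the gate projection are what get spent on extra width, which matches the "capacity without parameters" framing.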

Paired with a 3x MLP expansion ratio (rather than the conventional 4x), it puts parameter savings directly into width. The BigramHash half of the submission is a cheap embedding trick: instead of learning embeddings for every token pair, it hashes bigrams into a small shared table, giving a bigram-aware model at essentially zero parameter cost.
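The bigram-hashing idea fits in a few lines; the hash constant, the padding at position 0, and the table size below are illustrative choices, not the submission's:

```python
import numpy as np

def bigram_hash_embed(token_ids, table, num_buckets):
    """Look up an embedding for each (previous token, current token) pair
    by hashing the pair into a small shared table. Collisions are accepted:
    the table stays tiny regardless of vocabulary size."""
    prev = np.concatenate([[0], token_ids[:-1]])     # pad position 0 (illustrative)
    h = (prev * 1000003 + token_ids) % num_buckets   # cheap multiplicative hash
    return table[h]
```

A num_buckets-by-d table costs num_buckets * d parameters total, versus |V|^2 * d for explicit bigram embeddings, which is what makes the cost "essentially zero" at small bucket counts.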

Parallel residuals and depth recurrence

Two techniques showing up in the pending PRs — ndokutovich's Recur345 + Par7 and joshkmartinez's Parallel Residuals — are about recirculating computation instead of stacking new layers.

  • Parallel residuals compute the attention and MLP blocks of a layer in parallel rather than sequentially, then sum them into the residual stream. The formulation was introduced in GPT-J and later adopted at scale by PaLM. It saves one LayerNorm per layer and lets the attention and MLP operate on the same input — which helps in the low-capacity regime because both blocks see the same signal.
  • Depth recurrence reuses the same layer multiple times. If you have 6 unique layers applied 2x each, you pay for 6 layers' worth of parameters but get 12 layers' worth of computation. For tiny models this is a direct capacity win, and it pairs naturally with TTT because the recurrent layers can be adapted as a group.
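Both ideas fit in one short sketch. The sublayers below are stand-in stubs (a single matrix for "attention", a tied two-matrix MLP), chosen only to show the wiring; the actual blocks in the PRs will differ:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

def layernorm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def attn_stub(h, w):
    """Stand-in for the attention sublayer."""
    return h @ w

def mlp_stub(h, w):
    """Stand-in for the MLP sublayer (tied weights, just for the sketch)."""
    return np.maximum(h @ w, 0.0) @ w.T

def parallel_block(x, w_attn, w_mlp):
    """Parallel residual: one LayerNorm, both sublayers read the same
    normalized input, and their outputs are summed into the residual."""
    h = layernorm(x)
    return x + attn_stub(h, w_attn) + mlp_stub(h, w_mlp)

# Depth recurrence: 3 unique blocks applied twice each gives 6 blocks of
# compute for 3 blocks of parameters.
params = [(rng.normal(size=(d, d)) * 0.02, rng.normal(size=(d, d)) * 0.02)
          for _ in range(3)]

def forward(x, n_loops=2):
    for _ in range(n_loops):
        for w_attn, w_mlp in params:
            x = parallel_block(x, w_attn, w_mlp)
    return x
```

The recurrence loop is also where a TTT hook would naturally sit: adapting `params` adapts every repeated application of the shared layers at once.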

LeakyReLU² and higher-order nonlinearity

Submission #2 (covered in the TTT editorial) uses LeakyReLU² — the element-wise square of a LeakyReLU — as its activation. Squaring adds a second-order term that increases expressiveness without parameters, and the LeakyReLU ensures gradients still flow on the negative side.
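The activation itself is two lines; the slope `alpha` below is the conventional 0.01, since the submission's value isn't stated here:

```python
import numpy as np

def leaky_relu_sq(x, alpha=0.01):
    """Element-wise square of LeakyReLU. The square adds a second-order
    term at zero parameter cost; for x < 0 the derivative is 2*alpha**2*x,
    so the negative side still carries a (small) gradient."""
    y = np.where(x >= 0.0, x, alpha * x)
    return y * y
```

Note that squaring makes the output nonnegative everywhere; the negative inputs are distinguished only by the small leaky slope before the square.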

It's a tiny change, but these tiny changes are exactly what separates 1.12 from 1.11 BPB in a competition where the top 10 are separated by 5%.

The underlying philosophy

None of these tricks is individually revolutionary. Partial rotary embeddings and parallel residuals both trace back to GPT-J in 2021; depth recurrence goes back to Universal Transformers in 2018. What Parameter Golf is demonstrating is a selection effect: when you have a hard parameter budget, the best known micro-innovations across the literature all become load-bearing at once.

The leaderboard is, in effect, a ranked list of which inductive biases give the most capacity per parameter. That's a genuinely useful output for anyone designing small models — and it's why these submissions deserve a closer read than "just another benchmark."