Reference · 53 methods

Methods dictionary

Every technique and acronym that appears in Parameter Golf submissions, with plain-English definitions and links to the editorials and submissions where they're load-bearing. Use this as a cheat sheet while reading leaderboard PRs or designing your own entry.

Quantization

Compressing weights from FP32 down to as little as ~1 bit per weight.

GPTQ

Post-training weight quantization that uses a calibration dataset to decide which weights can be rounded aggressively.

Frames quantization as a per-layer least-squares problem: given a calibration batch, find the INT-k weights that minimize reconstruction error on the layer's outputs. The default in ~half the top 10.
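The per-layer objective can be sketched with a simple clip search over a single scale (real GPTQ does Hessian-based sequential rounding; the sizes and names here are illustrative, not any submission's code):

```python
import numpy as np

def quantize_layer(W, X, bits=4, n_grid=50):
    """Pick the per-layer scale that minimizes output reconstruction error
    on a calibration batch X (simplified stand-in for full GPTQ)."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for int4
    ref = X @ W                           # full-precision layer outputs
    best_err, best_Wq = np.inf, None
    for frac in np.linspace(0.5, 1.0, n_grid):   # clip search over scales
        scale = frac * np.abs(W).max() / qmax
        Wq = np.clip(np.round(W / scale), -qmax - 1, qmax) * scale
        err = float(np.sum((X @ Wq - ref) ** 2))  # least-squares objective
        if err < best_err:
            best_err, best_Wq = err, Wq
    return best_Wq, best_err

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)).astype(np.float32)
X = rng.normal(size=(256, 64)).astype(np.float32)   # calibration batch
Wq, err = quantize_layer(W, X, bits=4)
```

Because the grid includes the naive full-range scale, the calibrated result is never worse than plain round-to-nearest on the calibration batch.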

Self-Generated GPTQ Calibration

aka Self-Gen GPTQ

Instead of calibrating GPTQ on training samples, calibrate on samples the model itself generates.

abaybektursun's #1 trick. The self-generated samples capture the distribution the model actually operates in (including its own biases), producing a calibration signal tightly aligned with the weights being quantized. Better error recovery at extreme bit-widths.

#1 Self-Gen GPTQ + XSA-all · 1.1147 BPB → editorial

GPTQ-lite

A lightweight GPTQ variant with a clip search instead of the full Hessian solve.

Trades some reconstruction fidelity for dramatically lower memory/compute during calibration. signalrush's #3 entry combines it with EMA weight averaging.

#3 EMA + GPTQ-lite · 1.1228 BPB

GPTQ Embeddings

Applying GPTQ specifically to the token embedding table (usually left in higher precision).

Embeddings dominate the parameter count in small-vocab models, so quantizing them aggressively is high-leverage. Kevin Clark's #5 entry did this with std-based clipping.

Int5 / Int6 mixed precision

Per-layer or per-block choice between 5-bit and 6-bit integer weights.

Not every weight matters equally. Attention projections behave differently from MLPs; critical layers can stay at int6 while MLP weights drop to int5 for a ~15% saving over uniform int6.

#7 Int5-MLP + BigramHash · 1.1428 BPB

QAT

aka Quantization-Aware Training

Training the model with fake-quant ops in the forward pass so the weights learn to tolerate rounding.

Unlike post-training quantization (GPTQ), QAT exposes the quantization error to the optimizer from the start. Often combined with Straight-Through Estimator (STE) to get gradients through the rounding op.

#9 MLP3x + Int6 QAT · 1.1502 BPB

STE

aka Straight-Through Estimator

Gradient trick that passes gradients unchanged through a non-differentiable quantization op.

Without STE, QAT would be impossible — round() has zero gradient almost everywhere. STE pretends the op was the identity on the backward pass.
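A numeric illustration in plain numpy (no autograd; the scale and values are illustrative). The true gradient through round() is zero, so STE reuses the downstream gradient as if quantization were the identity:

```python
import numpy as np

def fake_quant(w, scale=0.1):
    """Round-to-grid op; its true derivative is zero almost everywhere."""
    return np.round(w / scale) * scale

# Loss on the quantized weight: L(w) = (fake_quant(w) - t)**2
w, t = 0.53, 1.0
q = fake_quant(w)                  # 0.5

# True chain rule gives dL/dw = 0 (round() is flat), so training stalls.
# STE pretends fake_quant was the identity on the backward pass:
ste_grad = 2.0 * (q - t)           # dL/dq, passed straight through to w

w_new = w - 0.1 * ste_grad         # a gradient step that actually moves w
```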

Ternary quantization

aka {-1, 0, 1}

Weights constrained to three values. In theory log2(3) ≈ 1.58 bits/weight.

In practice stored as a 2-bit-per-weight bitmask then LZMA-compressed, exploiting the high ratio of zeros to drive effective bits-per-weight below 1. CiprianFlorin-Ifrim's #10 submission quantizes 73.7M params this way.

#10 Ternary U-Net 73.7M · 1.1570 BPB → editorial
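A minimal sketch of the storage pipeline described above: ternarize, pack four weights per byte, then LZMA the bitmask. The thresholding rule is illustrative, not the submission's actual recipe:

```python
import lzma
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=100_000).astype(np.float32)

# Ternarize: small weights -> 0, the rest -> their sign. A high zero
# fraction is what makes the compressed artifact small.
thresh = 0.7 * np.abs(w).mean()
tern = np.sign(w) * (np.abs(w) > thresh)          # values in {-1, 0, +1}

# Store as 2 bits/weight (codes 0,1,2), four weights per byte, then LZMA.
codes = (tern + 1).astype(np.uint8)               # {-1,0,1} -> {0,1,2}
packed = (codes[0::4] | (codes[1::4] << 2)
          | (codes[2::4] << 4) | (codes[3::4] << 6))
blob = lzma.compress(packed.tobytes())

bits_per_weight = 8 * len(blob) / len(w)          # effective storage cost
```

With a higher zero ratio than this toy example, the same pipeline drives the effective cost below 1 bit/weight.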

1-bit quantization

aka binary quantization

Weights constrained to {-1, +1}. The theoretical floor on per-weight cost.

Covered in the non-record 16MB track with a 106M-param asymmetric U-Net submission (1.1239 BPB) — showing the idea is viable even if it hasn't cracked the main leaderboard yet.

LZMA / zstd compression

Lossless compression applied to the final quantized weight tensor before counting artifact bytes.

Low-bit quantized tensors have highly non-uniform distributions (lots of zeros, repeated patterns) that standard LZMA/zstd can crush. Several top submissions use zstd-22 or LZMA as the final compression pass.

EMA

aka Exponential Moving Average

Keep a running average of the weights during training; use the average at evaluation.

Smooths out training noise, and — crucially in this competition — interacts well with quantization, since the averaged weights tend to have cleaner low-bit roundings.

#3 EMA + GPTQ-lite · 1.1228 BPB
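A minimal weight-EMA sketch (the decay value and the noisy "training" loop are illustrative):

```python
import numpy as np

class WeightEMA:
    """Running average of the weights; train on the live params, eval on `avg`."""
    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.avg = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        for k, v in params.items():
            self.avg[k] = self.decay * self.avg[k] + (1 - self.decay) * v

# Noisy 'training': weights oscillate around 1.0; the EMA sits near the mean.
rng = np.random.default_rng(0)
params = {"w": np.ones(8)}
ema = WeightEMA(params, decay=0.99)
for _ in range(2000):
    params["w"] = 1.0 + 0.5 * rng.normal(size=8)   # stand-in for an SGD step
    ema.update(params)
```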

Attention

Sparse and structured attention variants.

XSA

aka Cross-Sparse Attention

Attention with structured sparse Q/K/V projections that share parameters across heads and positions.

The dominant attention architecture on the leaderboard — 3 of the top 6 slots use an XSA variant. The structural sparsity is an inductive bias, not a post-hoc pruning.

XSA-all

Cross-sparse attention applied to every layer of the transformer.

The most aggressive XSA variant. Uniform parameter cost across layers, relies on quantization elsewhere to recover any lost precision. Powered the #1 submission.

#1 Self-Gen GPTQ + XSA-all · 1.1147 BPB → editorial

XSA4

Cross-sparse attention applied only to the last 4 layers.

Intuition: deeper layers do more composition and less precise retrieval, so they tolerate sparse attention better than shallow layers.

#5 XSA4 + EMA + Int6 · 1.1271 BPB → editorial

Partial XSA

aka Efficient Partial XSA

XSA applied to the 3 deepest layers only.

unnir's variant — even more selective than XSA4. Free parameters go into bigger MLPs or more layers elsewhere.

#6 Efficient Partial XSA · 1.1307 BPB → editorial

SWA

aka Sliding Window Attention

Each token attends only to the previous N tokens rather than the entire context.

Standard efficient-attention trick. Several early submissions use it as a baseline before being replaced by XSA variants in later PRs.

QK-Gain

Learned scalar gain applied to the QK product before softmax.

Lets the model learn how peaked the attention distribution should be. Appears in several recent top submissions (QK-Gain 5.0, QK-Gain 5.25).

FlashAttention / FA3

Hardware-efficient attention implementation that avoids materializing the full attention matrix.

Free to import (it's a library), so Parameter Golf submissions use it freely for runtime speed without paying parameter cost.

Position encoding

Encoding token order without spending parameters.

RoPE

aka Rotary Position Embeddings

Encodes position by rotating Q and K vectors in fixed 2D subspaces.

Default modern approach. Fixed, not learned — so zero parameter cost. Used as the baseline in Parameter Golf.
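A compact numpy sketch. Two properties make RoPE attractive here: it has no learned parameters, and the rotated Q·K dot product depends only on the position gap, which is what lets fixed rotations stand in for learned position embeddings:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive (even, odd) dimension pairs of x by
    position-dependent angles; zero learned parameters."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2D subspace
    theta = pos * inv_freq
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin           # 2D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
```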

Partial RoPE

Rotate only a subset of the Q/K dimensions (e.g. 16 out of 64), leaving the rest position-invariant.

Creates a split specialization: rotated dims learn position-sensitive features, non-rotated dims learn position-agnostic features. Helps small models by making each head more interpretable and easier to train.

#4 Partial RoPE + LN Scale · 1.1248 BPB → editorial

YaRN

A RoPE variant designed to extrapolate to longer sequences than training.

Appears in the ternary / 1-bit submissions that evaluate at long context.

Training & optimization

Optimizers, schedules, weight decay.

Muon

Recent (late 2024) optimizer that orthogonalizes the momentum buffer via a Newton-Schulz iteration before the update.

Produces updates with well-conditioned singular values. In the low-data, few-step regime of Parameter Golf (and especially TTT), this per-step cleanliness matters more than AdamW-style smoothing over many steps.
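The orthogonalization step can be sketched with the classic cubic Newton-Schulz iteration (Muon's production kernel uses a tuned higher-order polynomial, so treat this as illustrative):

```python
import numpy as np

def newton_schulz_orth(G, steps=10):
    """Push G's singular values toward 1 using only matmuls, approximating
    the nearest orthogonal matrix (U V^T from G's SVD)."""
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius norm bounds the spectral norm
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # classic cubic Newton-Schulz step
    return X

rng = np.random.default_rng(0)
G = rng.normal(size=(32, 32))            # stand-in for a momentum buffer
O = newton_schulz_orth(G)
```

Each step strictly shrinks the ratio between the largest and smallest singular value, which is the "well-conditioned update" property.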

MuonEq-R

An equivariant / regularized Muon variant used in several recent top submissions.

Exact details vary by implementation; appears in Kevin Clark's #5 run and dexhunter's 1.0912 entry.

NeoMuon

Another Muon derivative, paired with ternary quantization in CiprianFlorin-Ifrim's entries.

Less documented than baseline Muon; optimizer space for Parameter Golf is still evolving.

Parallel Muon

Running Muon's orthogonalization in parallel across parameter blocks.

Makes the optimizer cheap enough to use during TTT, where you do 5-10 gradient steps per evaluation token.

#2 LeakyReLU² + TTT + Muon · 1.1194 BPB

AdamW

The workhorse optimizer: Adam with decoupled weight decay.

Still used for the main pretraining loop in most submissions. Shines over long horizons with many gradient steps.

Weight Decay (WD)

L2-style regularization applied to weights during the optimizer update.

Tuning WD is a surprisingly big lever: several submissions label themselves explicitly with the WD value (WD=0.040, WD=0.090). Higher WD seems to help when combined with aggressive quantization.

Warmdown

Schedule that linearly decays the learning rate from peak to near-zero at the end of training.

'Warmdown3500' means the last 3500 steps are pure LR decay. Analogous to cosine decay but simpler and sometimes tuned per-submission.
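As a function of the step count (the totals and peak LR here are illustrative):

```python
def warmdown_lr(step, total_steps, peak_lr, warmdown_steps=3500):
    """Hold at peak_lr, then decay linearly to zero over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return peak_lr
    return peak_lr * (total_steps - step) / warmdown_steps

lr = warmdown_lr(8_250, 10_000, 1e-3)   # halfway through the warmdown
```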

Hessian-Aware SDClip

Stochastic descent with clipping informed by a Hessian approximation.

Robby Sneiderman's 1.0835 submission uses this to get cleaner gradient updates without paying the full Newton cost.

OrthoInit

aka Orthogonal Initialization

Initializing weight matrices as (random) orthogonal matrices.

Preserves signal norms through the forward pass at init, which helps small models converge faster. Appears in several of the mid-tier submissions (Raahil Shah #8, aquariouseworkman).
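A standard QR-based sketch of the initializer:

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, rng=None):
    """Random orthogonal init via QR decomposition of a Gaussian matrix."""
    if rng is None:
        rng = np.random.default_rng()
    rows, cols = shape
    a = rng.normal(size=(max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))   # sign fix so Q is uniformly distributed
    if rows < cols:
        q = q.T
    return gain * q

W = orthogonal_init((64, 64), rng=np.random.default_rng(0))
x = np.random.default_rng(1).normal(size=64)   # W @ x has the same norm as x
```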

Spectral embed init

Initializing the embedding table from the top singular vectors of a bigram count matrix.

Gives the model a 'warm start' on word co-occurrence structure before training even begins.
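A sketch of the idea; the log1p smoothing and the scaling by singular values are assumptions, not necessarily what any submission does:

```python
import numpy as np

def spectral_embed_init(bigram_counts, d_model):
    """Initialize token embeddings from the top singular vectors of the
    (log-smoothed) bigram count matrix."""
    M = np.log1p(bigram_counts)                     # tame heavy-tailed counts
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :d_model] * np.sqrt(S[:d_model])    # scale by singular values

rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(1024, 1024))        # stand-in bigram counts
E = spectral_embed_init(counts, d_model=64)         # one row per token
```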

Test-time training

Learning at inference time — the rule carve-out.

TTT

aka Test-Time Training

The model takes a few gradient steps on recent tokens before predicting each new token at eval time.

Under Parameter Golf rules this is legal as long as you only train on validation tokens you've already scored. Shifts capacity from 'stored knowledge' to 'adaptation ability' — a much tighter encoding for small models.

Legal Score-First TTT

The 'compliant' TTT variant: score every token first, then update weights using those tokens, then move on.

Enforces the 'only TTT on graded tokens' rule. Appears in almost all the recent (<1.10) top submissions.
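The control flow can be sketched as follows; `_Probe` is a hypothetical stand-in model used only to demonstrate the score-then-train ordering:

```python
def score_first_ttt(model, tokens, chunk=64):
    """Score each chunk with the current weights FIRST, then adapt on it;
    no token is ever trained on before it has been graded."""
    total_loss, n_chunks = 0.0, 0
    for i in range(0, len(tokens), chunk):
        batch = tokens[i:i + chunk]
        total_loss += model.loss(batch)   # grade with pre-update weights
        n_chunks += 1
        model.train_step(batch)           # only now is it legal to learn from it
    return total_loss / n_chunks

class _Probe:
    """Toy model that just records the order of score/train calls."""
    def __init__(self):
        self.log = []
    def loss(self, batch):
        self.log.append(("score", batch[0]))
        return 1.0
    def train_step(self, batch):
        self.log.append(("train", batch[0]))

probe = _Probe()
avg = score_first_ttt(probe, list(range(12)), chunk=4)
```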

Pre-Quant TTT

Run TTT in the unquantized (higher-precision) weight space, then quantize only at the end.

Preserves the gradient signal during adaptation while still shipping a tiny final artifact. Powers the pending 0.8265 BPB record claim.

LoRA TTT

Test-time training restricted to low-rank adapter matrices on top of the frozen base weights.

Samacqua's early TTT submission. Cheaper than full TTT but also less expressive; superseded by Pre-Quant TTT in later entries.

E2E TTT

aka End-to-End TTT

Training the entire model at test time, not just adapters or specific layers.

Listed in OpenAI's requests for PRs — nobody has landed a strong E2E TTT submission yet.

Architecture

Layer patterns, activations, residual tricks.

SmearGate

A gated MLP activation that expands effective dimensionality by sharing gating weights across adjacent channels.

Drop-in replacement for SwiGLU. Gives more MLP capacity without growing the parameter count. Paired with a 3x MLP expansion ratio (vs conventional 4x).

#8 SmearGate + BigramHash · 1.1458 BPB → editorial

LeakyReLU²

The element-wise square of a LeakyReLU activation.

Squaring adds a second-order term for more expressiveness without parameters; the LeakyReLU tail ensures gradients still flow on the negative side. Used in abaybektursun's #2 submission.

Parallel Residuals

Compute attention and MLP blocks of a layer in parallel rather than sequentially, then sum them into the residual stream.

First shown to work in GPT-J and PaLM. Saves one LayerNorm per layer; lets attention and MLP operate on the same input, which helps in the low-capacity regime.

Depth Recurrence

aka Looped Layers

Reuse the same layer multiple times in the forward pass.

6 unique layers applied 2x each = 6 layers' worth of parameters, 12 layers' worth of computation. Direct capacity win for tiny models, and it pairs naturally with TTT.
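A toy sketch of the parameter/compute split (the residual-tanh "layer" stands in for a real transformer block):

```python
import numpy as np

def looped_forward(x, layers, loops=2):
    """Apply the same stack of unique layers `loops` times:
    parameters for len(layers) layers, depth of len(layers) * loops."""
    for _ in range(loops):
        for W in layers:
            x = x + np.tanh(x @ W)   # toy residual block
    return x

rng = np.random.default_rng(0)
layers = [0.1 * rng.normal(size=(16, 16)) for _ in range(6)]  # 6 unique layers
y = looped_forward(np.ones(16), layers, loops=2)              # 12 applications
```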

Universal Transformer

Extreme depth recurrence: a single layer applied many times.

Listed in OpenAI's open requests for PRs — nobody has landed a strong long-form Universal Transformer submission yet.

Value Residuals

Adding a residual connection on the value tensor across attention layers.

Helps information flow through deep stacks; appears in several mid-tier submissions though it was later removed in PR #1218 as a simplification.

LN Scale

aka Layer-wise LayerNorm scaling

Replace LayerNorm's learned affine parameters with a single scalar multiplier per layer.

Saves 2d parameters per layer with no measurable quality loss.

#4 Partial RoPE + LN Scale · 1.1248 BPB → editorial
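A sketch: one learned scalar per layer in place of LayerNorm's per-channel gamma and beta:

```python
import numpy as np

def ln_scale(x, g, eps=1e-5):
    """LayerNorm with the learned per-channel gamma/beta (2*d params)
    replaced by a single scalar multiplier g (1 param)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return g * (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(size=(4, 64))
y = ln_scale(x, g=1.7)   # rows are zero-mean with std == g
```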

U-Net

Encoder-decoder architecture with skip connections between matching-depth encoder and decoder layers.

Unusual for a language model — but CiprianFlorin-Ifrim's ternary and binary submissions use it successfully, trading conventional depth for a resolution hierarchy.

Tokenization & embeddings

Vocab size, embedding precision, hashing.

SP1024 / SP4096 / SP8192

aka SentencePiece vocabularies

SentencePiece BPE vocabulary sizes. Larger vocabularies mean bigger embeddings but fewer tokens per sequence.

Baseline uses SP1024. Later submissions found that SP4096 and even SP8192 are worth the extra embedding cost — especially when paired with aggressive embedding quantization.

Tied Embeddings

Share the token embedding matrix with the output projection.

Halves the embedding parameter cost. Standard in the Parameter Golf baseline.

FP16 Embed

Keep the embedding table in FP16 while other weights go lower.

Embeddings are noise-sensitive; leaving them in FP16 protects quality while deeper quantization hits the rest of the model. Renier Velazco's 1.2197 submission was an early win on this.

FP8

8-bit floating point (typically E4M3 or E5M2 format).

Used for embeddings or intermediate activations in several late submissions. Cheaper than FP16, more dynamic range than INT8.

BigramHash

Hash bigrams into a small shared embedding table instead of learning dedicated bigram embeddings.

Gives the model bigram awareness at essentially zero parameter cost. Appears in submissions #7 and #8, and in the #1 entry as 'BigramHash3072'.
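A sketch with a hypothetical multiplicative hash (the real submissions' hash function may differ):

```python
import numpy as np

def bigram_hash_embed(token_ids, table):
    """Shared embedding for each (prev, cur) bigram via a cheap hash.
    Cost is n_buckets * d params, independent of vocab size squared."""
    n_buckets = table.shape[0]
    prev = np.concatenate(([0], token_ids[:-1]))          # previous token, BOS=0
    buckets = (prev * 1_000_003 + token_ids) % n_buckets  # hypothetical hash
    return table[buckets]

rng = np.random.default_rng(0)
table = 0.02 * rng.normal(size=(3072, 32))   # 3072 buckets, d_model=32
ids = rng.integers(0, 4096, size=128)        # token ids from an SP4096-style vocab
bigram_vecs = bigram_hash_embed(ids, table)  # typically added to token embeddings
```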

Hash Embeddings

Use a hash function to map tokens into a smaller learned embedding table.

Used in earlier submissions but dropped as a simplification in PR #1218 — probably because larger SP vocabularies made it less necessary.

Evaluation

The metric, windowing, and what you're scored on.

BPB

aka Bits Per Byte

The Parameter Golf metric. Tokenizer-agnostic measure of cross-entropy, normalized by raw text bytes.

BPB = (cross-entropy in nats) × (tokens / bytes) / ln(2). Lower is better. Naive baseline 1.2244, current SOTA under 1.10, pending claims below 0.83.
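The formula in code (the 3.40 nats/token and ~4 bytes/token figures are illustrative):

```python
import math

def bits_per_byte(ce_nats_per_token, n_tokens, n_bytes):
    """BPB = (cross-entropy in nats) * (tokens / bytes) / ln(2)."""
    return ce_nats_per_token * (n_tokens / n_bytes) / math.log(2)

# e.g. 3.40 nats/token on text that tokenizes at ~4 bytes/token:
bpb = bits_per_byte(3.40, n_tokens=1_000_000, n_bytes=4_000_000)
```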

Sliding Window Eval

At evaluation time, slide a context window over the validation stream with a small stride (e.g. 64 tokens).

Each token gets predicted with the maximum possible context behind it. Gives a meaningful accuracy bump for free — Matthew Li's 1.1925 submission was an early win purely from this eval trick.
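The span bookkeeping can be sketched as follows (window/stride values illustrative): the first window scores all of its tokens; each later window slides by `stride` and scores only its last `stride` tokens, so every scored token sees up to window-1 tokens of prior context:

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    """Return (ctx_start, ctx_end, score_start) triples covering the stream.
    Tokens in [score_start, ctx_end) are scored; the rest is context only."""
    end = min(window, n_tokens)
    spans = [(0, end, 0)]
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        spans.append((max(0, new_end - window), new_end, end))
        end = new_end
    return spans

spans = sliding_window_spans(200, window=128, stride=64)
```

Every token is scored exactly once, but the model pays `window / stride` forward passes per window of text, which is the compute cost of the trick.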

FineWeb val split

The fixed first-50k-document subset of FineWeb used as the held-out validation set.

Not the full FineWeb validation — a specific deterministic subset so runs are comparable. You're not allowed to train on any of its tokens before evaluating on them.