Quantization: from FP32 to ternary
When every megabyte counts, you stop storing weights as 32-bit floats. Parameter Golf contestants are squeezing 70M+ parameter models into 16MB by rethinking the precision stack from the ground up.
Why quantization is the dominant strategy
A 70M parameter model in FP32 is 280MB — 17.5 times the 16MB budget. At INT8 it's still 70MB. Even at INT4 you're at 35MB, more than double the artifact limit, and even INT2 lands at 17.5MB. So contestants have to push the effective cost below ~1.8 bits per weight, and they have to do it without destroying model quality.
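The budget arithmetic is worth scripting once. A quick sanity check (pure Python, sizes in decimal megabytes; the 16MB budget and 70M count come from the text above):

```python
PARAMS = 70_000_000
BUDGET_MB = 16

for name, bits in [("FP32", 32), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
    mb = PARAMS * bits / 8 / 1e6  # bits -> bytes -> MB
    print(f"{name}: {mb:5.1f} MB ({mb / BUDGET_MB:.1f}x budget)")

# The break-even point: bits/weight that exactly fills the budget.
print(f"break-even: {BUDGET_MB * 1e6 * 8 / PARAMS:.2f} bits/weight")
```

Note that even uniform 2-bit storage overshoots; anything that fits must average under ~1.83 bits per weight, which is why compression and sub-2-bit schemes dominate below.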
Six of the top 10 submissions use quantization as their primary lever. The rest use it as a secondary lever layered on top of architecture or training tricks.
The #1 trick: self-generated GPTQ calibration
Submission #1 · 1.1147 BPB · abaybektursun
Self-Gen GPTQ + XSA-all
GPTQ is a post-training quantization method that uses a calibration dataset to decide which weights can be rounded aggressively and which need to stay precise. The standard practice is to use a held-out slice of the training corpus as the calibration set.
abaybektursun's insight: don't use the training corpus — use samples generated by the model itself. These samples capture the distribution the model actually operates in, including its own biases and failure modes, and produce a calibration signal that's tightly aligned with the weights being quantized. The result is better error recovery at extreme bit-widths.
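The submission's exact pipeline isn't reproduced here, but the core idea — weight the rounding error by statistics gathered from the model's own outputs — can be sketched in a toy form. Everything below is illustrative: `self_generated_calibration` stands in for activations captured while the model processes its own samples, and the quantizer keeps only a crude diagonal version of GPTQ's Hessian-weighted objective:

```python
import numpy as np

rng = np.random.default_rng(0)

def self_generated_calibration(n, dim):
    # Stand-in for activations recorded while the model runs on text it
    # generated itself; the tanh + per-dim scaling is purely illustrative.
    return np.tanh(rng.normal(size=(n, dim)) * rng.uniform(0.2, 2.0, size=dim))

def calibration_aware_quantize(w, calib_x, bits=3):
    # Weight each input dimension by its calibration activation energy
    # (the diagonal of X^T X / n), then grid-search one scale minimizing
    # the weighted rounding error. Real GPTQ uses the full Hessian and
    # per-column error correction; this keeps only the core idea.
    energy = (calib_x ** 2).mean(axis=0)
    qmax = 2 ** (bits - 1) - 1
    base = np.abs(w).max() / qmax
    best = None
    for s in base * np.linspace(0.5, 1.5, 41):
        q = np.clip(np.round(w / s), -qmax - 1, qmax)
        err = (((w - q * s) ** 2) * energy).sum()
        if best is None or err < best[0]:
            best = (err, q.astype(np.int8), s)
    return best[1], best[2]

dim = 32
w = rng.normal(size=(16, dim)).astype(np.float32)
calib = self_generated_calibration(256, dim)  # "the model's own samples"
q, scale = calibration_aware_quantize(w, calib, bits=3)
print(q.min(), q.max(), round(float(scale), 4))
```

The point of the weighting: input dimensions the model actually excites get their rounding error penalized harder, so the scale search spends precision where the self-generated distribution says it matters.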
Combined with all-layer cross-sparse attention (covered in a separate editorial), this put them at #1 with a 9% improvement over baseline.
The mixed-precision spectrum
Not every weight matters equally: attention projections behave differently from MLP weights, and embeddings have a different noise tolerance than output heads. Top submissions exploit this:
- Int5/Int6 mixed (thwu1, #7): critical layers at int6, MLP weights at int5, saving ~15% over uniform int6.
- GPTQ-lite + EMA (signalrush, #3): exponential moving averages on the quantization error during training, so the model adapts to its own rounding.
- QAT with sliding window (aruniyer, #9): quantization-aware training evaluated on a sliding window so the quant error gradient stays stable.
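The size accounting behind a mixed-precision split is simple to sketch. The layer names and parameter counts below are hypothetical (not thwu1's actual model), chosen only to show how an int5/int6 assignment is tallied against uniform int6:

```python
# Illustrative layer table: attention projections kept at 6 bits,
# MLP weights dropped to 5 bits. Counts are made up for the sketch.
layers = {
    "attn.q_proj": 4_194_304, "attn.k_proj": 4_194_304,
    "attn.v_proj": 4_194_304, "attn.o_proj": 4_194_304,
    "mlp.up": 16_777_216, "mlp.down": 16_777_216,
}

def bits_for(name):
    # "Critical" layers (attention) get the extra bit.
    return 6 if name.startswith("attn.") else 5

mixed = sum(n * bits_for(name) for name, n in layers.items())
uniform = sum(n * 6 for n in layers.values())
print(f"mixed: {mixed/8/1e6:.2f} MB, uniform int6: {uniform/8/1e6:.2f} MB, "
      f"saved: {100 * (1 - mixed / uniform):.0f}%")
```

With these made-up proportions the saving is ~11%; the ~15% figure quoted for the actual submission implies MLP weights make up a larger share of its parameters.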
The extreme: ternary + LZMA
Submission #10 · 1.1570 BPB · CiprianFlorin-Ifrim
Ternary U-Net 73.7M
The most radical entry in the top 10 quantizes all 73.7M weights to just three values: {-1, 0, 1}. In theory that's log₂(3) ≈ 1.58 bits per weight, but you can't pack trits that tightly in practice — so the submission stores them as 2-bit codes (four weights per byte) and then compresses the whole thing with LZMA, which exploits the high ratio of zeros to drive effective bits-per-weight below 1.
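The pack-then-compress step is easy to demonstrate end to end with the standard library. The weight distribution below is illustrative (the submission's actual zero ratio isn't stated); the mechanics — four 2-bit codes per byte, then `lzma` over the packed bytes — are the technique described above:

```python
import lzma
import numpy as np

rng = np.random.default_rng(0)

# Toy ternary weights: mostly zeros, occasional +/-1.
# The 80% zero ratio is an assumption, not the submission's figure.
n = 1_000_000
w = rng.choice([-1, 0, 1], size=n, p=[0.1, 0.8, 0.1]).astype(np.int8)

# Map -1 -> 2, keep 0 and 1, then pack four 2-bit codes per byte.
codes = np.where(w == -1, 2, w).astype(np.uint8)
packed = (codes[0::4] | (codes[1::4] << 2)
          | (codes[2::4] << 4) | (codes[3::4] << 6))

raw_bpw = packed.nbytes * 8 / n          # 2.0 bits/weight before compression
compressed = lzma.compress(packed.tobytes(), preset=9)
eff_bpw = len(compressed) * 8 / n        # effective bits/weight after LZMA
print(f"packed: {raw_bpw:.2f} bpw, after LZMA: {eff_bpw:.2f} bpw")
```

The Shannon entropy of this toy distribution is about 0.92 bits per weight, so a general-purpose compressor has real headroom below the 2-bit packed encoding — the sparser the ternary weights, the further below 1 bpw the effective rate falls.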
It's the largest parameter count in the top 10 by a wide margin. The lesson: sometimes the right answer is "more parameters, each one cheaper."
What's next
The pending PRs claim scores as low as 0.83 BPB using what's described as "pre-quant AdamW TTT" — training the model in a quantization-aware regime before applying the final quantization pass, with the Muon and AdamW optimizers alternating. If confirmed, it would be a ~25% improvement over the current #1 — in a competition where the top 10 are separated by less than 5%.
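The pending PRs' exact recipe isn't public, so nothing below is their method. What can be sketched is the generic ingredient the description implies — training with fake-quantized weights in the forward pass while gradients update the full-precision copy (the straight-through estimator). Plain SGD on a toy least-squares problem stands in for the Muon/AdamW alternation:

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_quant(w, bits=3):
    # Forward pass uses rounded weights; backward (below) pretends
    # rounding is the identity -- the straight-through estimator.
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(w).max() / qmax + 1e-12
    return np.clip(np.round(w / s), -qmax - 1, qmax) * s

# Toy regression problem; the data and SGD loop are illustrative only.
x = rng.normal(size=(256, 8))
true_w = rng.normal(size=8)
y = x @ true_w

w = np.zeros(8)
lr = 0.05
for _ in range(500):
    wq = fake_quant(w)                       # quantized forward
    grad = 2 * x.T @ (x @ wq - y) / len(x)   # gradient at the quantized point
    w -= lr * grad                           # applied to full-precision w (STE)

loss_qat = ((x @ fake_quant(w) - y) ** 2).mean()       # quantization-aware
loss_ptq = ((x @ fake_quant(true_w) - y) ** 2).mean()  # post-hoc rounding
print(round(float(loss_qat), 3), round(float(loss_ptq), 3))
```

The contrast between the two losses is the argument for training in a quantization-aware regime: the QAT weights are chosen knowing they will be rounded, while post-hoc quantization inherits whatever rounding error the full-precision solution happens to incur.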