Editorial · #1, #5, #6 on leaderboard

Cross-Sparse Attention

Attention is quadratic in sequence length and expensive in parameters. Parameter Golf contestants discovered that most of those parameters are dead weight — and that replacing them with structured sparsity unlocks real capacity.

The problem with full attention in a 16MB budget

Standard multi-head attention stores three dense projection matrices (Q, K, V) per layer, each of size d × d. At d=512, that's 3 × 512² ≈ 0.79M parameters per layer, or ~9.4M across 12 layers in attention projections alone, before you even get to MLPs. For a model that has to live under 16MB post-quantization, that math doesn't work.
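The arithmetic is easy to check in a few lines (a quick sanity check, nothing model-specific):

```python
# Sanity check on the attention parameter count quoted above.
d, layers = 512, 12

qkv_per_layer = 3 * d * d       # dense Q, K, V projections, each d x d
total = qkv_per_layer * layers  # across the whole 12-layer stack

print(f"{qkv_per_layer:,} per layer")  # 786,432 (~0.79M)
print(f"{total:,} across the model")   # 9,437,184 (~9.4M)
```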

Cross-sparse attention (XSA) is the answer: instead of dense Q/K/V projections, use structured sparse ones that share parameters across heads and across token positions. You get most of the representational power at a fraction of the parameter cost.
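The write-ups don't publish XSA's exact weight structure, but one way to realize "share parameters across heads" is a block-diagonal projection in which a single small weight block is reused by every head. A minimal NumPy sketch, where the function name and the specific sharing scheme are my illustrative assumptions, not the winning code:

```python
import numpy as np

def shared_block_projection(x, block, n_heads):
    """Hypothetical XSA-style projection: one small block of weights,
    shared by all heads and applied block-diagonally, replaces a dense
    d x d matrix. Cost: (d/n_heads)^2 params instead of d^2."""
    t, d = x.shape
    head_dim = d // n_heads
    xs = x.reshape(t, n_heads, head_dim)   # split channels into heads
    return (xs @ block).reshape(t, d)      # same block for every head

rng = np.random.default_rng(0)
d, n_heads = 512, 8
head_dim = d // n_heads                    # 64

block = rng.normal(scale=0.02, size=(head_dim, head_dim))
x = rng.normal(size=(16, d))               # 16 token positions
y = shared_block_projection(x, block, n_heads)

print(y.shape)    # (16, 512)
print(block.size) # 4096 params, vs. 262144 for a dense 512x512 matrix
```

A real implementation would presumably break symmetry between heads some other way (e.g. per-head scaling or a sparse mask), since fully tied heads compute identical projections; the point here is only the parameter accounting.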

XSA-all: the #1 approach

Submission #1 · 1.1147 BPB · abaybektursun

Self-Gen GPTQ + XSA-all

The leaderboard-winning configuration applies cross-sparse attention to every layer of the transformer, not just a subset. This is aggressive: most prior work on sparse attention targets the first or last few layers, where the representational load is highest. XSA-all compresses attention uniformly and relies on the self-generated GPTQ quantization (see the quantization editorial) to recover lost precision.

The trade-off is clear: you lose the option to have full attention anywhere, but every layer runs at the same reduced parameter cost, which is exactly what a tight memory budget rewards.

Partial XSA: the selective approach

Submission #5 · 1.1271 BPB · jfprincz — XSA4 + EMA + Int6

Submission #6 · 1.1307 BPB · unnir — Efficient Partial XSA

Two different submissions converged on a similar insight from opposite directions: apply XSA only to the deepest layers. jfprincz applies it to the last 4 layers; unnir to the 3 deepest. The intuition: deeper layers do less retrieval and more composition, so they tolerate sparse attention patterns better than shallow layers, which need to look up tokens precisely.
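The layer-selection logic amounts to a one-line rule. A sketch (the function name and the "dense"/"xsa" labels are illustrative, not the authors' code):

```python
def choose_attention(n_layers: int, xsa_last_k: int) -> list[str]:
    """Use sparse (XSA) attention only in the deepest k layers,
    dense attention everywhere else."""
    return ["xsa" if i >= n_layers - xsa_last_k else "dense"
            for i in range(n_layers)]

print(choose_attention(12, 4))  # jfprincz-style: last 4 layers sparse
print(choose_attention(12, 3))  # unnir-style: deepest 3 layers sparse
```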

These partial approaches give up 0.012–0.016 BPB versus XSA-all but let the authors combine them with more aggressive tricks elsewhere (int6 quantization, EMA, etc.). The result: nearly the same BPB, achieved through a completely different trade-off.

Why it works

The classical story about attention — that heads specialize, that Q and K learn distinct semantic roles — was always an overstatement for small models. At 70M parameters, most heads are redundant. XSA exploits that redundancy: instead of training many dense heads and hoping they diverge, it enforces a sparse structure from the start, which acts as a strong inductive bias.

For tiny language models, this is a capacity amplifier. The attention parameters you save can be spent on more MLP width, more layers, or better embeddings — all of which turn out to matter more than a marginally fuller attention matrix.
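Some illustrative budget arithmetic makes the "capacity amplifier" claim concrete. Assuming a hypothetical sharing scheme where each dense d × d projection is replaced by one shared (d/heads) × (d/heads) block (my assumption, not the submissions' exact structure):

```python
# Illustrative budget arithmetic; the numbers are mine, not the authors'.
d, layers, n_heads = 512, 12, 8
head_dim = d // n_heads                # 64

dense_qkv = 3 * d * d                  # dense Q/K/V params per layer
xsa_qkv = 3 * head_dim * head_dim      # shared-block Q/K/V per layer
saved = (dense_qkv - xsa_qkv) * layers

# A transformer MLP with hidden width h costs ~2*d*h params per layer,
# so the savings buy roughly this much extra hidden width per layer:
extra_hidden = saved // (2 * d * layers)

print(f"saved: {saved:,}")               # saved: 9,289,728
print(f"extra MLP width: {extra_hidden}")  # extra MLP width: 756
```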

What this implies outside the competition

If cross-sparse attention holds up at scale, it reframes efficient inference work like FlashAttention and Multi-Query Attention as special cases of a broader principle: attention is structurally over-parameterized, and the right move is to constrain it before you train, not prune it after.