Guide · ~15 min to first run

Getting started with Parameter Golf

Train your first 16MB language model. This guide walks through the official flow from the parameter-golf repo: an MLX smoke test on an Apple Silicon Mac, then scaling up to a 1×H100 on Runpod.

Setup instructions, commands, and rule text on this page are adapted from the openai/parameter-golf README, licensed under the MIT License. Copyright © OpenAI. Any errors in summarization are ours.

The challenge, in one paragraph

Train a language model whose final artifact (code + compressed weights) fits in 16,000,000 bytes (decimal megabytes, not MiB) and trains in under 10 minutes on 8×H100 SXM. The score is bits-per-byte (BPB) on the FineWeb validation set; lower is better. The naive baseline is 1.2244 BPB, the current SOTA is under 1.10, and pending PRs claim below 0.83.
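Bits-per-byte normalizes the model's cross-entropy loss by the raw byte length of the validation text, so tokenizer choice can't game the metric. A minimal sketch of the conversion (the numbers in the example are illustrative, not from the real FineWeb split):

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy loss (in nats, over all predicted
    tokens) into bits per byte of the underlying validation text."""
    return total_loss_nats / (math.log(2) * total_bytes)

# Example: 1.0e6 nats of total loss over a ~1.1 MB validation slice.
score = bits_per_byte(1.0e6, 1_100_000)
```

Because the denominator is bytes rather than tokens, a model with a larger vocabulary pays for its fatter embedding table but gets no free ride on per-token loss.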

The challenge runs March 18 → April 30, 2026. OpenAI is sponsoring $1M in compute credits via an application form, and intends to hire a small cohort of early-career researchers from standout participants in June.

Prerequisites

  • Python 3.10+ and git
  • Either an Apple Silicon Mac (for local MLX smoke testing) or a cloud GPU account. Runpod is the officially recommended option.
  • ~20GB free disk for the cached FineWeb dataset (or ~2GB if you pass --train-shards 1)
  • Optional: SSH key configured for Runpod

1. Clone & set up the repo

On a Mac with Apple Silicon, create a fresh virtualenv and install the MLX stack:

git clone https://github.com/openai/parameter-golf.git
cd parameter-golf
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install mlx numpy sentencepiece huggingface-hub datasets tqdm

Not on Apple Silicon? The README suggests asking Codex to refactor train_gpt_mlx.py off MLX — it's a straightforward change. Or skip straight to the Runpod path below.

2. Download the cached FineWeb

Use the 1024-token SentencePiece tokenizer variant. For a local smoke test, grab just one training shard — full download is 80 shards / ~8B tokens.

# Smoke-test subset (fast):
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 1

# Full local dataset (8B tokens):
python3 data/cached_challenge_fineweb.py --variant sp1024

This populates ./data/datasets/fineweb10B_sp1024/ and ./data/tokenizers/. Validation always runs on the full fineweb_val_* split — the fixed first 50k documents.

3. Run the MLX smoke test

200 iterations on your Mac, no periodic validation, just a final BPB print:

RUN_ID=mlx_smoke \
ITERATIONS=200 \
TRAIN_BATCH_TOKENS=8192 \
VAL_LOSS_EVERY=0 \
VAL_BATCH_SIZE=8192 \
python3 train_gpt_mlx.py

You'll see a final val_loss and val_bpb at the end. It won't be competitive — the point here is to confirm the pipeline runs end-to-end before you spend any money on cloud compute.

4. Scale up: 1×H100 on Runpod

Final leaderboard submissions must run on 8×H100 SXM (~$20/hr) — but you don't need 8 GPUs to iterate. Start with a single H100, ~$2-3/hr, and only rent the full box when you're ready to time a real run.

  1. Create a Runpod account and add your SSH key under Settings.
  2. Deploy the official Parameter Golf Runpod template on a 1×H100 pod. Enable SSH terminal access; leave everything else at defaults.
  3. SSH in; you should land in /workspace/.

All Python dependencies are pre-installed in the image. Clone the repo onto local disk:

cd /workspace
git clone https://github.com/openai/parameter-golf.git
cd parameter-golf
python3 data/cached_challenge_fineweb.py --variant sp1024

5. Your first real training run

Launch the baseline config on the single H100. Note nproc_per_node=1 — bump it to 8 once you're on an 8×H100 box.

RUN_ID=baseline_sp1024 \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=1 train_gpt.py

Final val_bpb should land around 1.22 with a compressed model under 16MB. That's the naive baseline — your starting point, not a target.
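Since the 16,000,000-byte budget is decimal bytes over everything you ship, it's worth checking your artifact size locally before opening a PR. A small sketch (the checkpoint filename is a placeholder; list whatever your records folder actually contains):

```python
from pathlib import Path

LIMIT = 16_000_000  # decimal bytes, not 16 MiB (16,777,216)

def artifact_size(paths) -> int:
    """Sum the on-disk sizes of every file that counts toward the budget."""
    return sum(Path(p).stat().st_size for p in paths)

# Everything shipped in the records folder counts: the script itself plus
# the compressed weights. "weights.bin.zst" is an illustrative name.
files = [p for p in ("train_gpt.py", "weights.bin.zst") if Path(p).exists()]
total = artifact_size(files)
print(f"{total:,} bytes used of {LIMIT:,}")
```

Note that `st_size` reports exact bytes, so there is no rounding ambiguity between MB and MiB.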

Tip: the script has a ~10 minute wallclock cap by default. For longer experiments (non-record submissions), override with MAX_WALLCLOCK_SECONDS=0. For periodic validation logs, set VAL_LOSS_EVERY=200.

The rules you must not break

  • 16,000,000 bytes total: code + compressed weights. Decimal bytes, not MiB.
  • Under 10 min on 8×H100 SXM: the wallclock training budget.
  • Eval must run in ≤10 min: in addition to the training budget. Any sequence length is allowed.
  • No network at eval: the artifact must be fully self-contained and reproducible.
  • No validation leakage: you can't train on val tokens before you've graded them.
  • All counted code in train_gpt.py: external Python libraries are free to import, within reason.

Subtle point on test-time training (TTT): you may train on validation tokens at eval time, but only on tokens you've already scored. Training on a val token before scoring it is cheating.
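The legal ordering can be sketched as: for each validation chunk, grade it first, then (optionally) take an update step on it. A minimal sketch, where `score_chunk` and `train_step` are hypothetical stand-ins for your model's loss evaluation and gradient update:

```python
def evaluate_with_ttt(model, val_chunks, score_chunk, train_step):
    """Test-time training that respects the no-leakage rule:
    every chunk is scored BEFORE the model ever trains on it."""
    total_loss, total_bytes = 0.0, 0
    for chunk in val_chunks:
        loss, nbytes = score_chunk(model, chunk)  # grade first...
        total_loss += loss
        total_bytes += nbytes
        train_step(model, chunk)                  # ...then adapt on it
    return total_loss, total_bytes
```

Reversing the two calls inside the loop — adapting before grading — is exactly the leakage the rules forbid.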

Submitting a record

Submissions are pull requests that add a new folder under /records/track_10min_16mb/. Your PR must include:

  • A README.md explaining the submission
  • A submission.json with your name, GitHub ID, val_bpb, and metadata
  • A train log (typically averaged over 3 runs to show statistical significance)
  • Your train_gpt.py and any dependencies — the script must actually compile and run from inside the records folder
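The README requires name, GitHub ID, val_bpb, and metadata in submission.json, but the exact schema isn't reproduced here, so the field names and metadata keys below are illustrative guesses:

```python
import json

# Field names beyond the four the guide lists are assumptions, not
# the official schema -- check the existing records folders for the
# canonical layout before submitting.
submission = {
    "name": "Ada Lovelace",
    "github_id": "adal",
    "val_bpb": 1.0931,
    "metadata": {
        "runs_averaged": 3,
        "hardware": "8xH100 SXM",
    },
}

with open("submission.json", "w") as f:
    json.dump(submission, f, indent=2)
```

Mirroring a recently merged record's folder layout is the safest way to get the schema right.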

To be accepted as a record, your run must:

  • Beat the current SOTA by at least 0.005 nats, with enough run logs to show the improvement at p < 0.01
  • Reproducibly run in under 10 min on 8×H100 SXM
  • If you changed the tokenizer or dataset, prove rigorously that val_bpb is calculated correctly
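For the p < 0.01 requirement, one plausible check (not the challenge's official procedure) is a one-sided one-sample t-test of your run scores against the required threshold, i.e. current SOTA minus the 0.005 margin. A stdlib-only sketch with made-up numbers:

```python
import math
import statistics

def t_statistic(samples, threshold):
    """One-sample t statistic for H0: mean(samples) >= threshold."""
    n = len(samples)
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(n)
    return (mean - threshold) / sem

# Three made-up run scores vs. a hypothetical required 1.0950 BPB.
runs = [1.0901, 1.0898, 1.0910]
t = t_statistic(runs, 1.0950)
# Significant at p < 0.01 (one-sided) if t is below the negative critical
# value for n-1 = 2 degrees of freedom, which is -6.965. With only three
# runs the bar is high; more seeds shrink the critical value quickly.
```

With three runs and tiny variance the example clears the bar easily, but noisier runs will need more seeds, which is why the guide says logs are "typically averaged over 3 runs".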

Submissions that don't beat SOTA but demonstrate something unique can still land as non-record submissions, including in the separate unlimited-compute track.

The $1M compute grant

OpenAI is sponsoring $1M in compute credits to help people who couldn't otherwise afford an 8×H100 box. Use the Request a Compute Grant form. Apply with an email tied to your OpenAI / ChatGPT account and write a real justification.

If you're a student or early-career researcher, also fill out the Challenge Participant Form. OpenAI plans to hire a small cohort of early-career researchers in June 2026 from standout participants.

What to actually try first

Once your baseline run lands at ~1.22 BPB, you have the full leaderboard as a menu of ideas. The fastest way to understand what works is to read the CodeSOTA editorials on the dominant techniques.

Lost on an acronym? The methods dictionary defines every technique, optimizer, and quantization scheme that appears on the leaderboard.

OpenAI also maintains a list of "requested" experiments in the README — ideas they'd love to see implemented but haven't landed yet. Open slots as of writing: JEPA, text diffusion, H-net tokenization, universal transformer (long-form), megakernels, state-space models, and learning adapters on random linear maps. These are a good target for non-record submissions in the unlimited compute track.