Getting started with Parameter Golf
Train your first 16MB language model. This guide walks through the official flow from the parameter-golf repo: an MLX smoke test on an Apple Silicon Mac, then scaling up to a 1×H100 on Runpod.
The challenge, in one paragraph
Train a language model whose final artifact (code + compressed weights) fits in 16,000,000 bytes (decimal MB, not MiB) and trains in under 10 minutes on 8×H100 SXM. The score is bits-per-byte (BPB) on the FineWeb validation set — lower is better. The naive baseline is 1.2244 BPB, the current SOTA is under 1.10, and pending PRs claim below 0.83.
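BPB is just per-token cross-entropy renormalized by the byte count of the evaluated text, which makes the score tokenizer-independent. A minimal sketch of the conversion (the function name and the example numbers are mine, not the repo's):

```python
import math

def bits_per_byte(mean_nats_per_token: float, total_tokens: int, total_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) to bits-per-byte.

    total_bytes is the UTF-8 byte count of the validation text, so a
    tokenizer with fewer bytes per token must hit a lower per-token
    loss to reach the same BPB.
    """
    total_bits = mean_nats_per_token * total_tokens / math.log(2)
    return total_bits / total_bytes

# Example: a loss of 3.4 nats/token at ~4 bytes/token
# gives 3.4 / ln(2) / 4 ≈ 1.23 BPB.
```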
The challenge runs March 18 → April 30, 2026. OpenAI is sponsoring $1M in compute credits via an application form, and intends to hire a small cohort of early-career researchers from standout participants in June.
Prerequisites
- Python 3.10+ and git
- Either an Apple Silicon Mac (for local MLX smoke testing) or a cloud GPU account. Runpod is the officially recommended option.
- ~20GB free disk for the cached FineWeb dataset (or ~2GB if you pass --train-shards 1)
- Optional: SSH key configured for Runpod
Clone & set up the repo
On a Mac with Apple Silicon, create a fresh virtualenv and install the MLX stack:
git clone https://github.com/openai/parameter-golf.git
cd parameter-golf
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install mlx numpy sentencepiece huggingface-hub datasets tqdm
Not on Apple Silicon? The README suggests asking Codex to refactor train_gpt_mlx.py off MLX — it's a straightforward change. Or skip straight to the Runpod path below.
Download the cached FineWeb
Use the 1024-token SentencePiece tokenizer variant. For a local smoke test, grab just one training shard — full download is 80 shards / ~8B tokens.
# Smoke-test subset (fast):
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 1

# Full local dataset (8B tokens):
python3 data/cached_challenge_fineweb.py --variant sp1024
This populates ./data/datasets/fineweb10B_sp1024/ and ./data/tokenizers/. Validation always runs on the full fineweb_val_* split — the fixed first 50k documents.
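If you want to poke at what was downloaded, something like the following works, assuming each shard is a flat array of uint16 token ids (vocab size 1024 fits comfortably in 16 bits). The shard format and filename here are assumptions, so check the download script's actual output:

```python
import os
import tempfile
import numpy as np

def shard_stats(path):
    """Token count and max token id for one cached shard, assuming a
    flat uint16 token array (the on-disk format is an assumption)."""
    tokens = np.memmap(path, dtype=np.uint16, mode="r")
    return len(tokens), int(tokens.max())

# Demo on a synthetic 3-token shard; point it at a real file under
# data/datasets/fineweb10B_sp1024/ once the download finishes.
path = os.path.join(tempfile.gettempdir(), "demo_shard.bin")
np.array([5, 1023, 7], dtype=np.uint16).tofile(path)
print(shard_stats(path))  # (3, 1023)
```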
Run the MLX smoke test
200 iterations on your Mac, no periodic validation, just a final BPB print:
RUN_ID=mlx_smoke \
ITERATIONS=200 \
TRAIN_BATCH_TOKENS=8192 \
VAL_LOSS_EVERY=0 \
VAL_BATCH_SIZE=8192 \
python3 train_gpt_mlx.py
You'll see a final val_loss and val_bpb at the end. It won't be competitive — the point here is to confirm the pipeline runs end-to-end before you spend any money on cloud compute.
Scale up: 1×H100 on Runpod
Final leaderboard submissions must run on 8×H100 SXM (~$20/hr) — but you don't need 8 GPUs to iterate. Start with a single H100, ~$2-3/hr, and only rent the full box when you're ready to time a real run.
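The back-of-envelope budget from those prices (the $2.50/hr single-H100 figure is just the midpoint of the $2-3/hr range above):

```python
# Rough budget math from the prices quoted above.
H100X8_PER_HR = 20.0   # 8xH100 SXM
H100X1_PER_HR = 2.5    # single H100, midpoint of $2-3/hr

timed_run = H100X8_PER_HR * (10 / 60)   # one 10-minute record attempt
iter_hour = H100X1_PER_HR               # one hour of single-GPU iteration

print(f"timed 8xH100 run: ~${timed_run:.2f}")       # ~$3.33
print(f"1xH100 iteration: ~${iter_hour:.2f}/hr")
```

In other words, a full timed attempt costs about as much as 80 minutes of single-GPU iteration, so most of your experimentation should happen on the small box.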
- Create a Runpod account and add your SSH key under Settings.
- Deploy the official Parameter Golf Runpod template on a 1×H100 pod. Enable SSH terminal access; leave everything else at defaults.
- SSH in; you should land in /workspace/.
All Python dependencies are pre-installed in the image. Clone the repo onto local disk:
cd /workspace
git clone https://github.com/openai/parameter-golf.git
cd parameter-golf
python3 data/cached_challenge_fineweb.py --variant sp1024
Your first real training run
Launch the baseline config on the single H100. Note nproc_per_node=1 — bump it to 8 once you're on an 8×H100 box.
RUN_ID=baseline_sp1024 \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=1 train_gpt.py
Final val_bpb should land around 1.22 with a compressed model under 16MB. That's the naive baseline — your starting point, not a target.
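To sanity-check the 16,000,000-byte cap on your artifact, a small helper that totals every file under a folder (the folder name is a placeholder, not a path the repo defines):

```python
import os

LIMIT = 16_000_000  # decimal bytes, per the rules

def artifact_size(root: str) -> int:
    """Total size in bytes of every file under root."""
    return sum(
        os.path.getsize(os.path.join(dirpath, f))
        for dirpath, _, files in os.walk(root)
        for f in files
    )

size = artifact_size("my_submission")   # hypothetical folder name
print(f"{size:,} bytes ({size / LIMIT:.1%} of the cap)")
```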
Tip: the script has a ~10 minute wallclock cap by default. For longer experiments (non-record submissions), override with MAX_WALLCLOCK_SECONDS=0. For periodic validation logs, set VAL_LOSS_EVERY=200.
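A cap like this is typically just a timer check inside the training loop. A toy sketch of that pattern, reading the same env var as the doc (the loop body is illustrative, not the repo's actual code):

```python
import os
import time

# MAX_WALLCLOCK_SECONDS=0 disables the cap, since 0 is falsy below.
max_seconds = float(os.environ.get("MAX_WALLCLOCK_SECONDS", 600))
start = time.time()

for step in range(10):           # stand-in for the real training loop
    # train_step(...)            # hypothetical training step
    if max_seconds and time.time() - start > max_seconds:
        print(f"hit wallclock cap at step {step}")
        break
```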
The rules you must not break
Subtle point on test-time training (TTT): you may do TTT on validation tokens, but only on tokens you've already scored. Training on val tokens before scoring them is cheating.
Submitting a record
Submissions are pull requests that add a new folder under /records/track_10min_16mb/. Your PR must include:
- A README.md explaining the submission
- A submission.json with your name, GitHub ID, val_bpb, and metadata
- A train log (typically averaged over 3 runs to show statistical significance)
- Your train_gpt.py and any dependencies — the script must actually compile and run from inside the records folder
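This guide doesn't spell out the submission.json schema; here is a plausible sketch covering just the fields named above. The field names are my guess — check an accepted record under /records/ for the real shape:

```json
{
  "name": "Ada Lovelace",
  "github_id": "adalovelace",
  "val_bpb": 1.2244,
  "metadata": {
    "hardware": "8xH100 SXM",
    "runs_averaged": 3
  }
}
```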
To be accepted as a record, your run must:
- Beat the current SOTA by at least 0.005 nats, with enough run logs to show the improvement at p < 0.01
- Reproducibly run in under 10 min on 8×H100 SXM
- If you changed the tokenizer or dataset, prove rigorously that val_bpb is calculated correctly
Submissions that don't beat SOTA but demonstrate something unique can still land as non-record submissions, including in the separate unlimited-compute track.
The $1M compute grant
OpenAI is sponsoring $1M in compute credits to help people who couldn't otherwise afford an 8×H100 box. Use the Request a Compute Grant form. Apply with an email tied to your OpenAI / ChatGPT account and write a real justification.
If you're a student or early-career researcher, also fill out the Challenge Participant Form. OpenAI plans to hire a small cohort of early-career researchers in June 2026 from standout participants.
What to actually try first
Once your baseline run lands at ~1.22 BPB, you have the full leaderboard as a menu of ideas. The fastest way to understand what works is to read the CodeSOTA editorials on the dominant techniques.
Lost on an acronym? The methods dictionary defines every technique, optimizer, and quantization scheme that appears on the leaderboard.
OpenAI also maintains a list of "requested" experiments in the README — ideas they'd love to see implemented but haven't landed yet. Open slots as of writing: JEPA, text diffusion, H-net tokenization, universal transformer (long-form), megakernels, state-space models, and learning adapters on random linear maps. These are a good target for non-record submissions in the unlimited compute track.