Guide · ~15 min to first run

Getting started with Parameter Golf

Train your first 16MB language model. This guide walks through the official flow from the parameter-golf repo: an MLX smoke test on an Apple Silicon Mac, then scaling up to a 1×H100 on Runpod.

Setup instructions, commands, and rule text on this page are adapted from the openai/parameter-golf README, licensed under the MIT License. Copyright © OpenAI. Any errors in summarization are ours.

The challenge, in one paragraph

Train a language model whose final artifact (code + compressed weights) fits in 16,000,000 bytes (decimal megabytes, not MiB) and trains in under 10 minutes on 8×H100 SXM. The score is bits-per-byte (BPB) on the FineWeb validation set; lower is better. The naive baseline is 1.2244 BPB, the current SOTA is under 1.10, and pending PRs claim below 0.83.
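Bits-per-byte normalizes the model's cross-entropy loss by the raw byte length of the validation text, so tokenizer choice can't game the metric. A minimal sketch of the conversion (the numbers in the example are illustrative, not from the real FineWeb split):

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy loss (in nats, over all predicted
    tokens) into bits per byte of the underlying validation text."""
    return total_loss_nats / (math.log(2) * total_bytes)

# Example: 1.0e6 nats of total loss over a ~1.1 MB validation slice.
score = bits_per_byte(1.0e6, 1_100_000)
```

Because the denominator is bytes rather than tokens, a model with a larger vocabulary pays for its fatter embedding table but gets no free ride on per-token loss.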

The challenge runs March 18 → April 30, 2026. OpenAI is sponsoring $1M in compute credits via an application form, and intends to hire a small cohort of early-career researchers from standout participants in June.

Prerequisites

  • Python 3.10+ and git
  • Either an Apple Silicon Mac (for local MLX smoke testing) or a cloud GPU account. Runpod is the officially recommended option.
  • ~20GB free disk for the cached FineWeb dataset (or ~2GB if you pass --train-shards 1)
  • Optional: SSH key configured for Runpod

1. Clone & set up the repo

On a Mac with Apple Silicon, create a fresh virtualenv and install the MLX stack:

git clone https://github.com/openai/parameter-golf.git
cd parameter-golf
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install mlx numpy sentencepiece huggingface-hub datasets tqdm

Not on Apple Silicon? The README suggests asking Codex to refactor train_gpt_mlx.py off MLX — it's a straightforward change. Or skip straight to the Runpod path below.

2. Download the cached FineWeb

Use the 1024-token SentencePiece tokenizer variant. For a local smoke test, grab just one training shard — full download is 80 shards / ~8B tokens.

# Smoke-test subset (fast):
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 1

# Full local dataset (8B tokens):
python3 data/cached_challenge_fineweb.py --variant sp1024

This populates ./data/datasets/fineweb10B_sp1024/ and ./data/tokenizers/. Validation always runs on the full fineweb_val_* split — the fixed first 50k documents.

3. Run the MLX smoke test

200 iterations on your Mac, no periodic validation, just a final BPB print:

RUN_ID=mlx_smoke \
ITERATIONS=200 \
TRAIN_BATCH_TOKENS=8192 \
VAL_LOSS_EVERY=0 \
VAL_BATCH_SIZE=8192 \
python3 train_gpt_mlx.py

You'll see a final val_loss and val_bpb at the end. It won't be competitive — the point here is to confirm the pipeline runs end-to-end before you spend any money on cloud compute.

4. Scale up: 1×H100 on Runpod

Final leaderboard submissions must run on 8×H100 SXM (~$20/hr) — but you don't need 8 GPUs to iterate. Start with a single H100, ~$2-3/hr, and only rent the full box when you're ready to time a real run.

  1. Create a Runpod account and add your SSH key under Settings.
  2. Deploy the official Parameter Golf Runpod template on a 1×H100 pod. Enable SSH terminal access; leave everything else at defaults.
  3. SSH in; you should land in /workspace/.

All Python dependencies are pre-installed in the image. Clone the repo onto local disk:

cd /workspace
git clone https://github.com/openai/parameter-golf.git
cd parameter-golf
python3 data/cached_challenge_fineweb.py --variant sp1024

5. Your first real training run

Launch the baseline config on the single H100. Note nproc_per_node=1 — bump it to 8 once you're on an 8×H100 box.

RUN_ID=baseline_sp1024 \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=1 train_gpt.py

Final val_bpb should land around 1.22 with a compressed model under 16MB. That's the naive baseline — your starting point, not a target.
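Since the 16,000,000-byte budget is decimal bytes over everything you ship, it's worth checking your artifact size locally before opening a PR. A small sketch (the checkpoint filename is a placeholder; list whatever your records folder actually contains):

```python
from pathlib import Path

LIMIT = 16_000_000  # decimal bytes, not 16 MiB (16,777,216)

def artifact_size(paths) -> int:
    """Sum the on-disk sizes of every file that counts toward the budget."""
    return sum(Path(p).stat().st_size for p in paths)

# Everything shipped in the records folder counts: the script itself plus
# the compressed weights. "weights.bin.zst" is an illustrative name.
files = [p for p in ("train_gpt.py", "weights.bin.zst") if Path(p).exists()]
total = artifact_size(files)
print(f"{total:,} bytes used of {LIMIT:,}")
```

Note that `st_size` reports exact bytes, so there is no rounding ambiguity between MB and MiB.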

Tip: the script has a ~10 minute wallclock cap by default. For longer experiments (non-record submissions), override with MAX_WALLCLOCK_SECONDS=0. For periodic validation logs, set VAL_LOSS_EVERY=200.

The rules you must not break

  • 16,000,000 bytes total: code + compressed weights. Decimal bytes, not MiB.
  • Under 10 min on 8×H100 SXM: the wallclock training budget.
  • Eval must run in ≤10 min: in addition to the training budget. Any sequence length is allowed.
  • No network at eval: the artifact must be fully self-contained and reproducible.
  • No validation leakage: you can't train on val tokens before you've graded them.
  • All counted code in train_gpt.py: external Python libraries are free to import, within reason.

Subtle point on test-time training (TTT): you may train on validation tokens at eval time, but only on tokens you've already scored. Training on a val token before scoring it is cheating.
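The legal ordering can be sketched as: for each validation chunk, grade it first, then (optionally) take an update step on it. A minimal sketch, where `score_chunk` and `train_step` are hypothetical stand-ins for your model's loss evaluation and gradient update:

```python
def evaluate_with_ttt(model, val_chunks, score_chunk, train_step):
    """Test-time training that respects the no-leakage rule:
    every chunk is scored BEFORE the model ever trains on it."""
    total_loss, total_bytes = 0.0, 0
    for chunk in val_chunks:
        loss, nbytes = score_chunk(model, chunk)  # grade first...
        total_loss += loss
        total_bytes += nbytes
        train_step(model, chunk)                  # ...then adapt on it
    return total_loss, total_bytes
```

Reversing the two calls inside the loop — adapting before grading — is exactly the leakage the rules forbid.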

Submitting a record

Submissions are pull requests that add a new folder under /records/track_10min_16mb/. Your PR must include:

  • A README.md explaining the submission
  • A submission.json with your name, GitHub ID, val_bpb, and metadata
  • A train log (typically averaged over 3 runs to show statistical significance)
  • Your train_gpt.py and any dependencies — the script must actually compile and run from inside the records folder
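The README requires name, GitHub ID, val_bpb, and metadata in submission.json, but the exact schema isn't reproduced here, so the field names and metadata keys below are illustrative guesses:

```python
import json

# Field names beyond the four the guide lists are assumptions, not
# the official schema -- check the existing records folders for the
# canonical layout before submitting.
submission = {
    "name": "Ada Lovelace",
    "github_id": "adal",
    "val_bpb": 1.0931,
    "metadata": {
        "runs_averaged": 3,
        "hardware": "8xH100 SXM",
    },
}

with open("submission.json", "w") as f:
    json.dump(submission, f, indent=2)
```

Mirroring a recently merged record's folder layout is the safest way to get the schema right.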

To be accepted as a record, your run must:

  • Beat the current SOTA by at least 0.005 nats, with enough run logs to show the improvement at p < 0.01
  • Reproducibly run in under 10 min on 8×H100 SXM
  • If you changed the tokenizer or dataset, prove rigorously that val_bpb is calculated correctly
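For the p < 0.01 requirement, one plausible check (not the challenge's official procedure) is a one-sided one-sample t-test of your run scores against the required threshold, i.e. current SOTA minus the 0.005 margin. A stdlib-only sketch with made-up numbers:

```python
import math
import statistics

def t_statistic(samples, threshold):
    """One-sample t statistic for H0: mean(samples) >= threshold."""
    n = len(samples)
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(n)
    return (mean - threshold) / sem

# Three made-up run scores vs. a hypothetical required 1.0950 BPB.
runs = [1.0901, 1.0898, 1.0910]
t = t_statistic(runs, 1.0950)
# Significant at p < 0.01 (one-sided) if t is below the negative critical
# value for n-1 = 2 degrees of freedom, which is -6.965. With only three
# runs the bar is high; more seeds shrink the critical value quickly.
```

With three runs and tiny variance the example clears the bar easily, but noisier runs will need more seeds, which is why the guide says logs are "typically averaged over 3 runs".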

Submissions that don't beat SOTA but demonstrate something unique can still land as non-record submissions, including in the separate unlimited-compute track.

The $1M compute grant

OpenAI is sponsoring $1M in compute credits to help people who couldn't otherwise afford an 8×H100 box. Use the Request a Compute Grant form. Apply with an email tied to your OpenAI / ChatGPT account and write a real justification.

If you're a student or early-career researcher, also fill out the Challenge Participant Form. OpenAI plans to hire a small cohort of early-career researchers in June 2026 from standout participants.

What to actually try first

Once your baseline run lands at ~1.22 BPB, you have the full leaderboard as a menu of ideas. The fastest way to understand what works is to read the CodeSOTA editorials on the dominant techniques.

Lost on an acronym? The methods dictionary defines every technique, optimizer, and quantization scheme that appears on the leaderboard.

OpenAI also maintains a list of "requested" experiments in the README — ideas they'd love to see implemented but haven't landed yet. Open slots as of writing: JEPA, text diffusion, H-net tokenization, universal transformer (long-form), megakernels, state-space models, and learning adapters on random linear maps. These are a good target for non-record submissions in the unlimited compute track.