Codesota · Benchmark · Atari 2600
Farama Foundation / DeepMind

Atari 2600.

Suite of 57 Atari 2600 games. Standard benchmark for deep reinforcement learning agents.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.


Mean Human-Normalized Score

Mean of the per-game human-normalized scores (HNS), with the human baseline set to 100. Scores above 100 exceed the human baseline.

Higher is better
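
As a rough illustration of how a mean HNS figure is put together, the sketch below assumes the usual definition, in which random play maps to 0 and the human baseline maps to 100. The page does not state the exact formula behind each entry, so treat the function name, the game names, and all raw scores here as hypothetical.

```python
from statistics import mean


def human_normalized_score(agent: float, random: float, human: float) -> float:
    """Scale a raw game score so random play = 0 and the human baseline = 100.

    Assumed definition: 100 * (agent - random) / (human - random).
    """
    if human == random:
        raise ValueError("human and random scores must differ")
    return 100.0 * (agent - random) / (human - random)


# Hypothetical raw scores for three made-up games (not real benchmark data).
raw_scores = {
    "game_a": {"agent": 500.0, "random": 100.0, "human": 300.0},
    "game_b": {"agent": 50.0, "random": 0.0, "human": 100.0},
    "game_c": {"agent": 10100.0, "random": 100.0, "human": 200.0},
}

# Normalize each game separately, then average across games.
per_game_hns = {name: human_normalized_score(**s) for name, s in raw_scores.items()}
mean_hns = mean(per_game_hns.values())

for name, score in per_game_hns.items():
    print(f"{name}: {score:.1f}")
print(f"mean HNS: {mean_hns:.1f}")
```

In this made-up example a single game with a very large normalized score dominates the mean, which is the kind of skew the leaderboard notes attribute to individual games.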

Trust tiers for Mean Human-Normalized Score: verified, paper, vendor, community, unverified
| Rank | Model | Trust | Score | Year | Links | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 01 | go-explore | paper | 40000 | 2025 | Source | Exploration-focused agent. Score is Mean HNS (skewed by Montezuma's Revenge), not Median. |
| 02 | LBC | paper | 10078 | N/A | Source | Mean HNS 10077.52% at 1B frames on Atari 57. Median HNS 1934% (Agent57-style max-over-training). Breaks 24 human world records. ICLR 2023 Oral. |
| 03 | agent57 | paper | 4731.3 | 2025 | Source | Median HNS across 57 games. First to beat the human baseline on all games. |
| 04 | MEME | verified | 4087 | 2026 | Source | Mean HNS at 1B frames on Atari 57 (human=100). 95% CI: 3723–4445. Reaches human-level on all 57 games within 390M frames. |
| 05 | Disco57 | paper | 1386 | N/A | Source | IQM (interquartile mean) = 13.86 at 200M frames on Atari 57. Metric differs from the mean/median HNS in other entries; stored as IQM×100 for scale. Automated RL rule discovery. Nature, Oct 2025. |
| 06 | bbos-1 | paper | 1100 | 2025 | N/A | Model-based optimization. |
| 07 | gdi-h3 | paper | 950 | 2025 | N/A | High sample efficiency. |
| 08 | dreamerv3 | paper | 840 | 2025 | Source | Mastered Atari with fixed hyperparameters using world models. |
| 09 | muzero | paper | 731 | 2025 | Source | Model-based agent that plans with a learned model. |
| 10 | EfficientZero V2 | paper | 242.8 | 2026 | Source | Mean HNS 242.8%, Median 128.6% on Atari 100k (26 games, 100k steps). Surpasses BBF. Model-based RL with Gumbel search. arXiv Mar 2024. |
| 11 | rainbow-dqn | paper | 231 | 2025 | Source | Median HNS. Combines six extensions to DQN. |
| 12 | Rainbow DQN | paper | 231 | 2025 | Source | Median HNS. Combines six extensions to DQN. |
| 13 | BBF (Bigger, Better, Faster) | unverified | 224.7 | 2026 | Source | Mean HNS 224.7% on Atari 100k (26 games, 100k steps). IQM: 104.5%, Median: 91.7%. Value-based RL with scaled networks. ICML 2023. |
| 14 | DIAMOND | unverified | 145.9 | 2026 | Source | DIAMOND (Diffusion World Model). Mean HNS 145.9%, IQM 64.1% on Atari 100k (26 games, 100k steps). Best agent trained entirely within a world model. NeurIPS 2024 Spotlight. |
| 15 | STORM | paper | 126.7 | 2026 | Source | STORM (Stochastic Transformer World Models). Mean HNS 126.7% on Atari 100k (26 games, 100k steps). Transformer-based stochastic world model. arXiv Oct 2023. |
| 16 | Simulus | paper | 110 | 2026 | Source | First planning-free world model to reach human-level IQM and median HNS on Atari 100k (26 games, 100k steps). Superhuman on 13/26 games. Combines intrinsic motivation, prioritized replay, regression-as-classification. arXiv Feb 2025. |
| 17 | DART | unverified | 102.2 | 2026 | Source | DART (Discrete Abstract Representations for Transformer-based learning). Mean HNS 102.2%, Median 79.0%, IQM 57.5% on Atari 100k (26 games, 100k steps). ICML 2024. |
| 18 | Human Professional | unverified | 100 | 2025 | Source | Professional human tester baseline. |
| 19 | human-gamer | paper | 100 | 2025 | Source | Professional human tester baseline. |
| 20 | DQN (Human-level) | paper | 79 | 2025 | Source | Historical baseline (2015). Median HNS. |
| 21 | dqn | paper | 79 | 2025 | Source | Historical baseline (2015). Median HNS. |
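
The entries above report a mix of mean, median, and IQM aggregates, and these can diverge sharply on the same per-game scores. The sketch below compares the three on a hypothetical list of HNS values. The `interquartile_mean` helper is a simplified assumption: it drops the lowest and highest quarter of the sorted values (rounding the cut down) and averages the rest. Published IQM figures are typically aggregated over runs as well as games and may be computed differently.

```python
from statistics import mean, median


def interquartile_mean(values):
    """Average the middle of the sorted values after dropping
    floor(n / 4) items from each end (a simplified IQM)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    cut = len(ordered) // 4
    middle = ordered[cut:len(ordered) - cut]
    return mean(middle)


# Hypothetical per-game HNS values with one extreme outlier (not real data).
hns = [10.0, 50.0, 90.0, 100.0, 120.0, 200.0, 900.0, 40000.0]

aggregates = {
    "mean": mean(hns),
    "median": median(hns),
    "iqm": interquartile_mean(hns),
}

for name, value in aggregates.items():
    print(f"{name}: {value:.2f}")
```

With an outlier like this, the mean lands far above both the median and the IQM, so rows that store different aggregates in the same Score column are not directly comparable.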
