Atari Games
Atari games are the original deep RL benchmark: 57 games from the Arcade Learning Environment (ALE) in which agents learn directly from raw pixel inputs. DeepMind's DQN (2013) made the suite canonical by learning to play Breakout from pixels alone, and Agent57 (2020) was the first agent to achieve superhuman scores on all 57 games. Recent work such as BBF and MEME shows that sample efficiency and generalization, not final score, are the remaining frontiers. The benchmark's age is both its strength (decades of comparable results) and its weakness (it doesn't capture the open-ended reasoning modern RL needs).
History
2013: DQN (Mnih et al., DeepMind) learns to play Atari from raw pixels using deep Q-learning
2015: DQN published in Nature, achieving human-level performance on 29 of 49 games tested
2015–2016: Double DQN, Dueling DQN, and Prioritized Experience Replay improve stability and performance
2016: A3C (Asynchronous Advantage Actor-Critic) enables efficient training across parallel actor threads
2017: Rainbow DQN combines six improvements into one agent, setting new records
2018: Ape-X scales distributed experience replay to 360 actors
2020: Agent57 (DeepMind) achieves superhuman scores on all 57 Atari games
2021: EfficientZero achieves strong Atari performance with only 2 hours of gameplay (100K steps)
2023: BBF (Bigger, Better, Faster) achieves superhuman median performance with 2 hours of experience
2024: DIAMOND plays Atari inside a learned diffusion world model that generates video frames
How Atari Games Works
Frame Observation
The agent receives preprocessed pixel frames as state input: typically 84x84 grayscale, with four consecutive frames stacked so the network can infer velocity from a single observation.
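The preprocessing described above can be sketched in a few lines of NumPy. This is an illustrative version only: real pipelines use proper interpolation (e.g. OpenCV's resize) rather than the nearest-neighbor subsampling shown here, and the 210x160 input shape assumes the standard ALE frame size.

```python
import numpy as np
from collections import deque

def preprocess(frame_rgb):
    """Reduce a raw 210x160 RGB frame to 84x84 grayscale.

    Luminance weights follow the ITU-R 601 convention; the resize is a
    crude nearest-neighbor subsample, for illustration only.
    """
    gray = frame_rgb @ np.array([0.299, 0.587, 0.114])   # (210, 160)
    rows = np.linspace(0, 209, 84).astype(int)
    cols = np.linspace(0, 159, 84).astype(int)
    return gray[np.ix_(rows, cols)]                      # (84, 84)

class FrameStack:
    """Keep the k most recent frames so velocity is observable."""
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        # At episode start, fill the stack with copies of the first frame.
        for _ in range(self.k):
            self.frames.append(frame)
        return np.stack(self.frames)                     # (4, 84, 84)

    def step(self, frame):
        self.frames.append(frame)
        return np.stack(self.frames)
```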
Action Selection
A neural network (CNN for the DQN family, transformer for newer methods) maps the visual state to Q-values or a policy distribution over the 18 possible joystick actions (many games expose a smaller minimal action set).
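For the DQN family, action selection from those Q-values is epsilon-greedy with a decaying exploration rate. A minimal sketch, assuming a NumPy `Generator` for randomness and the original DQN schedule (linear anneal from 1.0 to 0.1 over the first million steps):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon take a uniform random action, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def epsilon_schedule(step, start=1.0, end=0.1, anneal_steps=1_000_000):
    """Linearly anneal epsilon from start to end, then hold at end."""
    frac = min(step / anneal_steps, 1.0)
    return start + frac * (end - start)
```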
Environment Interaction
The selected action is executed in the emulator, returning a reward signal (the change in game score, commonly clipped to [-1, 1] during training) and the next frame.
Experience Storage
Transitions (state, action, reward, next state) are stored in a replay buffer for off-policy learning.
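A replay buffer of this kind is typically a fixed-capacity FIFO store with uniform random sampling (prioritized variants weight transitions by TD error instead). A minimal sketch:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement within the batch.
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)
```

Sampling past transitions breaks the temporal correlation of consecutive frames, which is what makes off-policy updates stable.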
Network Update
The neural network is updated to better predict future rewards (DQN) or to maximize expected returns (policy gradient methods).
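For DQN specifically, "predicting future rewards" means regressing Q(s, a) toward the bootstrapped target y = r + gamma * max_a' Q(s', a'), with the bootstrap term dropped at episode ends. A sketch of the target computation for a sampled batch (the `q_next` array stands in for the target network's output, an assumption of this example):

```python
import numpy as np

def dqn_targets(rewards, q_next, dones, gamma=0.99):
    """Compute y = r + gamma * max_a' Q_target(s', a') for a batch.

    rewards: (B,) batch of rewards
    q_next:  (B, num_actions) target-network Q-values at the next states
    dones:   (B,) 1.0 where the episode terminated, else 0.0
    """
    # (1 - dones) zeroes the bootstrap term on terminal transitions.
    return rewards + gamma * (1.0 - dones) * q_next.max(axis=1)
```

The network is then updated by gradient descent on the squared (or Huber) error between Q(s, a) and these targets, with a periodically-copied target network providing `q_next` for stability.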
Current Landscape
Atari games in 2025 are a mature benchmark — solved in the superhuman sense but still valuable for testing sample efficiency, generalization, and new algorithmic ideas. The field has moved from 'can we beat humans?' (yes, since Agent57 in 2020) to 'how efficiently?' (EfficientZero/BBF show 100K-step learning) and 'can we generalize?' (multi-game agents). World models that learn to simulate the game internally (MuZero, DIAMOND) represent the cutting-edge paradigm.
Key Challenges
Sample efficiency — standard DQN requires 200M frames (~38 days of gameplay) per game to converge
Hard exploration games — Montezuma's Revenge, Pitfall, and Private Eye require long-term planning with sparse rewards
Generalization — agents trained on one game typically cannot play others without retraining from scratch
Benchmark saturation — superhuman performance on most games means Atari is less discriminating for frontier methods
Stochasticity — sticky actions and random starts are needed to prevent memorization of deterministic game patterns
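The 200M-frame figure above is easy to verify as back-of-envelope arithmetic, assuming the Atari 2600's standard 60 Hz frame rate:

```python
# 200M emulator frames at 60 frames per second, converted to days.
frames = 200_000_000
fps = 60
days = frames / fps / 3600 / 24
print(f"{days:.1f} days")   # roughly 38.6 days of real-time gameplay
```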
Quick Recommendations
Research baseline
PPO / Rainbow DQN
Well-understood, reproducible baselines with extensive literature
Sample-efficient RL
EfficientZero / BBF
Best performance in the 100K interaction regime (2 hours of gameplay)
Exploring new paradigms
DIAMOND (diffusion world model)
Represents the frontier of learning environment dynamics as generative models
Multi-game generalization
Gato / multi-game decision transformers
Tests generalization across the full Atari suite
What's Next
Atari is transitioning from a primary benchmark to a development testbed. The frontier has moved to 3D environments (Minecraft, NetHack), real-world robotics, and LLM-based agents. But Atari remains valuable for rapid prototyping of RL ideas due to fast simulation and extensive baselines.
Related Tasks
Continuous Control
Continuous control — learning smooth motor commands in simulated physics — was transformed by MuJoCo and the OpenAI Gym suite in the mid-2010s. SAC (2018) and TD3 became reliable baselines, but the field shifted toward harder locomotion (humanoid parkour, dexterous hands) and sim-to-real transfer after DeepMind's dm_control and Isaac Gym raised the bar. DreamerV3 (2023) showed that world-model approaches can match or beat model-free methods across dozens of control tasks with a single hyperparameter set, signaling a move toward generalist RL agents.
Offline RL
Offline RL — learning policies from fixed datasets without further environment interaction — matters because most real-world domains (healthcare, robotics, autonomous driving) can't afford online exploration. CQL (2020) and IQL (2022) established strong baselines on the D4RL benchmark, but the field was disrupted by Decision Transformer (2021), which recast RL as sequence modeling. The latest wave uses pretrained language models as policy backbones, blurring the line between offline RL and in-context learning, with benchmarks like CORL tracking reproducibility across dozens of algorithms.