Reinforcement Learning

Atari Games

Atari games became the canonical RL benchmark when DeepMind's DQN (2013) learned to play Breakout from raw pixels, but the goalposts keep moving. Agent57 (2020) was the first to achieve superhuman scores on all 57 games, and recent work like BBF and MEME shows that sample efficiency — not just final performance — is the new frontier. The benchmark's age is both its strength (decades of comparable results) and weakness (it doesn't capture the open-ended reasoning modern RL needs).


Atari games are the original deep RL benchmark — 57 games from the Arcade Learning Environment where agents learn directly from pixel inputs. DQN's 2013 breakthrough launched the field; by 2025, agents achieve superhuman scores on most games, with sample efficiency and generalization as the remaining frontiers.

History

2013

DQN (Mnih et al., DeepMind) learns to play Atari from raw pixels using deep Q-learning

2015

DQN published in Nature — achieves human-level on 29 of 49 games tested

2015

Double DQN, Dueling DQN, and Prioritized Experience Replay improve stability and performance

2016

A3C (Asynchronous Advantage Actor-Critic) enables parallel training across games

2017

Rainbow DQN combines 6 improvements into one agent, setting new records

2018

Ape-X scales distributed experience replay to 360 actors

2020

Agent57 (DeepMind) achieves superhuman scores on all 57 Atari games

2021

EfficientZero achieves strong Atari performance with only 2 hours of gameplay (100K steps)

2023

BBF (Bigger, Better, Faster) achieves superhuman median with 2 hours of experience

2024

DIAMOND (Diffusion world model) plays Atari using learned video generation models

How Atari Games Works

1

Frame Observation

The agent receives raw pixel frames (typically 84x84 grayscale, stacked 4 frames for velocity information) as state input.
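The frame-stacking idea can be sketched in a few lines. This is a minimal, library-free illustration (real pipelines use ALE/Gymnasium wrappers); the `FrameStack` class and `to_grayscale` helper are hypothetical names for this sketch:

```python
from collections import deque

class FrameStack:
    """Keep the last k preprocessed frames as the agent's state.
    Stacking frames lets a feedforward network infer velocity."""
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        # At episode start, repeat the first frame to fill the stack.
        for _ in range(self.k):
            self.frames.append(frame)
        return list(self.frames)

    def step(self, frame):
        # Each new frame pushes the oldest one out.
        self.frames.append(frame)
        return list(self.frames)

def to_grayscale(rgb_pixel):
    # Standard luminance-weighted grayscale conversion.
    r, g, b = rgb_pixel
    return 0.299 * r + 0.587 * g + 0.114 * b
```

In practice the same logic runs over full 84x84 arrays rather than single values, with max-pooling over consecutive frames to handle Atari's sprite flicker.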

2

Action Selection

A neural network (CNN for DQN-family, transformer for newer methods) maps the visual state to Q-values or a policy distribution over 18 possible joystick actions.
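For value-based agents, action selection from Q-values is typically epsilon-greedy: explore with probability epsilon, otherwise exploit the argmax. A minimal sketch (network inference omitted; assume `q_values` came from the CNN):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Return a random action with probability epsilon,
    otherwise the index of the highest Q-value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

DQN anneals epsilon from 1.0 to around 0.1 over the first million frames, shifting gradually from exploration to exploitation.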

3

Environment Interaction

The selected action is executed in the emulator, returning a reward signal (game score delta) and the next frame.

4

Experience Storage

Transitions (state, action, reward, next state) are stored in a replay buffer for off-policy learning.
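A replay buffer is essentially a fixed-capacity ring buffer with uniform sampling; a minimal sketch (prioritized variants weight the sampling by TD error instead):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions; old entries are
    evicted automatically once capacity is reached."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation
        # between consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly from a large buffer (DQN used 1M transitions) decorrelates updates, which is key to stable off-policy learning.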

5

Network Update

The neural network is updated to better predict future rewards (DQN) or to maximize expected returns (policy gradient methods).
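The DQN update regresses Q(s, a) toward a one-step TD target computed from the target network. The target itself is simple enough to show directly:

```python
def td_target(reward, next_q_values, gamma=0.99, done=False):
    """One-step TD target for the DQN loss:
    y = r + gamma * max_a' Q_target(s', a'),
    with no bootstrap on terminal transitions."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)
```

The squared difference between this target and the online network's Q(s, a) is the loss minimized by gradient descent; using a periodically-synced target network to produce `next_q_values` is what keeps the regression target stable.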

Current Landscape

Atari games in 2025 are a mature benchmark — solved in the superhuman sense but still valuable for testing sample efficiency, generalization, and new algorithmic ideas. The field has moved from 'can we beat humans?' (yes, since Agent57 in 2020) to 'how efficiently?' (EfficientZero/BBF show 100K-step learning) and 'can we generalize?' (multi-game agents). World models that learn to simulate the game internally (MuZero, DIAMOND) represent the cutting-edge paradigm.

Key Challenges

Sample efficiency — standard DQN requires 200M frames (~38 days of gameplay) per game to converge

Hard exploration games — Montezuma's Revenge, Pitfall, and Private Eye require long-term planning with sparse rewards

Generalization — agents trained on one game typically cannot play others without retraining from scratch

Benchmark saturation — superhuman performance on most games means Atari is less discriminating for frontier methods

Stochasticity — sticky actions and random starts are needed to prevent memorization of deterministic game patterns
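The sticky-actions protocol from the challenges above is a thin wrapper around action execution: with probability p (0.25 in the ALE evaluation protocol), the environment repeats the previous action instead of the requested one. A hypothetical minimal wrapper:

```python
import random

class StickyActions:
    """With probability p, ignore the requested action and repeat
    the previous one, making the environment stochastic enough
    to defeat memorized open-loop action sequences."""
    def __init__(self, p=0.25, seed=None):
        self.p = p
        self.prev_action = None
        self.rng = random.Random(seed)

    def apply(self, action):
        if self.prev_action is not None and self.rng.random() < self.p:
            action = self.prev_action
        self.prev_action = action
        return action
```

With p = 0 the environment is fully deterministic; the ALE authors recommend p = 0.25 precisely because deterministic Atari lets agents score well by replaying a fixed trajectory.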

Quick Recommendations

Research baseline

PPO / Rainbow DQN

Well-understood, reproducible baselines with extensive literature

Sample-efficient RL

EfficientZero / BBF

Best performance in the 100K interaction regime (2 hours of gameplay)

Exploring new paradigms

DIAMOND (diffusion world model)

Represents the frontier of learning environment dynamics as generative models

Multi-game generalization

Gato / multi-game decision transformers

Tests generalization across the full Atari suite

What's Next

Atari is transitioning from a primary benchmark to a development testbed. The frontier has moved to 3D environments (Minecraft, NetHack), real-world robotics, and LLM-based agents. But Atari remains valuable for rapid prototyping of RL ideas due to fast simulation and extensive baselines.
