Atari Games
Atari games are the original deep RL benchmark: 57 games from the Arcade Learning Environment (ALE) in which agents learn directly from raw pixel inputs. DeepMind's DQN (2013) made the suite canonical by learning to play Breakout from pixels alone, and Agent57 (2020) was the first agent to achieve superhuman scores on all 57 games. Recent work such as BBF and MEME shows that sample efficiency and generalization, not final score, are the remaining frontiers. The benchmark's age is both its strength (decades of comparable results) and its weakness (it doesn't capture the open-ended reasoning modern RL needs).
History
2013: DQN (Mnih et al., DeepMind) learns to play Atari from raw pixels using deep Q-learning
2015: DQN published in Nature, achieving human-level performance on 29 of 49 games tested
2015–2016: Double DQN, Dueling DQN, and Prioritized Experience Replay improve stability and performance
2016: A3C (Asynchronous Advantage Actor-Critic) enables efficient training across parallel actor threads
2017: Rainbow DQN combines six improvements into one agent, setting new records
2018: Ape-X scales distributed experience replay to 360 actors
2020: Agent57 (DeepMind) achieves superhuman scores on all 57 Atari games
2021: EfficientZero achieves strong Atari performance with only 2 hours of gameplay (100K steps)
2023: BBF (Bigger, Better, Faster) achieves superhuman median performance with 2 hours of experience
2024: DIAMOND plays Atari inside a learned diffusion world model that generates video frames
How Atari Games Works
Frame Observation
The agent receives preprocessed pixel frames as state input: typically 84x84 grayscale, with four consecutive frames stacked so the network can infer velocity from a single observation.
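The preprocessing described above can be sketched in a few lines of NumPy. This is an illustrative version only: real pipelines use proper interpolation (e.g. OpenCV's resize) rather than the nearest-neighbor subsampling shown here, and the 210x160 input shape assumes the standard ALE frame size.

```python
import numpy as np
from collections import deque

def preprocess(frame_rgb):
    """Reduce a raw 210x160 RGB frame to 84x84 grayscale.

    Luminance weights follow the ITU-R 601 convention; the resize is a
    crude nearest-neighbor subsample, for illustration only.
    """
    gray = frame_rgb @ np.array([0.299, 0.587, 0.114])   # (210, 160)
    rows = np.linspace(0, 209, 84).astype(int)
    cols = np.linspace(0, 159, 84).astype(int)
    return gray[np.ix_(rows, cols)]                      # (84, 84)

class FrameStack:
    """Keep the k most recent frames so velocity is observable."""
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        # At episode start, fill the stack with copies of the first frame.
        for _ in range(self.k):
            self.frames.append(frame)
        return np.stack(self.frames)                     # (4, 84, 84)

    def step(self, frame):
        self.frames.append(frame)
        return np.stack(self.frames)
```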
Action Selection
A neural network (CNN for the DQN family, transformer for newer methods) maps the visual state to Q-values or a policy distribution over the 18 possible joystick actions (many games expose a smaller minimal action set).
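For the DQN family, action selection from those Q-values is epsilon-greedy with a decaying exploration rate. A minimal sketch, assuming a NumPy `Generator` for randomness and the original DQN schedule (linear anneal from 1.0 to 0.1 over the first million steps):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon take a uniform random action, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def epsilon_schedule(step, start=1.0, end=0.1, anneal_steps=1_000_000):
    """Linearly anneal epsilon from start to end, then hold at end."""
    frac = min(step / anneal_steps, 1.0)
    return start + frac * (end - start)
```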
Environment Interaction
The selected action is executed in the emulator, returning a reward signal (the change in game score, commonly clipped to [-1, 1] during training) and the next frame.
Experience Storage
Transitions (state, action, reward, next state) are stored in a replay buffer for off-policy learning.
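A replay buffer of this kind is typically a fixed-capacity FIFO store with uniform random sampling (prioritized variants weight transitions by TD error instead). A minimal sketch:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement within the batch.
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)
```

Sampling past transitions breaks the temporal correlation of consecutive frames, which is what makes off-policy updates stable.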
Network Update
The neural network is updated to better predict future rewards (DQN) or to maximize expected returns (policy gradient methods).
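For DQN specifically, "predicting future rewards" means regressing Q(s, a) toward the bootstrapped target y = r + gamma * max_a' Q(s', a'), with the bootstrap term dropped at episode ends. A sketch of the target computation for a sampled batch (the `q_next` array stands in for the target network's output, an assumption of this example):

```python
import numpy as np

def dqn_targets(rewards, q_next, dones, gamma=0.99):
    """Compute y = r + gamma * max_a' Q_target(s', a') for a batch.

    rewards: (B,) batch of rewards
    q_next:  (B, num_actions) target-network Q-values at the next states
    dones:   (B,) 1.0 where the episode terminated, else 0.0
    """
    # (1 - dones) zeroes the bootstrap term on terminal transitions.
    return rewards + gamma * (1.0 - dones) * q_next.max(axis=1)
```

The network is then updated by gradient descent on the squared (or Huber) error between Q(s, a) and these targets, with a periodically-copied target network providing `q_next` for stability.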
Current Landscape
Atari games in 2025 are a mature benchmark — solved in the superhuman sense but still valuable for testing sample efficiency, generalization, and new algorithmic ideas. The field has moved from 'can we beat humans?' (yes, since Agent57 in 2020) to 'how efficiently?' (EfficientZero/BBF show 100K-step learning) and 'can we generalize?' (multi-game agents). World models that learn to simulate the game internally (MuZero, DIAMOND) represent the cutting-edge paradigm.
Key Challenges
Sample efficiency — standard DQN requires 200M frames (~38 days of gameplay) per game to converge
Hard exploration games — Montezuma's Revenge, Pitfall, and Private Eye require long-term planning with sparse rewards
Generalization — agents trained on one game typically cannot play others without retraining from scratch
Benchmark saturation — superhuman performance on most games means Atari is less discriminating for frontier methods
Stochasticity — sticky actions and random starts are needed to prevent memorization of deterministic game patterns
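The 200M-frame figure above is easy to verify as back-of-envelope arithmetic, assuming the Atari 2600's standard 60 Hz frame rate:

```python
# 200M emulator frames at 60 frames per second, converted to days.
frames = 200_000_000
fps = 60
days = frames / fps / 3600 / 24
print(f"{days:.1f} days")   # roughly 38.6 days of real-time gameplay
```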
Quick Recommendations
Research baseline
PPO / Rainbow DQN
Well-understood, reproducible baselines with extensive literature
Sample-efficient RL
EfficientZero / BBF
Best performance in the 100K interaction regime (2 hours of gameplay)
Exploring new paradigms
DIAMOND (diffusion world model)
Represents the frontier of learning environment dynamics as generative models
Multi-game generalization
Gato / multi-game decision transformers
Tests generalization across the full Atari suite
What's Next
Atari is transitioning from a primary benchmark to a development testbed. The frontier has moved to 3D environments (Minecraft, NetHack), real-world robotics, and LLM-based agents. But Atari remains valuable for rapid prototyping of RL ideas due to fast simulation and extensive baselines.
Related Tasks
Continuous Control
Continuous control — learning smooth motor commands in simulated physics — was transformed by MuJoCo and the OpenAI Gym suite in the mid-2010s. SAC (2018) and TD3 became reliable baselines, but the field shifted toward harder locomotion (humanoid parkour, dexterous hands) and sim-to-real transfer after DeepMind's dm_control and Isaac Gym raised the bar. DreamerV3 (2023) showed that world-model approaches can match or beat model-free methods across dozens of control tasks with a single hyperparameter set, signaling a move toward generalist RL agents.
Offline RL
Offline RL — learning policies from fixed datasets without further environment interaction — matters because most real-world domains (healthcare, robotics, autonomous driving) can't afford online exploration. CQL (2020) and IQL (2022) established strong baselines on the D4RL benchmark, but the field was disrupted by Decision Transformer (2021), which recast RL as sequence modeling. The latest wave uses pretrained language models as policy backbones, blurring the line between offline RL and in-context learning, with benchmarks like CORL tracking reproducibility across dozens of algorithms.