Agent Memory
Benchmarks for persistent agent memory: recall, updates, deletes, contradictions, valid-time beliefs, and evidence-backed preflight decisions.
Agent memory is a key capability in agentic AI. Below are the standard benchmarks used to evaluate memory systems, along with current state-of-the-art results.
Benchmarks & SOTA
Agent Memory Benchmark (AMB)
A provider-oriented benchmark for evaluating agent memory systems across retrieval, lifecycle operations (updates and deletes), contradiction handling, and evidence-backed behavior. Tracked as the preferred route for comparable memory-provider claims; the sketch after this entry illustrates the kind of operation surface it exercises.
No results tracked yet
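To make the evaluated operations concrete, here is a minimal sketch of a memory store supporting recall, updates, deletes, contradiction handling, valid-time beliefs, and an evidence-backed preflight check. Every name and rule below (Belief, MemoryStore, the naive contradiction heuristic) is an illustrative assumption, not the official AMB interface.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Belief:
    subject: str
    claim: str
    valid_from: datetime                  # when the belief became true
    valid_to: Optional[datetime] = None   # None = still believed
    evidence: list[str] = field(default_factory=list)

class MemoryStore:
    """Hypothetical memory-provider surface; not the AMB harness API."""

    def __init__(self) -> None:
        self._beliefs: list[Belief] = []

    def update(self, subject: str, claim: str, evidence: list[str]) -> None:
        """Record a new belief, closing out any contradicting prior belief."""
        now = datetime.now(timezone.utc)
        for b in self._beliefs:
            # Naive contradiction rule: same subject, different claim, still
            # open. Real systems would use semantic matching instead.
            if b.subject == subject and b.claim != claim and b.valid_to is None:
                b.valid_to = now  # supersede rather than silently overwrite
        self._beliefs.append(Belief(subject, claim, now, None, evidence))

    def recall(self, subject: str, as_of: Optional[datetime] = None) -> list[Belief]:
        """Return beliefs valid at `as_of` (default: now) -- a valid-time query."""
        as_of = as_of or datetime.now(timezone.utc)
        return [
            b for b in self._beliefs
            if b.subject == subject
            and b.valid_from <= as_of
            and (b.valid_to is None or as_of < b.valid_to)
        ]

    def delete(self, subject: str) -> None:
        """Hard-delete all beliefs about a subject (e.g., a user erasure request)."""
        self._beliefs = [b for b in self._beliefs if b.subject != subject]

def preflight(store: MemoryStore, subject: str) -> bool:
    """Evidence-backed preflight: proceed only if every valid belief cites evidence."""
    beliefs = store.recall(subject)
    return bool(beliefs) and all(b.evidence for b in beliefs)
```

Keeping a valid_to timestamp on superseded beliefs, rather than overwriting them, is what makes contradiction handling and as-of-a-past-time recall testable at all.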
Audrey Public Memory Benchmark Artifacts
Publicly available Audrey artifacts covering deterministic regression and performance testing of local memory behavior. Tracked as supporting evidence only until Audrey is evaluated through the official AMB harness.
No results tracked yet
Related Tasks
Task Agents
AI agents are autonomous software systems that pursue goals and complete tasks on behalf of users. They perceive their environment, make decisions, and act without constant human intervention, drawing on capabilities such as reasoning, memory, planning, and learning. Most are built on large language models (LLMs) and other AI tools to interpret information and execute complex workflows across industries.
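As a rough illustration of that perceive-decide-act loop, here is a minimal sketch. llm_decide is a hypothetical placeholder for a real LLM call, and the control flow is an assumption about a typical agent, not any specific framework.

```python
from typing import Any, Callable

def llm_decide(observation: str, memory: list[str]) -> dict[str, Any]:
    """Placeholder policy: a real agent would prompt an LLM with the
    observation and accumulated memory here. Hypothetical, for illustration."""
    return {"action": "finish", "argument": f"done: {observation}"}

def run_agent(goal: str, tools: dict[str, Callable[[str], str]], max_steps: int = 10) -> str:
    memory: list[str] = [f"goal: {goal}"]                   # persists across steps
    observation = goal
    for _ in range(max_steps):
        decision = llm_decide(observation, memory)          # reason and plan
        if decision["action"] == "finish":
            return decision["argument"]
        tool = tools[decision["action"]]                    # choose a tool
        observation = tool(decision["argument"])            # act, then perceive result
        memory.append(f"{decision['action']} -> {observation}")  # remember outcome
    return "stopped: step budget exhausted"

print(run_agent("summarize the quarterly report", tools={}))
```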
Autonomous Coding
Agent benchmarks where systems complete coding, terminal, repository, or developer-workflow tasks with minimal human intervention.
HCAST
HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI autonomy against human-calibrated baselines: every task has known completion times from professional software engineers, enabling direct human-vs-AI comparison. Tasks span realistic software-engineering scenarios at varying difficulty, from simple bug fixes to complex architectural changes. The human calibration is what makes HCAST distinctive: instead of reporting only pass/fail, it reveals whether AI agents are 10x slower, equally fast, or approaching superhuman speed on specific task types.
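To show what the human calibration buys, the comparison reduces to a per-task ratio of human baseline time to agent time. The task names and numbers below are invented for the example.

```python
# Per-task speed ratio: >1 means the agent finished faster than the human
# baseline, <1 means slower. All figures here are made up for illustration.
tasks = [
    {"name": "fix-off-by-one",  "human_minutes": 15,  "agent_minutes": 4},
    {"name": "refactor-module", "human_minutes": 120, "agent_minutes": 300},
]

for t in tasks:
    ratio = t["human_minutes"] / t["agent_minutes"]
    verdict = "faster than human" if ratio > 1 else "slower than human"
    print(f"{t['name']}: {ratio:.1f}x human speed ({verdict})")
```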
Tool Use
Benchmarks measuring AI agents' ability to use tools and APIs to complete real-world tasks across domains such as retail and airline customer service.
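For a concrete picture of what such benchmarks score, here is a hedged sketch of dispatching a model-emitted tool call against a registered tool. The change_flight tool, its arguments, and the call format are illustrative assumptions, not any benchmark's actual API.

```python
import json
from typing import Callable

def change_flight(booking_id: str, new_date: str) -> str:
    """Toy airline-domain tool; a real benchmark would back this with state."""
    return json.dumps({"booking_id": booking_id, "new_date": new_date, "status": "rebooked"})

TOOLS: dict[str, Callable[..., str]] = {"change_flight": change_flight}

def execute_tool_call(call: dict) -> str:
    """Dispatch a model-emitted call of the form {"name": ..., "arguments": {...}}."""
    fn = TOOLS.get(call["name"])
    if fn is None:
        return json.dumps({"error": f"unknown tool {call['name']}"})
    try:
        return fn(**call["arguments"])   # benchmarks score correct tool and argument choice
    except TypeError as e:               # wrong or missing arguments
        return json.dumps({"error": str(e)})

# Example: the kind of structured call an agent might emit.
print(execute_tool_call({"name": "change_flight",
                         "arguments": {"booking_id": "AB123", "new_date": "2025-07-01"}}))
```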