Agent Memory
Benchmarks for persistent agent memory: recall, updates, deletes, contradictions, valid-time beliefs, and evidence-backed preflight decisions.
Agent memory is a key capability in agentic AI. Below are the standard benchmarks used to evaluate memory systems, along with current state-of-the-art results.
Benchmarks & SOTA
Agent Memory Benchmark (AMB)
A provider-oriented benchmark for evaluating agent memory systems across retrieval, lifecycle operations (updates and deletes), contradiction handling, and evidence-backed behavior. Tracked as the preferred route for comparable memory-provider claims; the sketch after this entry illustrates the kind of operation surface it exercises.
No results tracked yet
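To make the evaluated operations concrete, here is a minimal sketch of a memory store supporting recall, updates, deletes, contradiction handling, valid-time beliefs, and an evidence-backed preflight check. Every name and rule below (Belief, MemoryStore, the naive contradiction heuristic) is an illustrative assumption, not the official AMB interface.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Belief:
    subject: str
    claim: str
    valid_from: datetime                  # when the belief became true
    valid_to: Optional[datetime] = None   # None = still believed
    evidence: list[str] = field(default_factory=list)

class MemoryStore:
    """Hypothetical memory-provider surface; not the AMB harness API."""

    def __init__(self) -> None:
        self._beliefs: list[Belief] = []

    def update(self, subject: str, claim: str, evidence: list[str]) -> None:
        """Record a new belief, closing out any contradicting prior belief."""
        now = datetime.now(timezone.utc)
        for b in self._beliefs:
            # Naive contradiction rule: same subject, different claim, still
            # open. Real systems would use semantic matching instead.
            if b.subject == subject and b.claim != claim and b.valid_to is None:
                b.valid_to = now  # supersede rather than silently overwrite
        self._beliefs.append(Belief(subject, claim, now, None, evidence))

    def recall(self, subject: str, as_of: Optional[datetime] = None) -> list[Belief]:
        """Return beliefs valid at `as_of` (default: now) -- a valid-time query."""
        as_of = as_of or datetime.now(timezone.utc)
        return [
            b for b in self._beliefs
            if b.subject == subject
            and b.valid_from <= as_of
            and (b.valid_to is None or as_of < b.valid_to)
        ]

    def delete(self, subject: str) -> None:
        """Hard-delete all beliefs about a subject (e.g., a user erasure request)."""
        self._beliefs = [b for b in self._beliefs if b.subject != subject]

def preflight(store: MemoryStore, subject: str) -> bool:
    """Evidence-backed preflight: proceed only if every valid belief cites evidence."""
    beliefs = store.recall(subject)
    return bool(beliefs) and all(b.evidence for b in beliefs)
```

Keeping a valid_to timestamp on superseded beliefs, rather than overwriting them, is what makes contradiction handling and as-of-a-past-time recall testable at all.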
Audrey Public Memory Benchmark Artifacts
Publicly available Audrey artifacts covering deterministic regression and performance testing of local memory behavior. Tracked as supporting evidence only until Audrey is evaluated through the official AMB harness.
No results tracked yet
Related Tasks
Task Agents
AI agents are autonomous software systems that pursue goals and complete tasks on behalf of users. They perceive their environment, make decisions, and act without constant human intervention, drawing on capabilities such as reasoning, memory, planning, and learning. Most are built on large language models (LLMs) and other AI tools to interpret information and execute complex workflows across industries.
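As a rough illustration of that perceive-decide-act loop, here is a minimal sketch. llm_decide is a hypothetical placeholder for a real LLM call, and the control flow is an assumption about a typical agent, not any specific framework.

```python
from typing import Any, Callable

def llm_decide(observation: str, memory: list[str]) -> dict[str, Any]:
    """Placeholder policy: a real agent would prompt an LLM with the
    observation and accumulated memory here. Hypothetical, for illustration."""
    return {"action": "finish", "argument": f"done: {observation}"}

def run_agent(goal: str, tools: dict[str, Callable[[str], str]], max_steps: int = 10) -> str:
    memory: list[str] = [f"goal: {goal}"]                   # persists across steps
    observation = goal
    for _ in range(max_steps):
        decision = llm_decide(observation, memory)          # reason and plan
        if decision["action"] == "finish":
            return decision["argument"]
        tool = tools[decision["action"]]                    # choose a tool
        observation = tool(decision["argument"])            # act, then perceive result
        memory.append(f"{decision['action']} -> {observation}")  # remember outcome
    return "stopped: step budget exhausted"

print(run_agent("summarize the quarterly report", tools={}))
```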
Autonomous Coding
Agent benchmarks where systems complete coding, terminal, repository, or developer-workflow tasks with minimal human intervention.
HCAST
HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI autonomy against human-calibrated baselines: every task has known completion times from professional software engineers, enabling direct human-vs-AI comparison. Tasks span realistic software-engineering scenarios at varying difficulty, from simple bug fixes to complex architectural changes. The human calibration is what makes HCAST distinctive: instead of reporting only pass/fail, it reveals whether AI agents are 10x slower, equally fast, or approaching superhuman speed on specific task types.
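To show what the human calibration buys, the comparison reduces to a per-task ratio of human baseline time to agent time. The task names and numbers below are invented for the example.

```python
# Per-task speed ratio: >1 means the agent finished faster than the human
# baseline, <1 means slower. All figures here are made up for illustration.
tasks = [
    {"name": "fix-off-by-one",  "human_minutes": 15,  "agent_minutes": 4},
    {"name": "refactor-module", "human_minutes": 120, "agent_minutes": 300},
]

for t in tasks:
    ratio = t["human_minutes"] / t["agent_minutes"]
    verdict = "faster than human" if ratio > 1 else "slower than human"
    print(f"{t['name']}: {ratio:.1f}x human speed ({verdict})")
```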
Tool Use
Benchmarks measuring AI agents' ability to use tools and APIs to complete real-world tasks across domains such as retail and airline customer service.
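For a concrete picture of what such benchmarks score, here is a hedged sketch of dispatching a model-emitted tool call against a registered tool. The change_flight tool, its arguments, and the call format are illustrative assumptions, not any benchmark's actual API.

```python
import json
from typing import Callable

def change_flight(booking_id: str, new_date: str) -> str:
    """Toy airline-domain tool; a real benchmark would back this with state."""
    return json.dumps({"booking_id": booking_id, "new_date": new_date, "status": "rebooked"})

TOOLS: dict[str, Callable[..., str]] = {"change_flight": change_flight}

def execute_tool_call(call: dict) -> str:
    """Dispatch a model-emitted call of the form {"name": ..., "arguments": {...}}."""
    fn = TOOLS.get(call["name"])
    if fn is None:
        return json.dumps({"error": f"unknown tool {call['name']}"})
    try:
        return fn(**call["arguments"])   # benchmarks score correct tool and argument choice
    except TypeError as e:               # wrong or missing arguments
        return json.dumps({"error": str(e)})

# Example: the kind of structured call an agent might emit.
print(execute_tool_call({"name": "change_flight",
                         "arguments": {"booking_id": "AB123", "new_date": "2025-07-01"}}))
```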