Task agents
AI agents are autonomous software systems that pursue goals and complete tasks on behalf of users: they perceive their environment, make decisions, and act without constant human intervention. They combine capabilities such as reasoning, memory, planning, and learning, typically building on large language models (LLMs) and external tools to interpret information and carry out complex workflows across many industries.
Task agents are a key area of agentic AI. Below are the standard benchmarks used to evaluate models, along with current state-of-the-art results.
Benchmarks & SOTA
AcademiClaw
AcademiClaw: agentic frontier tasks benchmark
Primary benchmark dataset: "AcademiClaw: When Students Set Challenges for AI Agents."
State of the Art
Gemini 3.1 Pro (Google)
avg-tokens-per-task (k): 2,857
TauBench (airline)
τ²-Bench (TauBench) is a benchmark for evaluating conversational AI agents in dual-control environments, where both the agent and the user can actively use tools to interact with a shared, dynamic world. Unlike traditional single-control benchmarks, where only the AI agent uses tools, τ²-Bench models real-world scenarios such as technical support, in which users must actively participate in modifying the state of the shared environment. The benchmark introduces a novel Telecom domain modeled as a Dec-POMDP (Decentralized Partially Observable Markov Decision Process), testing both agent coordination and communication.

It features a compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity, and a reliable user simulator tightly coupled with the environment, whose behavior is constrained by tools and observable states, improving simulation fidelity. τ²-Bench provides fine-grained analysis of agent performance through multiple ablations, separating errors arising from reasoning versus communication and coordination. Experiments show significant performance drops when agents shift from no-user to dual-control settings, highlighting the challenge of guiding users.

This variant focuses on the airline domain, where agents must help users with airline-related tasks (flight booking, reservations, etc.) in a dual-control environment. A minimal sketch of the dual-control loop follows this entry.
No results tracked yet
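The dual-control setup is easiest to picture as an episode loop in which the agent and a simulated user take turns acting on one shared environment. The sketch below is a hypothetical illustration of that loop; the names (SharedEnv, agent_tool, user_tool, act) and the success check are assumptions, not the actual τ²-Bench API.

```python
# Hypothetical dual-control episode loop: the agent and the simulated user
# both act on one shared environment state. Illustrative only.
from dataclasses import dataclass, field


@dataclass
class SharedEnv:
    """Shared, dynamic world state that either party may modify via tools."""
    state: dict = field(default_factory=dict)

    def agent_tool(self, name: str, **kwargs) -> str:
        # e.g. look up a reservation, change a booking
        self.state[name] = kwargs
        return f"agent executed {name}"

    def user_tool(self, name: str, **kwargs) -> str:
        # e.g. the user reads a confirmation code or restarts a device
        self.state[name] = kwargs
        return f"user executed {name}"


def run_episode(agent, user_sim, env: SharedEnv, max_turns: int = 20) -> dict:
    """Alternate agent and user turns; a turn may be a message or a tool call."""
    transcript = []
    for _ in range(max_turns):
        # Agent turn: `agent.act` is a hypothetical policy returning a dict.
        agent_action = agent.act(transcript, env.state)
        if agent_action.get("tool"):
            agent_action["result"] = env.agent_tool(
                agent_action["tool"], **agent_action.get("args", {}))
        transcript.append(("agent", agent_action))

        # User turn: the simulator is constrained to its own tools and observations.
        user_action = user_sim.act(transcript, env.state)
        if user_action.get("tool"):
            user_action["result"] = env.user_tool(
                user_action["tool"], **user_action.get("args", {}))
        transcript.append(("user", user_action))
        if user_action.get("done"):
            break

    # Success is judged against the final environment state, not the chat text.
    return {"transcript": transcript, "final_state": env.state}
```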
Terminal Bench
Terminal-Bench is a framework and set of tasks for evaluating how well AI agents can accomplish complex tasks in a terminal environment. It consists of a dataset of tasks and an execution harness. Each task includes a description in English, a Docker environment, and a test script to verify successful completion by the agent. A sketch of this task structure and verification step follows this entry.
No results tracked yet
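As a rough illustration of the task structure described above, the sketch below models one task record (English instruction, Docker image, test script) plus a harness step that lets an agent work inside the container and then scores pass/fail from the test script's exit code. The field names and docker wiring are assumptions, not the real Terminal-Bench harness.

```python
# Illustrative Terminal-Bench-style task record and harness step (assumed layout).
import subprocess
from dataclasses import dataclass


@dataclass
class TerminalTask:
    task_id: str
    instruction: str   # English description given to the agent
    docker_image: str  # environment the agent works in
    test_script: str   # path (inside the container) to a script that exits 0 on success


def evaluate(task: TerminalTask, run_agent) -> bool:
    """Start the container, let the agent work, then run the verification script."""
    container = subprocess.run(
        ["docker", "run", "-d", task.docker_image, "sleep", "infinity"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    try:
        # `run_agent` is a hypothetical callable that issues shell commands
        # inside the container to carry out the instruction.
        run_agent(container, task.instruction)
        check = subprocess.run(["docker", "exec", container, "bash", task.test_script])
        return check.returncode == 0  # pass/fail comes from the test script
    finally:
        subprocess.run(["docker", "rm", "-f", container], capture_output=True)
```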
TauBench (retail)
τ²-Bench (TauBench), described in full under TauBench (airline) above, evaluates conversational agents in dual-control environments where both the agent and the user can use tools to act on a shared, dynamic world. This variant focuses on the retail domain, where agents must help users with retail-related tasks in a dual-control environment.
No results tracked yet
BFCL
BFCL (Berkeley Function-Calling Leaderboard) is a comprehensive benchmark designed to evaluate the function-calling (also known as tool-use) capabilities of Large Language Models (LLMs) in a wide range of real-world settings. It assesses models across various scenarios, including serial (simple), parallel, and multi-turn interactions, and evaluates agentic capabilities such as reasoning in stateful multi-step environments, memory, web search, and format sensitivity. A toy call-matching sketch follows this entry.
No results tracked yet
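To make call-matching concrete, here is a toy scorer for single and parallel function calls. The rules (exact name, exact argument values, order-insensitive matching of parallel calls) are simplified assumptions and do not reproduce BFCL's actual AST-based checker.

```python
# Toy function-call scorer in the spirit of BFCL-style evaluation. Simplified.
from typing import Any


def normalize(call: dict[str, Any]) -> tuple:
    """Reduce a call to a hashable (name, sorted-arguments) form."""
    return (call["name"], tuple(sorted(call.get("arguments", {}).items())))


def score_simple(predicted: dict, expected: dict) -> bool:
    """Single ("serial") call: name and arguments must match exactly."""
    return normalize(predicted) == normalize(expected)


def score_parallel(predicted: list, expected: list) -> bool:
    """Parallel calls are unordered: every expected call must appear exactly once."""
    return sorted(map(normalize, predicted)) == sorted(map(normalize, expected))


# Usage: two weather calls emitted in a different order still count as correct.
pred = [{"name": "get_weather", "arguments": {"city": "Paris", "unit": "C"}},
        {"name": "get_weather", "arguments": {"city": "Rome", "unit": "C"}}]
gold = [{"name": "get_weather", "arguments": {"city": "Rome", "unit": "C"}},
        {"name": "get_weather", "arguments": {"city": "Paris", "unit": "C"}}]
assert score_parallel(pred, gold)
```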
PhysicianBench
No results tracked yet
Nexus
NexusBench (Nexus function-calling / tool-use benchmark, Nexusflow)
Nexus (NexusBench) is a collection of function-calling / tool-use evaluation benchmarks and associated evaluation data released by Nexusflow (Nexusflow.ai). It is intended to measure LLMs' ability to (a) select and call external functions/APIs correctly, (b) parameterize tool calls, and (c) perform multi-step agentic workflows. The Nexus project is provided as a benchmark suite (NexusBench) on GitHub and as multiple evaluation datasets on Hugging Face (examples include Nexusflow/NexusRaven_API_evaluation, Function_Call_Definitions, and several per-task benchmark shards such as VirusTotalBenchmark, NVDLibraryBenchmark, TicketTrackingBenchmark, etc.). The suite is used as the basis for Nexusflow's function-calling leaderboard and model evaluations (e.g., NexusRaven). Primary sources: NexusBench GitHub repository (nexusflowai/NexusBench) and Nexusflow Hugging Face datasets (Nexusflow/*). A toy selection-and-parameterization sketch follows this entry.
No results tracked yet
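A minimal sketch of the selection-and-parameterization check described above: given a few candidate API definitions, did the model pick the right function and supply its required parameters? The definitions and rules below are illustrative, not taken from NexusBench.

```python
# Hypothetical single-call check: right tool chosen, required parameters supplied.
API_DEFS = {
    "lookup_cve":    {"required": {"cve_id"}},
    "scan_url":      {"required": {"url"}},
    "create_ticket": {"required": {"title", "priority"}},
}


def check_call(predicted: dict, expected_name: str) -> bool:
    name = predicted.get("name")
    if name != expected_name or name not in API_DEFS:
        return False  # wrong tool selected
    supplied = set(predicted.get("arguments", {}))
    return API_DEFS[name]["required"] <= supplied  # all required parameters present


# e.g. a model asked to "file a high-priority ticket about the login outage"
call = {"name": "create_ticket",
        "arguments": {"title": "Login outage", "priority": "high"}}
assert check_call(call, expected_name="create_ticket")
```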
Related Tasks
Autonomous Coding
Agent benchmarks where systems complete coding, terminal, repository, or developer-workflow tasks with minimal human intervention.
HCAST
HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI autonomy with human-calibrated baselines — every task has known completion times from professional software engineers, enabling direct human-vs-AI comparison. Tasks span realistic software engineering scenarios at varying difficulty levels, from simple bug fixes to complex architectural changes. The human calibration is what makes HCAST distinctive: instead of just pass/fail, it reveals whether AI agents are 10x slower, equally fast, or approaching superhuman speed on specific task types.
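A toy calculation of what human calibration adds over pass/fail: with a known human baseline time per task, an agent's result can be read as a speed ratio. Field names and numbers below are made up purely for illustration.

```python
# Made-up task records illustrating human-calibrated scoring (not HCAST data).
tasks = [
    {"id": "fix-flaky-test",  "human_minutes": 30,  "agent_minutes": 12,   "agent_passed": True},
    {"id": "refactor-module", "human_minutes": 240, "agent_minutes": 95,   "agent_passed": True},
    {"id": "migrate-schema",  "human_minutes": 120, "agent_minutes": None, "agent_passed": False},
]

for t in tasks:
    if t["agent_passed"]:
        ratio = t["human_minutes"] / t["agent_minutes"]  # >1 means faster than the human baseline
        print(f'{t["id"]}: solved, {ratio:.1f}x human speed')
    else:
        print(f'{t["id"]}: unsolved (human baseline {t["human_minutes"]} min)')
```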
Tool Use
Benchmarks measuring AI agents' ability to use tools and APIs to complete real-world tasks across domains like retail and airline customer service.
Bioinformatics Agents
LLM-agent benchmarks for computational biology — exploring datasets, running multi-step analyses, and interpreting biological results.