
TauBench (retail)

τ²-Bench (TauBench) is a benchmark for evaluating conversational AI agents in dual-control environments, where both the agent and the user can actively use tools to interact with a shared, dynamic world. Unlike traditional single-control benchmarks, where only the AI agent uses tools, τ²-Bench models real-world scenarios such as technical support, in which users must actively participate in modifying the state of the shared environment.

The benchmark introduces a novel Telecom domain modeled as a Dec-POMDP (Decentralized Partially Observable Markov Decision Process), testing both agent coordination and communication. It features a compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity. It also includes a reliable user simulator tightly coupled with the environment, whose behavior is constrained by tools and observable states, improving simulation fidelity.

τ²-Bench provides fine-grained analysis of agent performance through multiple ablations, separating errors arising from reasoning from those arising from communication and coordination. Experiments show significant performance drops when agents shift from no-user to dual-control settings, highlighting the challenge of guiding users. This variant focuses on the retail domain, where agents must help users with retail-related tasks in a dual-control environment.
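Conceptually, a dual-control episode interleaves tool calls from both parties against one shared environment state, so each party only partially observes what the other has changed. The sketch below illustrates the idea in Python; the class and method names (DualControlEnv, agent_step, user_step) and the retail-flavored tools are illustrative assumptions, not τ²-Bench's actual API.

```python
# Hypothetical sketch of a dual-control episode. The names below
# (DualControlEnv, agent_step, user_step, the tool strings) are
# illustrative assumptions, NOT the actual tau2-bench API.
from dataclasses import dataclass, field


@dataclass
class DualControlEnv:
    """Shared world state that both the agent and the user can mutate."""
    state: dict = field(default_factory=lambda: {"order_status": "placed"})

    def agent_step(self, tool: str, args: dict) -> dict:
        # Agent-side tool call, e.g. modifying an order on the user's behalf.
        if tool == "cancel_order":
            self.state["order_status"] = "cancelled"
        return dict(self.state)  # observation returned to the agent

    def user_step(self, tool: str, args: dict) -> dict:
        # User-side tool call: unlike single-control benchmarks, the
        # simulated user can also change the shared state.
        if tool == "confirm_cancellation":
            self.state["user_confirmed"] = True
        return dict(self.state)  # observation returned to the user


env = DualControlEnv()
# One dual-control turn: the agent acts, then the user acts on the
# same environment, so coordination errors are distinguishable from
# reasoning errors.
agent_obs = env.agent_step("cancel_order", {"order_id": "W123"})
user_obs = env.user_step("confirm_cancellation", {})
print(agent_obs)
print(user_obs)
```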

Paper · Submit a result
§ 01 · Leaderboard

Best published scores.

No results indexed yet. Be the first to submit a score.
§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

Submit a result · Read submission guide
What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with a frozen commit and seed (a sketch follows this list)
  • 03 · A declared evaluation environment (Python version, dependencies)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
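For orientation, here is a minimal, hypothetical skeleton of what such a reproduction script might record. The constant names, the placeholder metric row, and the commit placeholder are assumptions, not CodeSOTA's actual tooling; substitute your own evaluation entry point where indicated.

```python
# Hypothetical reproduction-script skeleton: pins the seed and commit,
# declares the environment, and emits one row per metric. Nothing here
# is a real CodeSOTA or tau2-bench interface.
import json
import platform
import random

SEED = 1234                      # frozen seed (item 02)
COMMIT = "<frozen-commit-sha>"   # pin the exact code revision (item 02)

random.seed(SEED)

# Declare the evaluation environment (item 03).
env_info = {
    "python": platform.python_version(),
    "commit": COMMIT,
    "seed": SEED,
}

# Replace this placeholder with whatever entry point your agent
# exposes; it should produce one row per declared metric (item 04).
results = {"<declared-metric>": None}

print(json.dumps({"environment": env_info, "results": results}, indent=2))
```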