τ²-Bench (TauBench) is a benchmark for evaluating conversational AI agents in dual-control environments, where both the agent and the user can actively use tools to interact with a shared, dynamic world. Unlike traditional single-control benchmarks, where only the AI agent uses tools, τ²-Bench models real-world scenarios such as technical support, in which users must actively participate in modifying the state of the shared environment.

The benchmark introduces a novel Telecom domain modeled as a Dec-POMDP (Decentralized Partially Observable Markov Decision Process), testing both agent coordination and communication. It features a compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity. It also includes a reliable user simulator tightly coupled with the environment; because the simulator's behavior is constrained by tools and observable states, simulation fidelity improves.

τ²-Bench provides fine-grained analysis of agent performance through multiple ablations, separating errors that arise from reasoning from those caused by communication and coordination. Experiments show significant performance drops when agents move from no-user to dual-control settings, highlighting the challenge of guiding users. This variant focuses on the airline domain, where agents must help users with airline-related tasks (flight booking, reservations, and so on) in a dual-control environment.
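To make the dual-control idea concrete, here is a minimal sketch of an interaction loop in which both the agent and the simulated user call tools that mutate a shared environment. This is an illustration only, not the τ²-Bench API: the `SharedEnv` class, its tools, and the `episode` scenario are all hypothetical.

```python
# Hypothetical sketch of a dual-control episode: both the agent and the
# simulated user act on the same shared state via tool calls.
from dataclasses import dataclass, field


@dataclass
class SharedEnv:
    """Shared world state that both parties can read and modify."""
    state: dict = field(default_factory=dict)

    def call_tool(self, actor: str, tool: str, **kwargs) -> str:
        # Toy tool set: "set" writes a key, "get" reads one back.
        if tool == "set":
            self.state[kwargs["key"]] = kwargs["value"]
            return f"{actor} set {kwargs['key']}"
        if tool == "get":
            return str(self.state.get(kwargs["key"]))
        raise ValueError(f"unknown tool: {tool}")


def episode(env: SharedEnv) -> bool:
    # The agent guides the user to perform an action only the user can do
    # (in the Telecom domain, e.g. restarting their own device); the user's
    # tool call changes the shared state.
    env.call_tool("user", "set", key="router", value="restarted")
    # The agent then verifies the outcome by reading the shared state,
    # mirroring how tasks in the benchmark are checked against final state.
    return env.call_tool("agent", "get", key="router") == "restarted"
```

A successful episode here simply means the agent-observed state matches the goal, which is the same verifiability property the benchmark's compositional task generator relies on.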
No results indexed yet — be the first to submit a score.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and — if it takes the top — annotate the step on the progress chart with your name.