τ²-Bench (TauBench) is a benchmark for evaluating conversational AI agents in dual-control environments, where both the agent and the user can actively use tools to interact with a shared, dynamic world. Unlike traditional single-control benchmarks, where only the AI agent uses tools, τ²-Bench models real-world scenarios such as technical support, in which users must actively participate in modifying the state of the shared environment.

The benchmark introduces a novel Telecom domain modeled as a Dec-POMDP (Decentralized Partially Observable Markov Decision Process), testing both agent coordination and communication. It features a compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity. It also includes a reliable user simulator tightly coupled with the environment; because the simulator's behavior is constrained by tools and observable states, simulation fidelity improves.

τ²-Bench provides fine-grained analysis of agent performance through multiple ablations, separating errors that arise from reasoning from those caused by communication and coordination. Experiments show significant performance drops when agents move from no-user to dual-control settings, highlighting the challenge of guiding users. This variant focuses on the airline domain, where agents must help users with airline-related tasks (flight booking, reservations, and so on) in a dual-control environment.
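To make the dual-control idea concrete, here is a minimal sketch of an interaction loop in which both the agent and the simulated user call tools that mutate a shared environment. This is an illustration only, not the τ²-Bench API: the `SharedEnv` class, its tools, and the `episode` scenario are all hypothetical.

```python
# Hypothetical sketch of a dual-control episode: both the agent and the
# simulated user act on the same shared state via tool calls.
from dataclasses import dataclass, field


@dataclass
class SharedEnv:
    """Shared world state that both parties can read and modify."""
    state: dict = field(default_factory=dict)

    def call_tool(self, actor: str, tool: str, **kwargs) -> str:
        # Toy tool set: "set" writes a key, "get" reads one back.
        if tool == "set":
            self.state[kwargs["key"]] = kwargs["value"]
            return f"{actor} set {kwargs['key']}"
        if tool == "get":
            return str(self.state.get(kwargs["key"]))
        raise ValueError(f"unknown tool: {tool}")


def episode(env: SharedEnv) -> bool:
    # The agent guides the user to perform an action only the user can do
    # (in the Telecom domain, e.g. restarting their own device); the user's
    # tool call changes the shared state.
    env.call_tool("user", "set", key="router", value="restarted")
    # The agent then verifies the outcome by reading the shared state,
    # mirroring how tasks in the benchmark are checked against final state.
    return env.call_tool("agent", "get", key="router") == "restarted"
```

A successful episode here simply means the agent-observed state matches the goal, which is the same verifiability property the benchmark's compositional task generator relies on.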
No results indexed yet — be the first to submit a score.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and — if it takes the top — annotate the step on the progress chart with your name.