Tool Use · benchmark dataset · 2024 · EN

Tau2-Bench: Agentic Tool-Use Benchmark.

Tau2-Bench is an agentic benchmark that tests tool-use capabilities in retail and airline customer-service domains. It measures a model's ability to use APIs and tools to resolve real-world tasks. Scores are reported as the average pass rate across domains.
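As a rough illustration of how an average pass rate across domains could be computed, the sketch below assumes one boolean outcome per task and an unweighted mean over domains. The function names, the sample outcomes, and the choice of unweighted averaging are assumptions for illustration; this page does not specify the exact aggregation.

```python
from statistics import mean


def pass_rate(outcomes):
    """Percentage of tasks resolved within a single domain."""
    if not outcomes:
        raise ValueError("a domain needs at least one task outcome")
    return 100.0 * sum(outcomes) / len(outcomes)


def average_pass_rate(outcomes_by_domain):
    """Unweighted mean of per-domain pass rates (assumed aggregation)."""
    return mean(pass_rate(o) for o in outcomes_by_domain.values())


# Hypothetical outcomes: True means the agent resolved the task.
example = {
    "retail": [True, True, False, True],  # 3 of 4 resolved
    "airline": [True, False],             # 1 of 2 resolved
}
print(average_pass_rate(example))  # 62.5
```

An unweighted mean keeps a domain with few tasks from being drowned out by a larger one; pooling all tasks before dividing would give a different number whenever the domains differ in size.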

§ 01 · Leaderboard

Best published scores.

19 results indexed across 2 metrics. Shaded row marks current SOTA; ties broken by submission date.


Primary metric: pass_rate · higher is better
All metrics: accuracy, pass_rate
accuracy · 11 rows

| # | Model | Tag | Org | Submitted | Paper / code | accuracy |
|---|---|---|---|---|---|---|
| 01 | GLM-5 | Open | Zhipu AI | Feb 2026 | GLM-5: from Vibe Coding to Agentic Engineering · code | 89.70 |
| 02 | Step-3.5-Flash | | | Feb 2026 | Step 3.5 Flash: Open Frontier-Level Intelligence with 11… · code | 88.20 |
| 03 | Qwen3.5-397B-A17B | Open | Alibaba | Feb 2026 | pwc-dump · code | 86.70 |
| 04 | Qwen3.5-35B-A3B | Open | Alibaba | Feb 2026 | pwc-dump · code | 81.20 |
| 05 | Intern-S1-Pro | | Shanghai AI Lab | Mar 2026 | Intern-S1-Pro: Scientific Multimodal Foundation Model at… | 80.90 |
| 06 | DeepSeek-V3.2 | Open | DeepSeek | Dec 2025 | DeepSeek-V3.2: Pushing the Frontier of Open Large Langua… | 80.30 |
| 07 | Qwen3.5-122B-A10B | Open | Alibaba | Feb 2026 | pwc-dump · code | 79.50 |
| 08 | Qwen3.5-27B | Open | Alibaba | Feb 2026 | pwc-dump · code | 79 |
| 09 | Ling-2.6-1T | | | Apr 2026 | pwc-dump | 78.36 |
| 10 | SenseNova-U1-A3B-MoT | | SenseTime | May 2026 | SenseNova-U1: Unifying Multimodal Understanding and Gene… · code | 75.39 |
| 11 | NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | | | Dec 2025 | NVIDIA Nemotron 3: Efficient and Open Intelligence | 61.15 |
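pass_rate · primary · 8 rows

| # | Model | Tag | Org | Submitted | Paper / code | pass_rate |
|---|---|---|---|---|---|---|
| 01 | Claude Opus 4.5 | | Anthropic | Nov 2025 | editorial | 79 |
| 02 | GPT-5.2 | | OpenAI | Dec 2025 | editorial | 73 |
| 03 | Gemini 3 Pro | API | Google | Nov 2025 | editorial | 69 |
| 04 | Claude Sonnet 4.5 | | Anthropic | Sep 2025 | editorial | 63 |
| 05 | GPT-5.1 | | OpenAI | | | 59 |
| 06 | Gemini 2.5 Pro | | Google | | | 54 |
| 07 | Claude 3.7 Sonnet | | Anthropic | | | 47 |
| 08 | GPT-4o | API | OpenAI | | | 36 |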
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
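The ordering rule described for the leaderboard (score descending within a metric, ties broken by submission date) can be sketched as follows. The `Row` type and `rank_rows` function are hypothetical names, and letting the earlier submission win a tie is an assumed reading, since the page only says that ties are broken by submission date. The sample rows are taken from the tables above, with months encoded as (year, month) tuples because the tables list months only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    metric: str
    score: float
    submitted: tuple  # (year, month); coarse because only months are listed


def rank_rows(rows, metric):
    """Return the rows for one metric, best score first.

    Ties go to the earlier submission (an assumed reading of the rule).
    """
    selected = [r for r in rows if r.metric == metric]
    return sorted(selected, key=lambda r: (-r.score, r.submitted))


rows = [
    Row("GPT-5.2", "pass_rate", 73, (2025, 12)),
    Row("Claude Sonnet 4.5", "pass_rate", 63, (2025, 9)),
    Row("Claude Opus 4.5", "pass_rate", 79, (2025, 11)),
    Row("Gemini 3 Pro", "pass_rate", 69, (2025, 11)),
    Row("GLM-5", "accuracy", 89.70, (2026, 2)),
]

ranked = rank_rows(rows, "pass_rate")
for position, row in enumerate(ranked, start=1):
    print(f"{position:02d} {row.model} {row.score}")
```

Filtering by metric before sorting keeps accuracy and pass_rate rows from being compared against each other, which matters here because the two tables cover different sets of models.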
§ 03 · Progress

3 steps of state of the art.

Each row below marks a model that broke the previous record on pass_rate. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · pass_rate
  1. Sep 29, 2025 · Claude Sonnet 4.5 · Anthropic · 63
  2. Nov 18, 2025 · Gemini 3 Pro · Google · 69
  3. Nov 24, 2025 · Claude Opus 4.5 · Anthropic · 79
Fig 3 · SOTA-setting models only. 3 entries span Sep 2025 to Nov 2025.
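A minimal sketch of how such a SOTA line could be derived from dated results: walk the entries in date order and keep only those that strictly beat the running best. The `sota_line` function is a hypothetical name, not Codesota's implementation. The three dated entries come from the list above; GPT-5.2 is included as a non-record entry, and since the leaderboard gives only "Dec 2025" for it, the day of the month is a placeholder.

```python
from datetime import date


def sota_line(results):
    """Keep only results that strictly beat every earlier score.

    `results` is an iterable of (date, model, score) tuples.
    """
    best = float("-inf")
    steps = []
    for day, model, score in sorted(results, key=lambda r: r[0]):
        if score > best:
            best = score
            steps.append((day, model, score))
    return steps


results = [
    (date(2025, 11, 24), "Claude Opus 4.5", 79),
    (date(2025, 9, 29), "Claude Sonnet 4.5", 63),
    (date(2025, 12, 1), "GPT-5.2", 73),  # day of month is a placeholder
    (date(2025, 11, 18), "Gemini 3 Pro", 69),
]

for day, model, score in sota_line(results):
    print(day.isoformat(), model, score)
```

GPT-5.2 drops out because its score of 73 arrives after the record already stands at 79. Rows without any date cannot be placed on the line at all under this approach.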
§ 04 · Literature

6 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with frozen commit + seed
  • 03 · Declared evaluation environment (Python, deps)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
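One way to check a submission against the five items above before sending it is sketched below. The `Submission` fields, the `missing_items` helper, and every example value are hypothetical; the submission guide defines the real format. The metric names are the two this page lists for the dataset.

```python
from dataclasses import dataclass
from typing import Optional

# Metrics this page lists for the dataset.
DATASET_METRICS = ("accuracy", "pass_rate")


@dataclass
class Submission:
    model_source: str      # public checkpoint or API endpoint
    script_commit: str     # frozen commit of the reproduction script
    seed: Optional[int]    # random seed used for the run
    environment: dict      # e.g. {"python": "3.11", "deps": [...]}
    scores: dict           # metric name -> score
    contact: str


def missing_items(sub, required_metrics=DATASET_METRICS):
    """Return a list of checklist items the submission does not satisfy."""
    problems = []
    if not sub.model_source:
        problems.append("01 public checkpoint or API endpoint")
    if not sub.script_commit or sub.seed is None:
        problems.append("02 reproduction script with frozen commit + seed")
    if not sub.environment.get("python") or "deps" not in sub.environment:
        problems.append("03 declared evaluation environment")
    absent = [m for m in required_metrics if m not in sub.scores]
    if absent:
        problems.append("04 missing metric rows: " + ", ".join(absent))
    if not sub.contact:
        problems.append("05 contact")
    return problems


# Hypothetical submission that reports pass_rate but not accuracy.
example = Submission(
    model_source="https://example.com/checkpoint",
    script_commit="0123abc",
    seed=0,
    environment={"python": "3.11", "deps": []},
    scores={"pass_rate": 70.0},
    contact="team@example.com",
)
print(missing_items(example))  # ['04 missing metric rows: accuracy']
```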