Tool Use · 2024 · en

Tau2-Bench: Agentic Tool-Use Benchmark

An agentic benchmark that tests tool-use capabilities across retail and airline customer service domains. It measures a model's ability to use APIs and tools to resolve real-world tasks. The reported score is the average pass rate across domains.
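
As a rough illustration of how an averaged pass rate could be computed, the sketch below takes per-task pass/fail outcomes for each domain, computes a pass rate per domain, and then takes the unweighted mean. The outcome lists and the function names `pass_rate` and `average_pass_rate` are hypothetical, and the benchmark's actual scoring harness may aggregate differently (for example, by weighting domains by task count).

```python
from statistics import mean


def pass_rate(outcomes):
    """Percentage (0-100) of tasks that were resolved successfully."""
    if not outcomes:
        raise ValueError("no task outcomes given")
    # True counts as 1 and False as 0 when summed.
    return 100.0 * sum(outcomes) / len(outcomes)


def average_pass_rate(results_by_domain):
    """Unweighted mean of the per-domain pass rates (an assumption)."""
    return mean(pass_rate(o) for o in results_by_domain.values())


# Hypothetical outcomes (True = task resolved); NOT real benchmark data.
example = {
    "retail": [True, True, False, True],     # 75.0
    "airline": [True, False, False, True],   # 50.0
}

print(average_pass_rate(example))  # 62.5
```

An unweighted mean treats each domain equally regardless of how many tasks it contains; pooling all tasks before dividing would give a different number whenever the domains differ in size.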

Metrics: pass_rate
Current State of the Art

Claude Opus 4.5 (Anthropic): 79 pass_rate

Tau2-Bench — pass_rate

8 results · 1 SOTA advance · higher is better

[Chart: pass_rate (y-axis, 40 to 80) by date (x-axis, 2026 to 2027); SOTA frontier: Claude Opus 4.5]

Top Models Performance Comparison

Top 8 models ranked by pass_rate

Rank  Model              pass_rate  % of best
1     Claude Opus 4.5    79.0       100.0%
2     GPT-5.2            73.0       92.4%
3     Gemini 3 Pro       69.0       87.3%
4     Claude Sonnet 4.5  63.0       79.7%
5     GPT-5.1            59.0       74.7%
6     Gemini 2.5 Pro     54.0       68.4%
7     Claude 3.7 Sonnet  47.0       59.5%
8     GPT-4o             36.0       45.6%

Best Score: 79.0
Top Model: Claude Opus 4.5
Models Compared: 8
Score Range: 43.0
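
The "% of best" figures and the score range reported here follow directly from the listed scores. A minimal sketch of that arithmetic, assuming "% of best" means each score divided by the top score and rounded to one decimal place (the variable names are illustrative):

```python
# Scores as listed on this page (pass_rate, higher is better).
scores = {
    "Claude Opus 4.5": 79.0,
    "GPT-5.2": 73.0,
    "Gemini 3 Pro": 69.0,
    "Claude Sonnet 4.5": 63.0,
    "GPT-5.1": 59.0,
    "Gemini 2.5 Pro": 54.0,
    "Claude 3.7 Sonnet": 47.0,
    "GPT-4o": 36.0,
}

best = max(scores.values())
score_range = best - min(scores.values())

# Each model's score as a percentage of the best score.
percent_of_best = {
    model: round(100.0 * score / best, 1) for model, score in scores.items()
}

print(f"Best: {best}, range: {score_range}")
for model, pct in percent_of_best.items():
    print(f"{model}: {pct}%")
```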

pass_rate (Primary)

#  Model              Organization  Score  Paper / Code  Date
1  Claude Opus 4.5    Anthropic     79     -             -
2  GPT-5.2            OpenAI        73     -             -
3  Gemini 3 Pro       Google        69     -             -
4  Claude Sonnet 4.5  Anthropic     63     -             -
5  GPT-5.1            OpenAI        59     -             -
6  Gemini 2.5 Pro     Google        54     -             -
7  Claude 3.7 Sonnet  Anthropic     47     -             -
8  GPT-4o (API)       OpenAI        36     -             -