Tau2-Bench.

An agentic benchmark that tests tool-use capabilities in retail and airline customer service domains. It measures a model's ability to use APIs and tools to resolve real-world tasks. The reported score is the average pass rate across domains.
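As a rough illustration of how an average pass rate across domains could be computed, here is a minimal sketch. It assumes an unweighted mean of per-domain pass rates, which this page does not confirm. The function names and the task outcomes are hypothetical, invented only for the example; they are not Tau2-Bench data or its official scoring code.

```python
from statistics import mean


def pass_rate(outcomes):
    """Fraction of tasks resolved in one domain (True means the task passed)."""
    if not outcomes:
        raise ValueError("no task outcomes for this domain")
    return sum(outcomes) / len(outcomes)


def average_pass_rate(results_by_domain):
    """Unweighted mean of per-domain pass rates, expressed as a percentage.

    Assumption: each domain counts equally, regardless of how many tasks it has.
    """
    return 100 * mean(pass_rate(o) for o in results_by_domain.values())


# Hypothetical outcomes, invented for illustration only.
example = {
    "retail": [True, True, False, True],    # 3 of 4 tasks passed
    "airline": [True, False, False, True],  # 2 of 4 tasks passed
}

print(average_pass_rate(example))  # 62.5
```

If domains had very different task counts, a task-weighted average would give a different number, so the weighting should be checked against the source of each score before comparing rows.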

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

Accuracy

Accuracy is the reported evaluation metric for Tau2-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Accuracy: verified, paper, vendor, community, unverified

Pass Rate

Pass Rate is the reported evaluation metric for Tau2-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Pass Rate: verified, paper, vendor, community, unverified
Rank  Model              Trust     Score  Year  Source note
01    Claude Opus 4.5    verified  79     2025  backfilled 2026-04-23 from anthropic.com
02    GPT-5.2            verified  73     2025  backfilled 2026-04-23 from openai.com
03    Gemini 3 Pro       verified  69     2025  backfilled 2026-04-23 from deepmind.google
04    Claude Sonnet 4.5  verified  63     2025  backfilled 2026-04-23 from anthropic.com
05    GPT-5.1            paper     59     N/A   source search failed 2026-04-23
06    Gemini 2.5 Pro     paper     54     N/A   source search failed 2026-04-23
07    Claude 3.7 Sonnet  paper     47     N/A   source search failed 2026-04-23
08    GPT-4o             paper     36     N/A   source search failed 2026-04-23