Who leads the Tau2-Bench benchmark?

GLM-5 currently leads Tau2-Bench with a score of 89.7 on Accuracy.

What is the state-of-the-art score on Tau2-Bench?

The state-of-the-art result on Tau2-Bench is 89.7 (Accuracy), achieved by GLM-5 as of 2026.

How many models are tracked on Tau2-Bench?

Codesota tracks 19 models on Tau2-Bench across 2 metrics.

When was the Tau2-Bench leaderboard last updated?

The Tau2-Bench leaderboard on Codesota includes results through 2026, with the earliest tracked result from 2025.

Tau2-Bench.

Name: Tau2-Bench Benchmark Results
Creator: Unknown
Published: 2025-01-01
License: https://creativecommons.org/licenses/by/4.0/

Agentic benchmark that tests tool-use capabilities across retail and airline customer service domains. It measures a model's ability to use APIs and tools to resolve real-world tasks. The score is the average pass rate across domains.
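As a rough illustration of "average pass rate across domains", the sketch below computes a per-domain pass rate and then takes the unweighted mean of those rates. This is an assumption: the page does not say whether domains are weighted equally or whether all tasks are pooled, and the task outcomes shown are made-up placeholders, not benchmark data. The function names are hypothetical.

```python
from typing import Dict, List


def pass_rate(outcomes: List[bool]) -> float:
    """Percentage of tasks in one domain that were resolved."""
    if not outcomes:
        raise ValueError("a domain needs at least one task outcome")
    return 100.0 * sum(outcomes) / len(outcomes)


def average_pass_rate(by_domain: Dict[str, List[bool]]) -> float:
    """Unweighted mean of per-domain pass rates (assumed aggregation)."""
    rates = [pass_rate(outcomes) for outcomes in by_domain.values()]
    return sum(rates) / len(rates)


# Placeholder outcomes for illustration only; not real Tau2-Bench results.
example_outcomes = {
    "retail": [True, True, False, True],   # 3 of 4 tasks resolved
    "airline": [True, False],              # 1 of 2 tasks resolved
}

average = average_pass_rate(example_outcomes)
print(f"average pass rate: {average:.1f}")
```

Averaging per domain gives each domain equal weight regardless of how many tasks it contains; pooling all tasks before dividing would give a different number whenever the domains differ in size.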

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

Accuracy

Accuracy is one of the two metrics reported for Tau2-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Accuracy: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

Rank	Model	Trust	Score	Year	Links
01	GLM-5	unverified	89.7	2026	Paper, Code, Source
02	Step-3.5-Flash	unverified	88.2	2026	Paper, Code
03	Qwen3.5-397B-A17B	unverified	86.7	2026	Paper, Code, Source
04	Qwen3.5-35B-A3B	unverified	81.2	2026	Paper, Code, Source
05	Intern-S1-Pro	unverified	80.9	2026	Paper, Source
06	DeepSeek-V3.2	unverified	80.3	2025	Paper, Source
07	Qwen3.5-122B-A10B	unverified	79.5	2026	Paper, Code, Source
08	Qwen3.5-27B	unverified	79	2026	Paper, Code, Source
09	Ling-2.6-1T	unverified	78.36	2026	Paper
10	SenseNova-U1-A3B-MoT	unverified	75.39	2026	Paper, Code
11	NVIDIA-Nemotron-3-Super-120B-A12B-BF16	unverified	61.15	2025	Paper, Source
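The muting rule stated above the table can be sketched in code. This is one reading of that rule, not Codesota's actual implementation: a row is treated as muted when any other row from the same or an earlier year has a strictly higher score. The `Result` type and `is_muted` function are hypothetical names; the scores and years are copied from the Accuracy table.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    model: str
    score: float
    year: int


def is_muted(row: Result, rows: List[Result]) -> bool:
    """True if an earlier or same-year result already scored better."""
    return any(o.year <= row.year and o.score > row.score for o in rows)


# Rows from the Accuracy table, in rank order.
ACCURACY_ROWS = [
    Result("GLM-5", 89.7, 2026),
    Result("Step-3.5-Flash", 88.2, 2026),
    Result("Qwen3.5-397B-A17B", 86.7, 2026),
    Result("Qwen3.5-35B-A3B", 81.2, 2026),
    Result("Intern-S1-Pro", 80.9, 2026),
    Result("DeepSeek-V3.2", 80.3, 2025),
    Result("Qwen3.5-122B-A10B", 79.5, 2026),
    Result("Qwen3.5-27B", 79.0, 2026),
    Result("Ling-2.6-1T", 78.36, 2026),
    Result("SenseNova-U1-A3B-MoT", 75.39, 2026),
    Result("NVIDIA-Nemotron-3-Super-120B-A12B-BF16", 61.15, 2025),
]

# Rows that would stay unmuted under this reading of the rule.
unmuted = [r.model for r in ACCURACY_ROWS if not is_muted(r, ACCURACY_ROWS)]
print(unmuted)
```

Because the table records only a year for each result, same-year ties in timing cannot be resolved here; under this reading only the best score in each year can remain unmuted, and only if no earlier year beat it.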

Pass Rate

Pass Rate is one of the two metrics reported for Tau2-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Pass Rate: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

Rank	Model	Trust	Score	Year	Links	Note
01	Claude Opus 4.5	verified	79	2025	Source	Backfilled 2026-04-23 from anthropic.com
02	GPT-5.2	verified	73	2025	Source	Backfilled 2026-04-23 from openai.com
03	Gemini 3 Pro	verified	69	2025	Source	Backfilled 2026-04-23 from deepmind.google
04	Claude Sonnet 4.5	verified	63	2025	Source	Backfilled 2026-04-23 from anthropic.com
05	GPT-5.1	paper	59	N/A	N/A	Source search failed 2026-04-23
06	Gemini 2.5 Pro	paper	54	N/A	N/A	Source search failed 2026-04-23
07	Claude 3.7 Sonnet	paper	47	N/A	N/A	Source search failed 2026-04-23
08	GPT-4o	paper	36	N/A	N/A	Source search failed 2026-04-23

§ 04 · Submit a result

Add to the leaderboard.