Tau2-Bench is an agentic benchmark that tests tool-use capabilities in retail and airline customer service domains. It measures a model's ability to use APIs and tools to resolve real-world tasks. The headline score is the average pass rate across domains.
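As a rough illustration of what "average pass rate across domains" could mean, the sketch below computes a pass rate per domain and then takes the unweighted mean over domains. The page does not say whether the average is weighted by task count, so the unweighted mean is an assumption here. The function names and the sample outcomes are hypothetical and are not taken from the benchmark's code or data.

```python
from statistics import mean


def pass_rate(outcomes):
    """Percentage of tasks resolved in one domain (True = task passed)."""
    return 100.0 * sum(outcomes) / len(outcomes)


def average_pass_rate(results_by_domain):
    """Unweighted mean of per-domain pass rates (assumed aggregation)."""
    return mean(pass_rate(o) for o in results_by_domain.values())


# Hypothetical per-task outcomes, for illustration only; not real benchmark data.
example = {
    "retail": [True, True, False, True],   # 3 of 4 tasks resolved
    "airline": [True, False],              # 1 of 2 tasks resolved
}

print(average_pass_rate(example))
```

With an unweighted mean, each domain counts equally regardless of how many tasks it contains. Pooling all tasks before dividing would weight larger domains more heavily and can give a different number.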
Accuracy is one of the evaluation metrics reported for Tau2-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | GLM-5 | unverified | 89.7 | 2026 | Paper, Code, Source |
| 02 | Step-3.5-Flash | unverified | 88.2 | 2026 | Paper, Code |
| 03 | Qwen3.5-397B-A17B | unverified | 86.7 | 2026 | Paper, Code, Source |
| 04 | Qwen3.5-35B-A3B | unverified | 81.2 | 2026 | Paper, Code, Source |
| 05 | Intern-S1-Pro | unverified | 80.9 | 2026 | Paper, Source |
| 06 | DeepSeek-V3.2 | unverified | 80.3 | 2025 | Paper, Source |
| 07 | Qwen3.5-122B-A10B | unverified | 79.5 | 2026 | Paper, Code, Source |
| 08 | Qwen3.5-27B | unverified | 79 | 2026 | Paper, Code, Source |
| 09 | Ling-2.6-1T | unverified | 78.36 | 2026 | Paper |
| 10 | SenseNova-U1-A3B-MoT | unverified | 75.39 | 2026 | Paper, Code |
| 11 | NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | unverified | 61.15 | 2025 | Paper, Source |
Pass Rate is one of the evaluation metrics reported for Tau2-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | Claude Opus 4.5 | verified | 79 | 2025 | Source |
| 02 | GPT-5.2 | verified | 73 | 2025 | Source |
| 03 | Gemini 3 Pro | verified | 69 | 2025 | Source |
| 04 | Claude Sonnet 4.5 | verified | 63 | 2025 | Source |
| 05 | GPT-5.1 | paper | 59 | N/A | N/A |
| 06 | Gemini 2.5 Pro | paper | 54 | N/A | N/A |
| 07 | Claude 3.7 Sonnet | paper | 47 | N/A | N/A |
| 08 | GPT-4o | paper | 36 | N/A | N/A |