An agentic benchmark that tests tool use in retail and airline customer-service domains. It measures how well a model uses APIs and tools to resolve real-world tasks. Scores are reported as the average pass rate across domains.
19 results indexed across 2 metrics. Shaded row marks current SOTA; ties broken by submission date.
| # | Model | Org | Submitted | Paper / code | accuracy |
|---|---|---|---|---|---|
| 01 | GLM-5 (Open) | Zhipu AI | Feb 2026 | GLM-5: from Vibe Coding to Agentic Engineering · code | 89.70 |
| 02 | Step-3.5-Flash | — | Feb 2026 | Step 3.5 Flash: Open Frontier-Level Intelligence with 11… · code | 88.20 |
| 03 | Qwen3.5-397B-A17B (Open) | Alibaba | Feb 2026 | pwc-dump · code | 86.70 |
| 04 | Qwen3.5-35B-A3B (Open) | Alibaba | Feb 2026 | pwc-dump · code | 81.20 |
| 05 | Intern-S1-Pro | Shanghai AI Lab | Mar 2026 | Intern-S1-Pro: Scientific Multimodal Foundation Model at… | 80.90 |
| 06 | DeepSeek-V3.2 (Open) | DeepSeek | Dec 2025 | DeepSeek-V3.2: Pushing the Frontier of Open Large Langua… | 80.30 |
| 07 | Qwen3.5-122B-A10B (Open) | Alibaba | Feb 2026 | pwc-dump · code | 79.50 |
| 08 | Qwen3.5-27B (Open) | Alibaba | Feb 2026 | pwc-dump · code | 79 |
| 09 | Ling-2.6-1T | — | Apr 2026 | pwc-dump | 78.36 |
| 10 | SenseNova-U1-A3B-MoT | SenseTime | May 2026 | SenseNova-U1: Unifying Multimodal Understanding and Gene… · code | 75.39 |
| 11 | NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | — | Dec 2025 | NVIDIA Nemotron 3: Efficient and Open Intelligence | 61.15 |
| # | Model | Org | Submitted | Paper / code | pass_rate |
|---|---|---|---|---|---|
| 01 | Claude Opus 4.5 | Anthropic | Nov 2025 | editorial | 79 |
| 02 | GPT-5.2 | OpenAI | Dec 2025 | editorial | 73 |
| 03 | Gemini 3 Pro (API) | — | Nov 2025 | editorial | 69 |
| 04 | Claude Sonnet 4.5 | Anthropic | Sep 2025 | editorial | 63 |
| 05 | GPT-5.1 | OpenAI | — | — | 59 |
| 06 | Gemini 2.5 Pro | — | — | — | 54 |
| 07 | Claude 3.7 Sonnet | Anthropic | — | — | 47 |
| 08 | GPT-4o (API) | OpenAI | — | — | 36 |
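The ordering rule stated above the tables (highest score first, ties broken by submission date) can be sketched as a single sort. This is a minimal illustration, not the site's actual implementation: the `rank` helper is hypothetical, the rows are the four dated entries from the pass_rate table, the day of month is unknown and set to the 1st as a placeholder, and "earlier submission wins a tie" is an assumption about how the tie-break is applied.

```python
from datetime import date

# Dated rows from the pass_rate table. Only month and year are published,
# so the day of month (1) is a placeholder assumption.
rows = [
    ("GPT-5.2", 73.0, date(2025, 12, 1)),
    ("Claude Sonnet 4.5", 63.0, date(2025, 9, 1)),
    ("Claude Opus 4.5", 79.0, date(2025, 11, 1)),
    ("Gemini 3 Pro", 69.0, date(2025, 11, 1)),
]


def rank(rows):
    """Sort by score descending; on equal scores the earlier submission ranks first (assumed)."""
    return sorted(rows, key=lambda r: (-r[1], r[2]))


ranked = rank(rows)
for position, (model, score, submitted) in enumerate(ranked, start=1):
    print(f"{position:02d} {model} {score} ({submitted:%b %Y})")
```

None of these four rows tie, so the date only matters when two scores are equal. Negating the score keeps the sort to one pass while still ordering dates ascending.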
Each row below marks a model that set a new record on pass_rate, where higher is better. Intermediate submissions stay in the leaderboard above; only SOTA-setting entries are re-listed here.
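The record rule described here amounts to a running maximum over submissions in date order. The sketch below shows one way to compute it. The `record_setters` function and the `model-a` to `model-d` entries are hypothetical and used only for illustration, and requiring a strict improvement (so that a tie does not count as a new record) is an assumption.

```python
from datetime import date


def record_setters(submissions):
    """Return the entries that strictly beat the best score seen before them.

    `submissions` is an iterable of (model, score, submitted) tuples.
    sorted() is stable, so entries sharing a date keep their input order.
    """
    best = float("-inf")
    records = []
    for model, score, submitted in sorted(submissions, key=lambda s: s[2]):
        if score > best:  # strict improvement only (assumption)
            best = score
            records.append((model, score, submitted))
    return records


# Hypothetical submission history, not taken from the leaderboard.
history = [
    ("model-a", 40.0, date(2025, 1, 1)),
    ("model-b", 55.0, date(2025, 3, 1)),
    ("model-c", 50.0, date(2025, 5, 1)),
    ("model-d", 60.0, date(2025, 7, 1)),
]

records = record_setters(history)
for model, score, submitted in records:
    print(f"{submitted:%b %Y}  {model}  {score}")
```

Because the published dates are month-level, two submissions from the same month cannot be ordered from the table alone; a real implementation would need full timestamps to decide which of them held the record first.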
Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.
Submit a checkpoint and a reproduction script. We will run it and publish the score. If it takes the top spot, we will annotate that step on the progress chart with your name.