Chain-of-thought, test-time compute, graduate-level exams. The area where benchmark saturation meets instruction-following trade-offs — and where the cost column matters as much as the score.
AI reasoning transformed in 2025, with test-time compute scaling now rivaling traditional training-time scaling. Leading models solve graduate-level problems through extended thinking, but cost and instruction-following trade-offs remain critical deployment considerations.
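To make "test-time compute" concrete, here is a minimal self-consistency sketch: sample several reasoning paths and majority-vote the final answer. The model callable and the toy `noisy_solver` stand-in are placeholders, not any vendor's API.

```python
import random
from collections import Counter
from typing import Callable

def self_consistency(ask: Callable[[str], str], prompt: str, n: int = 8) -> str:
    """Spend more inference-time compute: sample n independent
    reasoning paths, then majority-vote the final answers."""
    answers = [ask(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for a sampled model call: right ~70% of the time.
def noisy_solver(prompt: str) -> str:
    return "42" if random.random() < 0.7 else str(random.randint(0, 99))

print(self_consistency(noisy_solver, "What is 6 * 7?"))  # usually "42"
```

More samples buy accuracy at a linear cost in tokens, which is exactly the trade-off the rest of this page keeps returning to.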
Each task opens onto a leaderboard of its canonical benchmark, with the full submission history and dated scores. Tasks without an indexed result are listed elsewhere in the register; the table below is sorted by result count.
Leading scores for the headline benchmarks in this area, drawn from the registry. Shaded rows mark the top result per task; follow any row into the full leaderboard.
| # | Task | Benchmark | Leading model | Score |
|---|---|---|---|---|
| 01 | Mathematical Reasoning | Grade School Math 8K | ERNIE 5.0 | 99.7% accuracy |
| 02 | Commonsense Reasoning | AI2 Reasoning Challenge | o3 | 98.1% accuracy |
| 03 | Arithmetic Reasoning | Math Word Problem Repository | GPT-4o | 97.2% accuracy |
| 04 | Multi-step Reasoning | BIG-Bench Hard (BBH) | Claude 3.5 Sonnet | 93.1% accuracy |
| 05 | Logical Reasoning | Abstraction and Reasoning Corpus for AGI (v1) | o3 | 87.5% accuracy |
- 99.5% on AIME 2025 with interpreter, best cost-performance ratio for mathematical reasoning
- 93.8% on GPQA Diamond, 41.0% on Humanity's Last Exam, excels at cross-domain scientific analysis
- 74.9% on SWE-bench Verified, 89th percentile on Codeforces, superior tool use and agentic capabilities
- 59.4% GPQA vs GPT-4o's 53.6%, 2x faster than Claude 3 Opus, excels at analytical thinking beyond pure math
- MIT license, matches o1 performance, distilled variants offer competitive reasoning at a fraction of frontier costs
- 87.6% Video-MMMU, 81% MMMU-Pro, 1M-token context window for complex multi-document reasoning
- Use reasoning models for decomposition and decisions, faster models for execution, for a 3-4x system performance improvement (see the routing sketch below)
- 32B parameters with competitive reasoning, 256K context (extensible to 1M), state of the art among open-weight thinking models
- Faster, cheaper, better instruction-following: reasoning models are overkill for information retrieval and simple tasks
For most tasks (customer service, content generation, classification), standard LLMs like GPT-4o or Claude 3.5 Sonnet remain superior. Reasoning models waste compute on simple tasks and cost 3-10x more because of their token consumption. Reserve them for genuinely complex multi-step problems.
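A sketch of that split, under loud assumptions: models are plain text-in, text-out callables, and the `is_complex` heuristic is whatever cheap classifier you trust. It combines the routing rule above with the planner-executor pattern from the list.

```python
from typing import Callable

Model = Callable[[str], str]  # any text-in, text-out model client

def route(task: str, reasoner: Model, fast: Model,
          is_complex: Callable[[str], bool]) -> str:
    """Cheap tasks go straight to the standard model; complex tasks
    get the planner-executor split described in the list above."""
    if not is_complex(task):
        return fast(task)
    # The reasoner decomposes and decides; the fast model executes.
    plan = reasoner(f"Break this task into numbered steps:\n{task}")
    steps = [line for line in plan.splitlines() if line.strip()]
    results = [fast(f"Carry out this step:\n{step}") for step in steps]
    return reasoner("Combine these step results into a final answer:\n"
                    + "\n".join(results))
```

In practice `is_complex` can be a keyword heuristic, a small classifier, or a cheap model call; the point is that the reasoner only sees work that earns its price.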
Analysis of 23 reasoning models reveals widespread inability to follow user constraints, especially on harder problems. Models trained with extended CoT sacrifice controllability for reasoning depth. If your app requires strict compliance with specifications, standard models may outperform reasoning models.
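If strict compliance matters, check it in code rather than trusting the trace. A minimal retry-then-fail wrapper, with a hypothetical word-limit constraint as the example check:

```python
from typing import Callable

def comply(ask: Callable[[str], str], prompt: str,
           check: Callable[[str], bool], retries: int = 2) -> str:
    """Re-prompt when the model ignores a constraint; raise if it
    never complies, so a standard model can take over upstream."""
    for _ in range(retries + 1):
        out = ask(prompt)
        if check(out):
            return out
        prompt += "\n\nYour last answer violated the constraints. Fix it."
    raise ValueError("model never satisfied the constraint")

# Hypothetical example constraint: a hard length cap.
def under_50_words(text: str) -> bool:
    return len(text.split()) <= 50
```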
DeepSeek-R1 and Qwen3 generate 1.5-4x more tokens than closed models for equivalent reasoning, so lower per-token pricing doesn't always mean lower total cost. Benchmark on your actual workload before assuming open weights save money.
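The arithmetic is worth spelling out. Prices and token counts below are illustrative placeholders, not current list prices:

```python
def output_cost(tokens_out: int, price_per_mtok: float) -> float:
    """Dollar cost of one response's output tokens."""
    return tokens_out * price_per_mtok / 1_000_000

# Illustrative numbers only: a cheaper open-weight model that emits
# 3x the tokens of a pricier closed model on the same problem.
closed = output_cost(tokens_out=2_000, price_per_mtok=10.00)  # $0.020
open_w = output_cost(tokens_out=6_000, price_per_mtok=4.00)   # $0.024
print(f"closed ${closed:.3f} vs open-weight ${open_w:.3f}")
```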
GPQA Diamond is approaching saturation at 90%+ accuracy, and AIME questions show data-contamination risk: models score better on 2024 questions than on 2025 ones. Internal evaluation on private, domain-specific problems matters more than public benchmark scores.
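A private eval needs very little machinery. A minimal harness, assuming (question, expected_answer) pairs and exact-match scoring:

```python
from typing import Callable, Iterable

def private_eval(ask: Callable[[str], str],
                 problems: Iterable[tuple[str, str]]) -> float:
    """Accuracy on a held-out, never-published problem set, immune
    to the contamination that inflates public benchmark scores."""
    items = list(problems)
    correct = sum(ask(q).strip() == a.strip() for q, a in items)
    return correct / len(items)

# Format assumption: (question, expected_answer) pairs kept private.
suite = [("17 * 23 = ?", "391"), ("Capital of France?", "Paris")]
```

Exact match is the crudest possible grader; swap in whatever judge fits your domain. The point is that the problems never leave your repo.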
Current reasoning models burn tokens generating natural language traces. The future is latent reasoning - internal compressed representations that preserve benefits without token overhead. This could fundamentally alter reasoning model economics in 2025-2026.
The benchmarks above come from the same Postgres registry that powers the wider Codesota index. Each task has exactly one canonical dataset. Each score carries a metric direction, a date and — where possible — a reproduction status.
When a score regresses, the prior record stays visible. When a benchmark is contested, we mark it rather than delete it. The goal is a register that argues in public.
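For concreteness, the record shape those two paragraphs imply, sketched as a dataclass. Field names are guesses, not the actual Codesota schema:

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

class Direction(Enum):
    HIGHER_IS_BETTER = "higher"
    LOWER_IS_BETTER = "lower"

@dataclass(frozen=True)
class ScoreRecord:
    task: str                # e.g. "Mathematical Reasoning"
    benchmark: str           # exactly one canonical dataset per task
    model: str
    score: float
    direction: Direction     # metric direction travels with the score
    recorded_on: date
    reproduced: bool | None  # None = reproduction not yet attempted
    contested: bool = False  # contested benchmarks are marked, not deleted

# Regressions stay visible: append new records, never overwrite old ones.
```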
Sibling area hubs, the unified task index and the methodology that binds them.