Codesota · Registry · Reasoning · The area-level register · Issue: April 22, 2026
Area hub · Reasoning

Reasoning,
graded.

Chain-of-thought, test-time compute, graduate-level exams. The area where benchmark saturation meets instruction-following trade-offs — and where the cost column matters as much as the score.

AI reasoning transformed in 2025, with test-time compute scaling now rivaling traditional training-time approaches. Leading models solve graduate-level problems through extended thinking, but cost and instruction-following trade-offs remain critical deployment considerations.

§ 01 · Top tasks

Sub-tasks in reasoning.

Each task opens onto a leaderboard of its canonical benchmark, with the full submission history and dated scores. Tasks without an indexed result are listed elsewhere in the register; the table below is sorted by result count.

Fig 01 · Showing top 5 of 5 tasks under Reasoning.
§ 02 · Top benchmarks

Current state of the art.

Leading scores for the headline benchmarks in this area, drawn from the registry. Shaded rows mark the top result per task; follow any row into the full leaderboard.

#   Task                    Benchmark                                      Leading model      Score
01  Mathematical Reasoning  Grade School Math 8K                           ERNIE 5.0          99.7% accuracy
02  Commonsense Reasoning   AI2 Reasoning Challenge                        o3                 98.1% accuracy
03  Arithmetic Reasoning    Math Word Problem Repository                   GPT-4o             97.2% accuracy
04  Multi-step Reasoning    BIG-Bench Hard (BBH)                           Claude 3.5 Sonnet  93.1% accuracy
05  Logical Reasoning       Abstraction and Reasoning Corpus for AGI (v1)  o3                 87.5% accuracy
Fig 02 · Headline benchmarks for Reasoning. Full leaderboards, dated history and reproduction status live on the task pages.
Side note

State of the Field (2025)

  • 01 · OpenAI's o3 and o4-mini reach 98-99% on the AIME math competition, while DeepSeek-R1 matches o1 performance as an open-source, MIT-licensed model
  • 02 · Test-time compute scaling now rivals training-time scaling: smaller models with extended inference match or exceed larger models on complex reasoning tasks
  • 03 · Gemini 3 Pro leads on multimodal reasoning (93.8% GPQA Diamond, 45.1% ARC-AGI-2), while Claude 3.5 Sonnet excels at qualitative reasoning (59.4% GPQA)
  • 04 · Reasoning models struggle with instruction-following and exact arithmetic despite solving conceptually sophisticated problems: a fundamental trade-off between reasoning depth and controllability
Picks by use-case

What to reach for.

Editorial picks · not vendor rankings
Math Problem Solving (High Accuracy)
OpenAI o4-mini

99.5% on AIME 2025 with tool use (code interpreter); best cost-performance ratio for mathematical reasoning

Graduate-Level Scientific Reasoning
Gemini 3 Deep Think

93.8% on GPQA Diamond, 41.0% on Humanity's Last Exam, excels at cross-domain scientific analysis

Coding Challenges & SWE Tasks
OpenAI o3 with thinking mode

74.9% on SWE-bench Verified, 89th percentile Codeforces, superior tool use and agentic capabilities

Qualitative Analysis & Multi-Perspective Reasoning
Claude 3.5 Sonnet

59.4% GPQA vs GPT-4o's 53.6%, 2x faster than Claude 3 Opus, excels at analytical thinking beyond pure math

Open-Weight Reasoning (Production Deployment)
DeepSeek-R1 or R1-Distill variants

MIT license, matches o1 performance, distilled variants offer competitive reasoning at fraction of frontier costs

Multimodal Reasoning (Vision + Text)
Gemini 3 Pro

87.6% Video-MMMU, 81% MMMU-Pro, 1M token context window for complex multi-document reasoning

Agentic Planning & Decision-Making
Hybrid: o3 for planning + GPT-4o for execution

Use reasoning models for decomposition and decisions, faster models for execution. 3-4x system performance improvement
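The plan/execute split described above can be sketched as a simple per-step router. The model names and the `requires_reasoning` flag below are illustrative assumptions, not a vendor API; the actual model calls are stubbed out.

```python
# Hypothetical hybrid router: a reasoning model handles decomposition and
# decisions, a faster model carries out each executed step. Model names and
# the step schema are assumptions for illustration.

PLANNER_MODEL = "o3"        # reasoning model: decomposition and decisions
EXECUTOR_MODEL = "gpt-4o"   # faster, cheaper model: step execution

def route(step: dict) -> str:
    """Pick a model per step: reasoning only where multi-step logic is needed."""
    return PLANNER_MODEL if step.get("requires_reasoning") else EXECUTOR_MODEL

def assign_models(plan: list[dict]) -> list[tuple[str, str]]:
    """Map each planned step to a model; real API calls would go here."""
    return [(step["name"], route(step)) for step in plan]

plan = [
    {"name": "decompose goal", "requires_reasoning": True},
    {"name": "fetch records", "requires_reasoning": False},
    {"name": "draft summary", "requires_reasoning": False},
]
assignments = assign_models(plan)
```

The point of the pattern is that the expensive model only sees the steps that genuinely need it; everything else runs on the cheaper executor.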

Cost-Conscious Reasoning at Scale
Qwen3-32B or QwQ-32B

32B parameters with competitive reasoning, 256K context (up to 1M), state-of-the-art among open-weight thinking models

General-Purpose Chat (NOT Complex Reasoning)
GPT-4o or Claude 3.5 Sonnet

Faster, cheaper, better instruction-following. Reasoning models are overkill for information retrieval and simple tasks

Editor's note

Honest takes.

Don't Default to Reasoning Models

For most tasks (customer service, content generation, classification), standard LLMs like GPT-4o or Claude 3.5 Sonnet remain superior. Reasoning models waste compute on simple tasks and cost 3-10x more due to their token consumption. Reserve them for genuinely complex multi-step problems.

Instruction-Following Degrades with Reasoning

Analysis of 23 reasoning models reveals a widespread inability to follow user constraints, especially on harder problems. Models trained with extended chain-of-thought sacrifice controllability for reasoning depth. If your application requires strict compliance with specifications, standard models may outperform reasoning models.

Open-Weight Can Be More Expensive

DeepSeek-R1 and Qwen3 generate 1.5-4x more tokens than closed models for equivalent reasoning, so lower per-token pricing doesn't always mean lower total cost. Benchmark on your actual workload before assuming open weights save money.
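A back-of-the-envelope check makes the point concrete: once verbosity is factored in, a lower per-token price can still lose on total cost. The prices and token counts below are illustrative assumptions, not quoted rates for any provider.

```python
# Illustrative cost comparison: a terse closed model vs. an open-weight model
# that emits 4x the tokens at a 70% lower per-million-token price.
# All numbers are made-up assumptions for the arithmetic, not real pricing.

def total_cost(output_tokens: int, usd_per_million: float) -> float:
    """Total output cost in USD for a given token count and rate."""
    return output_tokens * usd_per_million / 1_000_000

closed = total_cost(2_000, usd_per_million=10.0)          # terse closed model
open_weight = total_cost(2_000 * 4, usd_per_million=3.0)  # 4x tokens, cheaper rate

# closed = $0.020, open_weight = $0.024: the "cheaper" model costs more here.
```

Run the same arithmetic with your own measured token counts before deciding.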

Benchmark Saturation is Real

GPQA Diamond approaches saturation at 90%+ accuracy, and AIME questions show data-contamination risk: models perform better on 2024 questions than on 2025 ones. Internal evaluation on private, domain-specific problems matters more than public benchmark scores.

Latent Reasoning is the Next Frontier

Current reasoning models burn tokens generating natural-language traces. The next frontier is latent reasoning: internal, compressed representations that preserve the benefits without the token overhead. This could fundamentally alter reasoning-model economics in 2025-2026.

§ 03 · Method
How this area is tracked

Every row in this register is dated and sourced.

The benchmarks above come from the same Postgres registry that powers the wider Codesota index. Each task has exactly one canonical dataset. Each score carries a metric direction, a date and — where possible — a reproduction status.

When a score regresses, the prior record stays visible. When a benchmark is contested, we mark it rather than delete it. The goal is a register that argues in public.

Full methodology · The unified task index
§ Final · Related

Neighbouring registers.

Sibling area hubs, the unified task index and the methodology that binds them.