Reasoning
Testing if your model can think logically? Benchmark math problem solving, commonsense understanding, and multi-step reasoning capabilities.
AI reasoning has transformed in 2025 with test-time compute scaling rivaling traditional training approaches. Leading models now solve graduate-level problems through extended thinking, but cost and instruction-following trade-offs remain critical deployment considerations.
State of the Field (Late 2025)
- OpenAI's o3 and o4-mini achieve 98-99% on the AIME math competition, while DeepSeek-R1 matches o1 performance as an open-source, MIT-licensed model
- Test-time compute scaling now rivals training-time scaling: smaller models with extended inference match or exceed larger models on complex reasoning tasks (see the sketch after this list)
- Gemini 3 Pro leads on multimodal reasoning (93.8% GPQA Diamond, 45.1% ARC-AGI-2), while Claude 3.5 Sonnet excels at qualitative reasoning (59.4% GPQA)
- Reasoning models struggle with instruction-following and exact arithmetic despite solving conceptually sophisticated problems, a fundamental trade-off between reasoning depth and controllability
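As a concrete illustration of test-time compute scaling, the sketch below samples several independent chain-of-thought solutions and majority-votes the final answer (the self-consistency pattern). It assumes the OpenAI Python SDK; the model name, prompt wording, and ANSWER-tag convention are placeholders rather than a prescribed setup.

```python
# Minimal self-consistency sketch: spend more inference compute by sampling
# several reasoning traces, then majority-vote the extracted final answer.
# Model name, prompts, and answer format are illustrative assumptions.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def solve_with_self_consistency(problem: str, samples: int = 8) -> str:
    answers = []
    for _ in range(samples):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder; any model that supports sampling works
            temperature=0.8,      # diversity across reasoning traces
            messages=[
                {"role": "system",
                 "content": "Solve the problem step by step. End with 'ANSWER: <value>'."},
                {"role": "user", "content": problem},
            ],
        )
        text = resp.choices[0].message.content or ""
        if "ANSWER:" in text:
            answers.append(text.rsplit("ANSWER:", 1)[1].strip())
    # More samples cost more tokens, but the majority answer is usually more reliable.
    return Counter(answers).most_common(1)[0][0] if answers else ""
```

Doubling `samples` roughly doubles inference cost, which is exactly the compute-for-accuracy trade the bullet above describes.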
Quick Recommendations
Math Problem Solving (High Accuracy)
OpenAI o4-mini
99.5% on AIME 2025 with a code interpreter; best cost-performance ratio for mathematical reasoning
Graduate-Level Scientific Reasoning
Gemini 3 Deep Think
93.8% on GPQA Diamond, 41.0% on Humanity's Last Exam, excels at cross-domain scientific analysis
Coding Challenges & SWE Tasks
OpenAI o3 with thinking mode
74.9% on SWE-bench Verified, 89th percentile Codeforces, superior tool use and agentic capabilities
Qualitative Analysis & Multi-Perspective Reasoning
Claude 3.5 Sonnet
59.4% GPQA vs GPT-4o's 53.6%, 2x faster than Claude 3 Opus, excels at analytical thinking beyond pure math
Open-Weight Reasoning (Production Deployment)
DeepSeek-R1 or R1-Distill variants
MIT license, matches o1 performance; distilled variants offer competitive reasoning at a fraction of frontier costs
Multimodal Reasoning (Vision + Text)
Gemini 3 Pro
87.6% Video-MMMU, 81% MMMU-Pro, 1M token context window for complex multi-document reasoning
Agentic Planning & Decision-Making
Hybrid: o3 for planning + GPT-4o for execution
Use reasoning models for decomposition and decisions, faster models for execution; 3-4x system performance improvement (see the sketch after these recommendations)
Cost-Conscious Reasoning at Scale
Qwen3-32B or QwQ-32B
32B parameters with competitive reasoning, 256K context (up to 1M), state-of-the-art among open-weight thinking models
General-Purpose Chat (NOT Complex Reasoning)
GPT-4o or Claude 3.5 Sonnet
Faster, cheaper, better instruction-following. Reasoning models are overkill for information retrieval and simple tasks
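To make the hybrid planning recommendation above concrete, here is a minimal sketch in which a reasoning model only decomposes the task and a faster model executes each step. It assumes the OpenAI Python SDK; the model names, prompts, and one-step-per-line plan format are illustrative placeholders, not a fixed recipe.

```python
# Hybrid pattern sketch: reasoning model plans, cheaper model executes.
# Model names, prompts, and the line-per-step plan format are assumptions.
from openai import OpenAI

client = OpenAI()

def plan_then_execute(task: str) -> list[str]:
    # 1) Planning: spend reasoning-model tokens only on decomposition and decisions.
    plan = client.chat.completions.create(
        model="o3",  # placeholder reasoning model
        messages=[{"role": "user",
                   "content": f"Break this task into short numbered steps:\n{task}"}],
    ).choices[0].message.content

    # 2) Execution: run each step on a faster, cheaper model.
    results = []
    for step in (line for line in plan.splitlines() if line.strip()):
        out = client.chat.completions.create(
            model="gpt-4o",  # placeholder execution model
            messages=[{"role": "user",
                       "content": f"Overall task: {task}\nCarry out this step: {step}"}],
        ).choices[0].message.content
        results.append(out)
    return results
```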
Tasks & Benchmarks
Commonsense Reasoning
Reasoning about everyday situations (CommonsenseQA, HellaSwag).
Mathematical Reasoning
Solving math word problems (GSM8K, MATH, Minerva); see the scoring sketch after these task descriptions.
Multi-step Reasoning
Complex reasoning requiring multiple inference steps (HotpotQA).
Arithmetic Reasoning
Performing arithmetic calculations and solving equations.
Logical Reasoning
Solving logic puzzles and deductive problems.
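As noted under Mathematical Reasoning above, math word-problem benchmarks are typically scored by extracting the model's final numeric answer and exact-matching it against the reference. The sketch below shows that pattern on toy data; the extraction regex and data format are simplifying assumptions, not the official GSM8K harness.

```python
# Sketch of exact-match scoring for math word-problem benchmarks:
# take the last number in the model output as its final answer.
# The regex and toy data below are simplifying assumptions.
import re

def extract_final_number(text: str) -> str | None:
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    correct = sum(
        1 for pred, ref in zip(predictions, references)
        if extract_final_number(pred) == extract_final_number(ref)
    )
    return correct / len(references)

preds = ["Step 1: 3 * 4 = 12. Step 2: 12 + 5 = 17. The answer is 17.",
         "The answer is 40."]
refs = ["17", "42"]
print(exact_match_accuracy(preds, refs))  # 0.5 -- one of the two answers matches
```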
Datasets
Commonsense Reasoning
ARC: 7,787 science exam questions requiring reasoning. The Challenge set contains harder questions that retrieval-based methods fail on.
CommonsenseQA: 12,247 multiple-choice questions requiring commonsense reasoning about everyday concepts.
HellaSwag: 70K sentence-completion problems testing commonsense natural language inference.
MMLU: 15,908 multiple-choice questions across 57 subjects, from elementary to professional level.
WinoGrande: 44K Winograd-style problems requiring commonsense reasoning to resolve pronoun references.
Mathematical Reasoning
AIME 2024: 30 challenging math problems from the 2024 AIME competition. Tests advanced mathematical reasoning.
GSM8K: 8,500 grade school math word problems requiring multi-step reasoning. The most popular math reasoning benchmark.
MATH: 12,500 competition mathematics problems from AMC, AIME, and other sources. Harder than GSM8K.
Multi-step Reasoning
GPQA: 448 expert-level questions in biology, physics, and chemistry. Designed to be unsearchable.
HotpotQA: 113K question-answer pairs requiring reasoning over multiple Wikipedia documents.
StrategyQA: 2,780 yes/no questions requiring implicit multi-step reasoning to answer.
Arithmetic Reasoning
MAWPS: 3,320 arithmetic word problems from various sources, testing basic arithmetic reasoning.
SVAMP: 1,000 elementary-level math word problems testing robustness of arithmetic reasoning.
Logical Reasoning
LogiQA: 8,678 logical reasoning questions from the National Civil Servants Examination of China.
ReClor: 6,138 reading comprehension questions requiring logical reasoning, drawn from GMAT and LSAT exams.
Honest Takes
Don't Default to Reasoning Models
For most tasks - customer service, content generation, classification - standard LLMs like GPT-4o or Claude 3.5 Sonnet remain superior. Reasoning models waste compute on simple tasks and cost 3-10x more due to token consumption. Reserve them for genuinely complex multi-step problems.
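One way to act on this is to route only apparently complex requests to a reasoning model and send everything else to a standard model. The sketch below uses a crude length-and-keyword heuristic with placeholder model names via the OpenAI Python SDK; a production router would use a trained classifier or a confidence signal instead.

```python
# Rough routing sketch: reserve the reasoning model for apparently complex prompts.
# The heuristic and model names are illustrative assumptions, not a recommendation.
from openai import OpenAI

client = OpenAI()

COMPLEX_HINTS = ("prove", "derive", "optimize", "plan", "multi-step", "debug")

def answer(prompt: str) -> str:
    needs_reasoning = len(prompt) > 800 or any(h in prompt.lower() for h in COMPLEX_HINTS)
    model = "o4-mini" if needs_reasoning else "gpt-4o"  # placeholder model names
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```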
Instruction-Following Degrades with Reasoning
Analysis of 23 reasoning models reveals widespread inability to follow user constraints, especially on harder problems. Models trained with extended CoT sacrifice controllability for reasoning depth. If your app requires strict compliance with specifications, standard models may outperform reasoning models.
Open-Weight Can Be More Expensive
DeepSeek-R1 and Qwen3 generate 1.5-4x more tokens than closed models for equivalent reasoning. Lower per-token pricing doesn't always mean lower total cost. Benchmark on your actual workload before assuming open-weight saves money.
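A quick back-of-the-envelope check makes the point: total cost is tokens generated times price per token, not price alone. The prices and token multiplier below are placeholder assumptions, not current list prices.

```python
# Total cost depends on tokens generated as well as price per token.
# Prices and token counts are placeholder assumptions for illustration.
def cost_per_query(output_tokens: int, usd_per_million_tokens: float) -> float:
    return output_tokens / 1_000_000 * usd_per_million_tokens

closed_model = cost_per_query(2_000, 10.00)      # 2K output tokens at $10 / 1M tokens
open_weight  = cost_per_query(2_000 * 3, 4.00)   # 3x the tokens at $4 / 1M tokens

print(f"closed model: ${closed_model:.4f} per query")  # $0.0200
print(f"open weight:  ${open_weight:.4f} per query")   # $0.0240, higher despite cheaper tokens
```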
Benchmark Saturation is Real
GPQA Diamond approaches saturation at 90%+ accuracy. AIME questions show data contamination risk - models perform better on 2024 vs 2025 questions. Internal evaluation on private, domain-specific problems matters more than public benchmark scores.
Latent Reasoning is the Next Frontier
Current reasoning models burn tokens generating natural language traces. The future is latent reasoning - internal compressed representations that preserve benefits without token overhead. This could fundamentally alter reasoning model economics in 2025-2026.