Reasoning

Testing if your model can think logically? Benchmark math problem solving, commonsense understanding, and multi-step reasoning capabilities.

5 tasks · 15 datasets · 51 results

AI reasoning has transformed in 2025 with test-time compute scaling rivaling traditional training approaches. Leading models now solve graduate-level problems through extended thinking, but cost and instruction-following trade-offs remain critical deployment considerations.

State of the Field (2025)

  • OpenAI's o3 and o4-mini achieve 98-99% on the AIME math competition, while DeepSeek-R1 matches o1 performance as an open-source, MIT-licensed model
  • Test-time compute scaling now rivals training-time scaling: smaller models with extended inference match or exceed larger models on complex reasoning tasks (see the sketch after this list)
  • Gemini 3 Pro leads on multimodal reasoning (93.8% GPQA Diamond, 45.1% ARC-AGI-2); Claude 3.5 Sonnet excels at qualitative reasoning (59.4% GPQA)
  • Reasoning models struggle with instruction-following and exact arithmetic despite solving conceptually sophisticated problems, a fundamental trade-off between reasoning depth and controllability
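
The second point above comes down to a per-request thinking budget. Below is a minimal sketch, assuming the OpenAI Python SDK and an o-series model that accepts the reasoning_effort parameter (other providers expose thinking budgets under different names, such as Anthropic's thinking block); the model name and question are illustrative:

```python
# Sketch: trading inference-time compute for accuracy on a single hard problem.
# Assumes the OpenAI Python SDK and an o-series reasoning model that accepts
# reasoning_effort; the model name and question are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "How many positive integers n < 1000 are divisible by neither 2 nor 5?"

for effort in ("low", "high"):
    resp = client.chat.completions.create(
        model="o4-mini",
        reasoning_effort=effort,  # "low" | "medium" | "high"
        messages=[{"role": "user", "content": question}],
    )
    # Higher effort usually means more (hidden) reasoning tokens and higher cost.
    print(f"effort={effort}: {resp.usage.completion_tokens} completion tokens")
    print((resp.choices[0].message.content or "")[:200])
```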

Quick Recommendations

Math Problem Solving (High Accuracy)

OpenAI o4-mini

99.5% on AIME 2025 with a Python interpreter; best cost-performance ratio for mathematical reasoning

Graduate-Level Scientific Reasoning

Gemini 3 Deep Think

93.8% on GPQA Diamond, 41.0% on Humanity's Last Exam, excels at cross-domain scientific analysis

Coding Challenges & SWE Tasks

OpenAI o3 with thinking mode

74.9% on SWE-bench Verified, 89th percentile Codeforces, superior tool use and agentic capabilities

Qualitative Analysis & Multi-Perspective Reasoning

Claude 3.5 Sonnet

59.4% GPQA vs GPT-4o's 53.6%, 2x faster than Claude 3 Opus, excels at analytical thinking beyond pure math

Open-Weight Reasoning (Production Deployment)

DeepSeek-R1 or R1-Distill variants

MIT license, matches o1 performance; distilled variants offer competitive reasoning at a fraction of frontier costs

Multimodal Reasoning (Vision + Text)

Gemini 3 Pro

87.6% Video-MMMU, 81% MMMU-Pro, 1M token context window for complex multi-document reasoning

Agentic Planning & Decision-Making

Hybrid: o3 for planning + GPT-4o for execution

Use reasoning models for decomposition and decisions, faster models for execution; this split can yield a 3-4x improvement in overall system performance (see the sketch after these recommendations)

Cost-Conscious Reasoning at Scale

Qwen3-32B or QwQ-32B

32B parameters with competitive reasoning, 256K context (up to 1M), state-of-the-art among open-weight thinking models

General-Purpose Chat (NOT Complex Reasoning)

GPT-4o or Claude 3.5 Sonnet

Faster, cheaper, better instruction-following. Reasoning models are overkill for information retrieval and simple tasks
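
The agentic-planning recommendation above splits work between a reasoning model and a faster executor. Here is a minimal sketch of that split, assuming the OpenAI Python SDK; plan_task and execute_step are illustrative helpers, not a library API, and the goal string is a placeholder:

```python
# Planner/executor split: a reasoning model decomposes the task once,
# a cheaper, faster model carries out each step.
from openai import OpenAI

client = OpenAI()

def plan_task(goal: str) -> list[str]:
    """Ask the reasoning model for a short plan, one step per line."""
    resp = client.chat.completions.create(
        model="o3",  # reasoning model: slower and pricier, good at decomposition
        messages=[{
            "role": "user",
            "content": f"List 3-6 concrete steps to achieve this goal, one per line, no numbering:\n{goal}",
        }],
    )
    return [line.strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]

def execute_step(step: str, context: str) -> str:
    """Carry out one step with a fast general-purpose model."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # execution model: fast, cheap, strong instruction-following
        messages=[{"role": "user", "content": f"Context so far:\n{context}\n\nDo this step:\n{step}"}],
    )
    return resp.choices[0].message.content

goal = "Summarize last quarter's support tickets and draft three process fixes."
context = ""
for step in plan_task(goal):
    context += f"\n## {step}\n{execute_step(step, context)}"
print(context)
```

The design point is that the expensive model is called once per task while the cheap model is called once per step.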

Tasks & Benchmarks


Commonsense Reasoning

ARC-Challenge (AI2 Reasoning Challenge, 2018)
SOTA: 96.7 (accuracy)
Claude 3.5 Sonnet

7,787 science exam questions requiring reasoning. The Challenge set contains the harder questions that retrieval-based and word co-occurrence methods fail on.

CommonsenseQA (2019)
SOTA: 85.4 (accuracy)
GPT-4o

12,247 multiple choice questions requiring commonsense reasoning about everyday concepts.

HellaSwag (2019)
SOTA: 95.3 (accuracy)
GPT-4o

70K sentence completion problems testing commonsense natural language inference.

MMLU (Massive Multitask Language Understanding, 2021)
SOTA: 92.3 (accuracy)
o1-preview

15,908 multiple choice questions across 57 subjects from elementary to professional level.

WinoGrande (2019)
SOTA: 87.5 (accuracy)
GPT-4o

44K Winograd-style problems requiring commonsense reasoning to resolve pronoun references.

Mathematical Reasoning

AIME 2024 (American Invitational Mathematics Examination, 2024)
SOTA: 83.3 (accuracy)
o1-preview

30 challenging math problems from the 2024 AIME competition. Tests advanced mathematical reasoning.

GSM8K (Grade School Math 8K, 2021)
SOTA: 97.8 (accuracy)
o1-preview

8,500 grade school math word problems requiring multi-step reasoning. The most popular math reasoning benchmark.

MATH (Mathematics Aptitude Test of Heuristics, 2021)
SOTA: 94.8 (accuracy)
o1-preview

12,500 competition mathematics problems from AMC, AIME, and other sources. Harder than GSM8K.

Multi-step Reasoning

GPQA (Graduate-Level Google-Proof Q&A, 2024)
SOTA: 78.0 (accuracy)
o1-preview

448 expert-level questions in biology, physics, and chemistry. Designed to be Google-proof: difficult to answer correctly even with unrestricted web search.

HotpotQA (2018)
SOTA: 71.3 (F1)
GPT-4o

113K question-answer pairs requiring reasoning over multiple Wikipedia documents.

StrategyQA (2021)
SOTA: 82.1 (accuracy)
GPT-4o

2,780 yes/no questions requiring implicit multi-step reasoning to answer.

Arithmetic Reasoning

MAWPS (Math Word Problem Repository, 2016)
SOTA: 97.2 (accuracy)
GPT-4o

3,320 arithmetic word problems from various sources, testing basic arithmetic reasoning.

SVAMP (Simple Variations on Arithmetic Math Word Problems, 2021)
SOTA: 93.7 (accuracy)
GPT-4o

1,000 elementary-level math word problems testing robustness of arithmetic reasoning.

Logical Reasoning

LogiQA (2020)
SOTA: 56.3 (accuracy)
GPT-4o

8,678 logical reasoning questions from the National Civil Servants Examination of China.

ReClor (Reading Comprehension Dataset Requiring Logical Reasoning, 2020)
SOTA: 72.4 (accuracy)
GPT-4o

6,138 reading comprehension questions requiring logical reasoning from GMAT/LSAT exams.

Honest Takes

Don't Default to Reasoning Models

For most tasks (customer service, content generation, classification), standard LLMs like GPT-4o or Claude 3.5 Sonnet remain superior. Reasoning models waste compute on simple tasks and cost 3-10x more due to heavy token consumption. Reserve them for genuinely complex, multi-step problems.
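
One lightweight way to act on this is a coarse router that only escalates to a reasoning model when a request looks genuinely hard. Below is a sketch with a deliberately naive keyword heuristic; the trigger list and model names (taken from the recommendations above) are illustrative, and production routers typically use a small classifier instead:

```python
# Naive router: default to a fast general-purpose model, escalate to a
# reasoning model only when the request looks like multi-step problem solving.
REASONING_TRIGGERS = ("prove", "derive", "step by step", "optimize", "debug")

def pick_model(user_message: str) -> str:
    text = user_message.lower()
    looks_hard = any(t in text for t in REASONING_TRIGGERS) or len(text) > 2000
    return "o4-mini" if looks_hard else "gpt-4o"

print(pick_model("What's your refund policy?"))               # -> gpt-4o
print(pick_model("Prove this algorithm runs in O(n log n)"))  # -> o4-mini
```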

Instruction-Following Degrades with Reasoning

Analysis of 23 reasoning models reveals widespread inability to follow user constraints, especially on harder problems. Models trained with extended CoT sacrifice controllability for reasoning depth. If your app requires strict compliance with specifications, standard models may outperform reasoning models.
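
If strict compliance matters, it is cheap to measure rather than assume. Below is a sketch of a spot check that sends the same constrained prompt to each candidate model and verifies the output programmatically; the model names, prompt, and exact-three-sentences check are all illustrative:

```python
# Constraint-compliance spot check across candidate models.
from openai import OpenAI

client = OpenAI()
PROMPT = "Explain what a hash table is in exactly three sentences, with no bullet points."

def follows_constraints(text: str) -> bool:
    # Naive checks, for illustration only: sentence count and bullet markers.
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    no_bullets = not any(line.lstrip().startswith(("-", "*", "•")) for line in text.splitlines())
    return len(sentences) == 3 and no_bullets

for model in ("gpt-4o", "o4-mini"):
    compliant = 0
    for _ in range(5):
        out = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT}],
        ).choices[0].message.content or ""
        compliant += follows_constraints(out)
    print(f"{model}: {compliant}/5 compliant")
```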

Open-Weight Can Be More Expensive

DeepSeek-R1 and Qwen3 generate 1.5-4x more tokens than closed models for equivalent reasoning. Lower per-token pricing doesn't always mean lower total cost. Benchmark on your actual workload before assuming open-weight saves money.
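
A back-of-the-envelope comparison makes the point; the prices and token counts below are placeholders, so substitute numbers measured on your own workload:

```python
# Total cost per request = output tokens x per-token price.
def cost_usd(output_tokens: int, price_per_1m_tokens: float) -> float:
    return output_tokens / 1_000_000 * price_per_1m_tokens

closed = cost_usd(output_tokens=3_000, price_per_1m_tokens=8.00)    # terser closed model
open_w = cost_usd(output_tokens=12_000, price_per_1m_tokens=2.50)   # 4x more verbose open model

print(f"closed-weight: ${closed:.4f} per request")  # $0.0240
print(f"open-weight:   ${open_w:.4f} per request")  # $0.0300
```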

Benchmark Saturation is Real

GPQA Diamond approaches saturation at 90%+ accuracy. AIME questions show data-contamination risk: models perform better on 2024 questions than on 2025 questions. Internal evaluation on private, domain-specific problems matters more than public benchmark scores.
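
A private eval does not need heavy infrastructure. Below is a minimal sketch, assuming a local JSONL file of domain-specific questions with known answers; the file name, record format, model names, and exact-match grader are all assumptions to adapt:

```python
# Tiny private eval harness: grade candidate models on held-out,
# domain-specific questions that never appear in public benchmarks.
import json
from openai import OpenAI

client = OpenAI()

def grade(prediction: str, expected: str) -> bool:
    # Naive containment check; swap in a task-appropriate grader.
    return expected.strip().lower() in prediction.strip().lower()

with open("private_eval.jsonl") as f:  # one {"question": ..., "answer": ...} per line
    cases = [json.loads(line) for line in f]

for model in ("gpt-4o", "o4-mini"):
    correct = 0
    for case in cases:
        out = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["question"]}],
        ).choices[0].message.content or ""
        correct += grade(out, case["answer"])
    print(f"{model}: {correct}/{len(cases)}")
```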

Latent Reasoning is the Next Frontier

Current reasoning models burn tokens generating natural-language traces. The future is latent reasoning: internal, compressed representations that preserve the benefits of extended thinking without the token overhead. This could fundamentally alter reasoning-model economics in 2025-2026.