Math · updated April 2026

GSM8K & MATH.

Math reasoning performance, from grade-school word problems (GSM8K) to competition-level problems (MATH). Reasoning models such as o3 and DeepSeek-R1 now score above 97% on both benchmarks.

§ 01 · GSM8K

Grade-school arithmetic, multi-step.

About 8,500 grade-school math word problems, each requiring 2 to 8 arithmetic steps. Chain-of-thought prompting unlocks near-perfect accuracy in frontier models. Accuracy is measured on the 1,319-problem test set.

| # | Model | Provider | Accuracy | Date |
| --- | --- | --- | --- | --- |
| 1 | ERNIE 5.0 | Baidu | 99.7% | Apr 2026 |
| 2 | MiMo-V2.5-Pro | | 99.6% | Apr 2026 |
| 3 | GPT-5 | OpenAI | 99.2% | Apr 2026 |
| 4 | o3 | OpenAI | 99% | Mar 2026 |
| 5 | Gemini 2.5 Pro | Google | 99% | Mar 2026 |
| 6 | o4-mini | OpenAI | 99% | Mar 2026 |
| 7 | Claude 4 | Anthropic | 98.9% | Apr 2026 |
| 8 | Llama 4 Maverick | Meta | 98.7% | Mar 2026 |
| 9 | Claude Opus 4.5 | Anthropic | 98.6% | Mar 2026 |
| 10 | Llama 4 Behemoth 2T | Meta | 98.5% | Apr 2026 |
| 11 | GPT-4.5 | OpenAI | 98.2% | Apr 2026 |
| 12 | Claude Opus 4 | Anthropic | 98% | Mar 2026 |
| 13 | o1-preview | OpenAI | 97.8% | Dec 2025 |
| 14 | o1 | OpenAI | 97.8% | Apr 2026 |
| 15 | Claude Sonnet 4 | Anthropic | 97.8% | Mar 2026 |
| 16 | o1 | OpenAI | 97.8% | Apr 2026 |
| 17 | DeepSeek R1 | DeepSeek | 97.3% | Mar 2026 |
| 18 | Llama 3 (405B, Instruct) | Meta | 96.8% | Jul 2024 |
| 19 | Claude 3.5 Sonnet | Anthropic | 96.4% | Dec 2025 |
| 20 | Qwen2.5-Plus | | 96% | Dec 2024 |
| 21 | DeepSeek-V3 | DeepSeek | 95.8% | Mar 2026 |
| 22 | Qwen2.5-72B-Instruct | Alibaba | 95.8% | Mar 2026 |
| 23 | Qwen2.5-VL-72B | | 95.3% | Feb 2025 |
| 24 | Claude 3 Opus | Anthropic | 95% | Apr 2026 |
| 25 | Claude 3.5 Sonnet | Anthropic | 95% | Apr 2026 |
| 26 | MiniMax-Text-01 | MiniMax | 94.8% | Jan 2025 |
| 27 | MiniCPM-o 4.5-Instruct | | 94.5% | Apr 2026 |
| 28 | Gemini Ultra | Google DeepMind | 94.4% | Apr 2026 |
| 29 | Qwen3-235B-A22B | Alibaba | 94.39% | May 2025 |
| 30 | Llama 3 70B | Meta | 93% | Dec 2025 |
| 31 | GPT-4 | OpenAI | 92% | Apr 2026 |
| 32 | GPT-4 | OpenAI | 92% | Apr 2026 |
| 33 | GPT-4o | OpenAI | 92% | Dec 2025 |
| 34 | Gemini 1.5 Pro | Google | 91.7% | Dec 2025 |
| 35 | Claude 3 Haiku | Anthropic | 88.9% | Apr 2026 |
| 36 | Step-3.5-Flash Base | | 88.2% | Feb 2026 |
| 37 | Mixtral-8x22b | Mistral | 88% | Apr 2026 |
| 38 | HRM-Text-1B | | 84.7% | May 2026 |
| 39 | Apertus-70B-Instruct | | 77.6% | Sep 2025 |
| 40 | PaLM 540B (Self-Consistency) | Google | 74% | Apr 2026 |
| 41 | LLaMA-65B | | 69.7% | Feb 2023 |
| 42 | Chameleon 34B | | 61.4% | May 2024 |
| 43 | BitNet b1.58 2B4T | | 58.38% | Apr 2025 |
| 44 | PaLM 540B (CoT) | Google | 58% | Apr 2026 |
| 45 | Llama 2 70B (5-shot) | | 56.8% | Jul 2023 |
| 46 | Code Llama - Python 34B | | 34.42% | Aug 2023 |
| 47 | SmolLM2 (1.7B) | | 31.1% | Feb 2025 |
| 48 | GPT-3 (base) | OpenAI | 8% | Apr 2026 |

Source: openai/grade-school-math · Chain-of-thought, maj@1.
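
A minimal sketch of how chain-of-thought accuracy with majority voting can be scored on GSM8K. Reference solutions in the dataset end with a `#### <answer>` line; the sketch assumes model completions are prompted to end the same way, which is an assumption about the evaluation harness, not something this page specifies. The function names are hypothetical. With one sample per problem, the vote reduces to `maj@1`.

```python
import re
from collections import Counter

def extract_gsm8k_answer(text):
    """Return the number after the last '####' marker as a float, or None."""
    matches = re.findall(r"####\s*(-?[\d,]*\.?\d+)", text)
    if not matches:
        return None
    # Strip thousands separators such as "1,000" before converting.
    return float(matches[-1].replace(",", ""))

def majority_vote(completions):
    """Most common extracted answer across k sampled completions (maj@k)."""
    answers = [a for a in map(extract_gsm8k_answer, completions) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

def accuracy(samples_per_problem, references):
    """Fraction of problems whose voted answer equals the reference answer."""
    correct = sum(
        majority_vote(samples) == extract_gsm8k_answer(ref)
        for samples, ref in zip(samples_per_problem, references)
    )
    return correct / len(references)

# Two toy problems: three samples for the first, one for the second.
samples = [["... so 16 + 2 = 18\n#### 18", "#### 18", "#### 20"], ["#### 5"]]
references = ["She makes 18 dollars.\n#### 18", "#### 1,000"]
score = accuracy(samples, references)
```

Comparing parsed numbers rather than raw strings avoids penalizing formatting differences such as `18` versus `18.0`.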

§ 02 · MATH

Competition-level, AMC to AIME.

12,500 competition problems rated at difficulty levels 1 to 5, drawn from contests such as AMC and AIME and covering prealgebra, algebra, intermediate algebra, counting and probability, geometry, number theory, and precalculus. MATH-500, a representative 500-problem subset, is the standard evaluation split.

| # | Model | Provider | Accuracy | Date |
| --- | --- | --- | --- | --- |
| 1 | o4-mini (high) | OpenAI | 98.2% | Mar 2026 |
| 2 | o3 (high) | OpenAI | 98.1% | Mar 2026 |
| 3 | o3-mini | OpenAI | 97.9% | Mar 2026 |
| 4 | o3 | OpenAI | 97.8% | Mar 2026 |
| 5 | o4-mini | OpenAI | 97.5% | Mar 2026 |
| 6 | Gemini 2.5 Pro | Google | 97.3% | Mar 2026 |
| 7 | DeepSeek R1 | DeepSeek | 97.3% | Mar 2026 |
| 8 | o1 | OpenAI | 96.4% | Mar 2026 |
| 9 | Claude 3.7 Sonnet | Anthropic | 96.2% | Mar 2026 |
| 10 | Kimi k1.5 | Moonshot AI | 96.2% | Mar 2026 |
| 11 | DeepSeek-R1-Zero | DeepSeek | 95.9% | Mar 2026 |
| 12 | DeepSeek-R1-Distill-Llama-70B | DeepSeek | 94.5% | Mar 2026 |
| 13 | DeepSeek-R1-Distill-Qwen-32B | DeepSeek | 94.3% | Mar 2026 |
| 14 | DeepSeek-v3-0324 | DeepSeek | 94% | Mar 2026 |
| 15 | Claude Opus 4.5 | Anthropic | 90.7% | Mar 2026 |
| 16 | QwQ-32B | Alibaba/Qwen | 90.6% | Mar 2026 |
| 17 | DeepSeek-V3 | DeepSeek | 90.2% | Mar 2026 |
| 18 | o1-mini | OpenAI | 90% | Mar 2026 |
| 19 | Llama 4 Maverick | Meta | 89.4% | Mar 2026 |
| 20 | Claude Opus 4 | Anthropic | 89.2% | Mar 2026 |
| 21 | Claude Sonnet 4 | Anthropic | 88.9% | Mar 2026 |
| 22 | GPT-4.5 Preview | OpenAI | 87.1% | Mar 2026 |
| 23 | o1-preview | OpenAI | 85.5% | Mar 2026 |
| 24 | Qwen2.5-Plus | | 84.7% | Dec 2024 |
| 25 | Qwen2.5-72B-Instruct | Alibaba | 83.1% | Mar 2026 |
| 26 | Qwen2.5-VL-72B | | 83% | Feb 2025 |
| 27 | GPT-4.1 | OpenAI | 82.1% | Mar 2026 |
| 28 | MiniMax-Text-01 | MiniMax | 77.4% | Jan 2025 |
| 29 | GPT-4o | OpenAI | 76.6% | Mar 2026 |
| 30 | Grok 2 | xAI | 76.1% | Mar 2026 |
| 31 | Llama 3 (405B, Instruct) | Meta | 73.8% | Jul 2024 |
| 32 | Llama 3.1 405B | Meta | 73.8% | Mar 2026 |
| 33 | GPT-4 Turbo | OpenAI | 73.4% | Mar 2026 |
| 34 | Qwen3-235B-A22B | Alibaba | 71.84% | May 2025 |
| 35 | Claude 3.5 Sonnet | Anthropic | 71.1% | Mar 2026 |
| 36 | GPT-4o mini | OpenAI | 70.2% | Mar 2026 |
| 37 | Llama 3.1 70B | Meta | 68% | Mar 2026 |
| 38 | Gemini 1.5 Pro | Google | 67.7% | Mar 2026 |
| 39 | Step-3.5-Flash Base | | 66.8% | Feb 2026 |
| 40 | Claude 3 Opus | Anthropic | 60.1% | Mar 2026 |
| 41 | HRM-Text-1B | | 56.5% | May 2026 |
| 42 | Aria | | 50.8% | Oct 2024 |
| 43 | Apertus-70B-Instruct | | 30.8% | Sep 2025 |
| 44 | Chameleon 34B | | 22.5% | May 2024 |
| 45 | LLaMA-65B | | 20.5% | Feb 2023 |
| 46 | SmolLM2 (1.7B) | | 11.6% | Feb 2025 |

Source: hendrycks/math · MATH-500 subset, chain-of-thought.
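
A rough sketch of final-answer scoring for MATH-style problems. Reference solutions in the dataset mark the final answer with `\boxed{...}`; the sketch assumes model completions do the same. The helper names and the light string normalization are illustrative assumptions, not the harness behind the numbers above. Braces are matched by depth so nested LaTeX such as `\frac{1}{2}` is captured whole.

```python
def last_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # unbalanced braces

def normalize(answer):
    """Remove cosmetic differences that do not change the answer."""
    return answer.replace(" ", "").replace("dfrac", "frac").rstrip(".")

def is_correct(completion, reference_solution):
    """Exact match on normalized boxed answers."""
    pred = last_boxed(completion)
    gold = last_boxed(reference_solution)
    if pred is None or gold is None:
        return False
    return normalize(pred) == normalize(gold)

ok = is_correct(r"The answer is \boxed{\dfrac{1}{2}}.", r"Thus $\boxed{\frac{1}{2}}$")
```

String matching of this kind treats equivalent forms such as `0.5` and `\frac{1}{2}` as different, so reported scores can depend on how aggressively a harness normalizes or symbolically compares answers.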

§ 03 · Methodology

Frequently asked.

What is GSM8K?

Grade School Math 8K: about 8,500 word problems requiring 2 to 8 steps of arithmetic reasoning, created by OpenAI in 2021. Chain-of-thought prompting revealed a step-change in model capability. The benchmark is now saturated at the frontier, with the top models at or above 99%.
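
To make "saturated" concrete, the following sketch converts a reported accuracy into a count of missed problems on the 1,319-problem test set and estimates the sampling noise. It assumes each score is computed over the full test set and treats problems as independent trials (a binomial approximation); the function names are hypothetical.

```python
import math

TEST_SET_SIZE = 1319  # GSM8K test problems

def errors_at(accuracy_pct, n=TEST_SET_SIZE):
    """Approximate number of problems missed at a given accuracy."""
    return round(n * (1 - accuracy_pct / 100))

def standard_error_pct(accuracy_pct, n=TEST_SET_SIZE):
    """Binomial standard error of the accuracy, in percentage points."""
    p = accuracy_pct / 100
    return 100 * math.sqrt(p * (1 - p) / n)

missed_top = errors_at(99.7)         # about 4 problems
missed_99 = errors_at(99.0)          # about 13 problems
noise_99 = standard_error_pct(99.0)  # roughly 0.27 points
```

Under these assumptions, a gap of 0.2 points near the top of the table amounts to two or three problems and sits within about one standard error, so small rank differences among frontier models carry little signal.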

What makes MATH harder than GSM8K?

MATH problems require domain knowledge (e.g., modular arithmetic, geometric reasoning), not just arithmetic. Difficulty 5 problems (AIME-level) stump most people. The benchmark was designed to take years to saturate, yet reasoning models like o3 now score above 95%.

Why do reasoning models dominate math benchmarks?

Reasoning models (o3, DeepSeek-R1) use extended chain-of-thought with self-verification before committing to an answer. This search-like inference process is especially effective on math, where step-by-step verification catches errors.
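
A toy illustration of why verification pays off in math, not a description of how o3 or DeepSeek-R1 work internally: when a candidate answer can be checked cheaply against the problem's constraints, wrong candidates are discarded before one is committed. The problem, candidate list, and function names here are all hypothetical.

```python
def first_verified(candidates, check):
    """Return the first candidate that passes the check, or None."""
    for candidate in candidates:
        if check(candidate):
            return candidate
    return None

# Toy problem: "A number doubled, plus 6, equals 20. What is the number?"
def satisfies_problem(x):
    return 2 * x + 6 == 20

# Imagined outputs from three reasoning attempts; the first contains a slip.
candidates = [13, 7, 8]
answer = first_verified(candidates, satisfies_problem)  # 7
```

Without the check, the first attempt's slip (13) would be returned as the final answer; with it, the error is caught by substitution.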
