Commonsense Reasoning · benchmark dataset · 2024 · EN

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark.

A harder version of MMLU with 10-choice multiple-choice questions: roughly 12,000 questions across 14 disciplines. It reduces sensitivity to prompt format and increases reasoning difficulty.
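One concrete effect of moving from four options (original MMLU) to ten is a lower guessing floor. The sketch below computes the expected accuracy of uniform random guessing; it assumes every item carries the full set of ten options, and `chance_accuracy` is an illustrative helper, not part of any official harness.

```python
from fractions import Fraction


def chance_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of uniform random guessing on items with num_options choices."""
    if num_options < 1:
        raise ValueError("an item needs at least one option")
    return Fraction(1, num_options)


# Original MMLU uses 4 options per item; MMLU-Pro uses up to 10.
mmlu_floor = chance_accuracy(4)
mmlu_pro_floor = chance_accuracy(10)
print(f"MMLU guessing floor:     {float(mmlu_floor):.0%}")
print(f"MMLU-Pro guessing floor: {float(mmlu_pro_floor):.0%}")
```

Scores on the two benchmarks are therefore not directly comparable without accounting for the different floors.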

§ 01 · Leaderboard

Best published scores.

20 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.


Primary metric: accuracy · higher is better · 20 rows
#   Model                        Org            Submitted  Paper / code            accuracy
01  Gemini 3.1 Pro [API]         Google         Apr 2026   pricepertoken           90.99
02  Gemini 3 Pro [API]           Google         Apr 2026   pricepertoken           89.80
03  Claude Opus 4.5 [API]        Anthropic      Apr 2026   pricepertoken           89.50
04  Gemini 3 Flash [API]         Google         Apr 2026   pricepertoken           89
05  Qwen3.6 Plus                 Alibaba Cloud  Apr 2026   llm-stats               88.50
06  Claude Opus 4.1              Anthropic      Apr 2026   pricepertoken           88
07  MiniMax M2.1 [API]           MiniMax        Apr 2026   pricepertoken           88
08  Qwen3.5-397B-A17B            Alibaba Cloud  Apr 2026   llm-stats               87.80
09  Claude Sonnet 4.5 [API]      Anthropic      Apr 2026   pricepertoken           87.50
10  GPT-5.2 [API]                OpenAI         Apr 2026   pricepertoken           87.40
11  Kimi K2.5 [API]              Moonshot AI    Apr 2026   llm-stats               87.10
12  GPT-5 [API]                  OpenAI         Apr 2026   pricepertoken           87.10
13  GPT-5.1 [API]                OpenAI         Apr 2026   pricepertoken           87
14  Grok 4 [API]                 xAI            Apr 2026   pricepertoken           86.60
15  DeepSeek V3.2 [API]          DeepSeek       Apr 2026   pricepertoken           86.20
16  Claude 3.7 Sonnet            Anthropic      Apr 2026   anthropic-announcement  85.10
17  DeepSeek-R1-0528 [OSS]       DeepSeek       Apr 2026   llm-stats               85
18  Kimi K2-Thinking-0905 [OSS]  Moonshot AI    Apr 2026   llm-stats               84.60
19  GLM-4.5                      Zhipu AI       Apr 2026   llm-stats               84.60
20  GPT-4o [API]                 OpenAI         Apr 2026   artificial-analysis     72.60
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
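The ordering rule stated for this table (sort by score, break ties by submission date) can be sketched as a two-part sort key. The direction of the tie-break is an assumption here (earlier submission ranks first), and `Row` and `rank` are illustrative names. Because every row carries the same month-level date, tied rows simply keep their input order, since Python's sort is stable.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    accuracy: float
    submitted: tuple  # (year, month), the granularity shown in the table


def rank(rows):
    """Highest accuracy first; on equal accuracy, the earlier submission first (assumed)."""
    return sorted(rows, key=lambda r: (-r.accuracy, r.submitted))


# Three rows taken from the table above, deliberately out of order.
rows = [
    Row("MiniMax M2.1", 88.0, (2026, 4)),
    Row("Gemini 3.1 Pro", 90.99, (2026, 4)),
    Row("Claude Opus 4.1", 88.0, (2026, 4)),
]

ranked = rank(rows)
sota = ranked[0]  # the row a leaderboard would shade
for position, row in enumerate(ranked, start=1):
    print(f"{position:02d}  {row.model:<18} {row.accuracy}")
```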
§ 02 · Example problems

What a problem looks like.

Illustrative items from this benchmark, shown in the exact format the model sees. Sourced from primary distribution — see citation at the bottom of the section.

Multiple choice · MMLU-Pro · Abstract Algebra · ID 0
Difficulty: medium
Tags: math · abstract algebra

The symmetric group $S_n$ has $n!$ elements, hence it is not true that $S_{10}$ has 10 elements. Find the characteristic of the ring 2Z.

  A. 0 (Correct)
  B. 30
  C. 3
  D. 10
  E. 12
  F. 50
  G. 2
  H. 100
  I. 20
  J. 5
Editor: Classic abstract-algebra trap. The ring 2Z (even integers) has characteristic 0 because no positive n satisfies n·a = 0 for all a ∈ 2Z.
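The editor's argument can be written out as a short derivation. This uses the standard definition of ring characteristic; the symbol R is introduced here only for the definition.

```latex
\begin{align*}
\operatorname{char}(R) &=
  \begin{cases}
    \min\{\, n \in \mathbb{Z}_{>0} : n \cdot a = 0 \ \text{for all } a \in R \,\} & \text{if such an } n \text{ exists},\\
    0 & \text{otherwise}.
  \end{cases}\\[4pt]
\text{For } R = 2\mathbb{Z},\ a = 2:\quad n \cdot 2 &= 2n \neq 0 \quad \text{for every } n \geq 1,\\
\text{so no positive } n \text{ works and}\quad \operatorname{char}(2\mathbb{Z}) &= 0.
\end{align*}
```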
Multiple choice · MMLU-Pro · College Mathematics · ID 2
Difficulty: hard
Tags: math · college mathematics

Let A be the set of all ordered pairs of integers (m, n) such that 7m + 12n = 22. What is the greatest negative number in the set B = {m + n : (m, n) ∈ A}?

  A. -5
  B. 0
  C. -3
  D. -7
  E. -4 (Correct)
  F. -6
  G. -1
  H. -2
  I. -9
  J. N/A
Editor: Diophantine parameterisation. Since gcd(7, 12) = 1, the general solution is (m, n) = (22 + 12t, -11 - 7t) for integer t, so m + n = 11 + 5t. This is negative only for t ≤ -3 (t = -2 gives 1), so the greatest negative value is 11 + 5(-3) = -4.
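The parameterisation can be cross-checked by brute force. The sketch below enumerates integer solutions of 7m + 12n = 22 inside a finite window and takes the largest negative value of m + n; the window size `bound` is an arbitrary choice, large enough to contain the answer.

```python
def greatest_negative_sum(bound: int = 200) -> int:
    """Largest negative m + n over integer solutions of 7m + 12n = 22 with |m| <= bound."""
    sums = set()
    for m in range(-bound, bound + 1):
        remainder = 22 - 7 * m
        if remainder % 12 == 0:  # n must be an integer
            n = remainder // 12
            sums.add(m + n)
    return max(s for s in sums if s < 0)


# The closed form from the editor note satisfies the equation for every integer t.
assert all(7 * (22 + 12 * t) + 12 * (-11 - 7 * t) == 22 for t in range(-10, 11))

print(greatest_negative_sum())  # -4, matching option E
```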
Multiple choice · MMLU-Pro · Astronomy · ID 11
Difficulty: easy
Tags: physics · astronomy

Where do most short-period comets come from and how do we know?

  A. The Kuiper belt; short period comets tend to be in the plane of the solar system just like the Kuiper belt. (Correct)
  B. The asteroid belt; short period comets tend to come from random directions indicating a spherical distribution of comets called the asteroid belt.
  C. The asteroid belt; short period comets tend to be in the plane of the solar system just like the asteroid belt.
  D. The Oort cloud; short period comets tend to come from random directions indicating a spherical distribution of comets called the Oort cloud.
  E. The Oort cloud; short period comets tend to be in the plane of the solar system just like the Oort cloud.
  F. The Oort cloud; short period comets have orbital periods similar to asteroids.
  G. The asteroid belt; short period comets have orbital periods similar to asteroids.
  H. The Kuiper belt; short period comets tend to come from random directions indicating a spherical distribution of comets called the Kuiper belt.
  I. The Kuiper belt; short period comets have orbital periods similar to asteroids.
  J. The Oort cloud; short period comets tend to be in the plane of the solar system just like the asteroid belt.
Editor: The Kuiper belt is coplanar with the ecliptic; short-period comet inclinations cluster around the ecliptic, matching Kuiper-belt geometry rather than the spherical Oort cloud.
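Items like the three above can be turned into a prompt and scored with a few lines of code. This is a minimal sketch under stated assumptions: the prompt layout (question, lettered options, a trailing "Answer:") is a hypothetical template, not the official MMLU-Pro evaluation prompt, and `format_item` and `accuracy` are illustrative helpers.

```python
import string


def format_item(question: str, options: list) -> str:
    """Render a question with lettered options (A, B, C, ...) in a hypothetical layout."""
    letters = string.ascii_uppercase[: len(options)]
    lines = [question, ""]
    lines += [f"{letter}. {text}" for letter, text in zip(letters, options)]
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(predictions: list, gold: list) -> float:
    """Percentage of predicted letters that match the gold letters."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equally long")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


prompt = format_item(
    "Find the characteristic of the ring 2Z.",
    ["0", "30", "3", "10", "12", "50", "2", "100", "20", "5"],
)
print(prompt)

# Gold letters for the three items above are A, E, A; the predictions are made up.
print(accuracy(["A", "E", "B"], ["A", "E", "A"]))
```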
§ 03 · Progress

1 step of state of the art.

Each row below marks a model that broke the previous record on accuracy. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · accuracy
  1. Apr 20, 2026 · Gemini 3.1 Pro · Google · 90.99
Fig 3 · SOTA-setting models only. 1 entry, dated Apr 2026.
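The rule behind the progress list (keep only entries that beat the previous record) is a running maximum over submissions in date order. The sketch below assumes strict improvement is required to count as a step; the entries are made-up placeholders, not real leaderboard results, and `sota_steps` is an illustrative name.

```python
from datetime import date


def sota_steps(entries):
    """Return the (date, model, score) entries that strictly beat every earlier score."""
    steps = []
    best = float("-inf")
    for when, model, score in sorted(entries, key=lambda e: e[0]):
        if score > best:
            steps.append((when, model, score))
            best = score
    return steps


# Hypothetical submissions for illustration only.
entries = [
    (date(2025, 6, 1), "model-b", 68.5),
    (date(2025, 1, 10), "model-a", 70.0),
    (date(2025, 9, 15), "model-c", 81.2),
]

for when, model, score in sota_steps(entries):
    print(when.isoformat(), model, score)
```

Here model-b is skipped because it arrived after model-a with a lower score, so only model-a and model-c count as steps.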
§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs:
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with frozen commit + seed
  • 03 · Declared evaluation environment (Python, deps)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
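The checklist above could be checked mechanically before submitting. The sketch below is hypothetical throughout: the `Submission` fields, the `missing_items` helper, and the example values are illustrative and do not reflect an actual Codesota submission schema.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    model_location: str  # 01: public checkpoint URL or API endpoint
    commit: str          # 02: frozen commit of the reproduction script
    seed: int            # 02: random seed used for the run
    environment: dict    # 03: e.g. Python version and pinned dependencies
    scores: dict         # 04: one value per declared metric
    contact: str         # 05: where to send questions about discrepancies


def missing_items(sub: Submission, required_metrics=("accuracy",)) -> list:
    """List the checklist items that a submission has left empty."""
    problems = []
    if not sub.model_location.strip():
        problems.append("01 checkpoint or API endpoint")
    if not sub.commit.strip():
        problems.append("02 frozen commit")
    if not sub.environment:
        problems.append("03 evaluation environment")
    for metric in required_metrics:
        if metric not in sub.scores:
            problems.append(f"04 score for metric '{metric}'")
    if not sub.contact.strip():
        problems.append("05 contact")
    return problems


# Placeholder values; the contact is intentionally left blank.
example = Submission(
    model_location="https://example.com/checkpoint",
    commit="0123abc",
    seed=0,
    environment={"python": "3.11"},
    scores={"accuracy": 0.0},
    contact="",
)
print(missing_items(example))
```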