Commonsense Reasoning · benchmark dataset · 2024 · EN

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark.

The MMLU-Pro dataset contains 12K complex questions across various disciplines, including biology, business, chemistry, computer science, economics, engineering, math, physics, and psychology. Each question has 10 answer options, compared with 4 in the original MMLU, which lowers the random-guess baseline from 25% to 10% and makes the benchmark more challenging. It also integrates more reasoning-focused problems, on which Chain-of-Thought (CoT) prompting can score significantly higher than perplexity-based (PPL) answer selection.
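To make the ten-option format concrete, here is a minimal Python sketch that letters a list of options A through J and computes the random-guess baseline. The function names and the prompt layout ("Question:", "Options:", "Answer:") are assumptions chosen for illustration, not the benchmark's official template.

```python
from string import ascii_uppercase
from typing import Sequence


def format_question(question: str, options: Sequence[str]) -> str:
    """Render a question with lettered options (hypothetical layout)."""
    lines = [f"Question: {question}", "Options:"]
    # zip stops at the shorter input, so 10 options get letters A..J.
    for letter, option in zip(ascii_uppercase, options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing."""
    return 1.0 / num_options


prompt = format_question(
    "Find the characteristic of the ring 2Z.",
    ["0", "30", "3", "10", "12", "50", "2", "100", "20", "5"],
)
print(prompt)
print(chance_accuracy(4), chance_accuracy(10))
```

Going from four to ten options cuts the guessing baseline from 0.25 to 0.1, which is one reason raw scores on the two benchmarks are not directly comparable.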

§ 01 · Leaderboard

Best published scores.

No results indexed yet — be the first to submit a score.


Primary metric: accuracy · higher is better
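As a rough illustration of how an accuracy score could be computed from free-form model responses, the sketch below extracts a letter from each response and compares it with the gold label. The regular expression, the choice to take the last match, and the function names are assumptions for this example; a real harness may parse answers differently.

```python
import re
from typing import Optional, Sequence

# Assumed answer format: "... the answer is (A)" or "... the answer is A".
ANSWER_PATTERN = re.compile(r"answer is \(?([A-J])\)?")


def extract_choice(response: str) -> Optional[str]:
    """Return the last answer letter found in the response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def accuracy(responses: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        return 0.0
    # A response with no parsable answer counts as incorrect.
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


demo_responses = [
    "No positive n works, so the answer is (A).",
    "m + n = 11 + 5t, so the answer is E",
    "I am not sure.",
]
demo_gold = ["A", "E", "A"]
print(accuracy(demo_responses, demo_gold))
```

Counting unparsable responses as wrong is a design choice; it keeps the denominator fixed so that scores stay comparable across models.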
§ 02 · Example problems

What a problem looks like.

Illustrative items from this benchmark, shown in the exact format the model sees. Sourced from the primary distribution; see the citation at the bottom of the section.

Multiple choice · MMLU-Pro · Abstract Algebra · ID 0
medium
math · abstract algebra

The symmetric group $S_n$ has $n!$ elements, hence it is not true that $S_{10}$ has 10 elements. Find the characteristic of the ring 2Z.

  1. A. 0 (Correct)
  2. B. 30
  3. C. 3
  4. D. 10
  5. E. 12
  6. F. 50
  7. G. 2
  8. H. 100
  9. I. 20
  10. J. 5
Editor: Classic abstract-algebra trap. The ring 2Z (even integers) has characteristic 0 because no positive n satisfies n·a = 0 for all a ∈ 2Z.
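The argument can be written out as a short worked derivation. It uses the standard definition of characteristic for a ring that need not have a multiplicative identity (2Z has none); the symbol R is introduced here for illustration.

```latex
\[
\operatorname{char}(R) =
\begin{cases}
\min\{\, n \in \mathbb{Z}_{>0} : n \cdot a = 0 \ \text{for all } a \in R \,\} & \text{if such an } n \text{ exists},\\
0 & \text{otherwise.}
\end{cases}
\]
\[
\text{For } R = 2\mathbb{Z} \text{ and } a = 2:\quad n \cdot 2 = 2n > 0 \ \text{ for every } n \ge 1,
\quad\text{so no such } n \text{ exists and } \operatorname{char}(2\mathbb{Z}) = 0 .
\]
```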
Multiple choice · MMLU-Pro · College Mathematics · ID 2
hard
math · college mathematics

Let A be the set of all ordered pairs of integers (m, n) such that 7m + 12n = 22. What is the greatest negative number in the set B = {m + n : (m, n) ∈ A}?

  1. A. -5
  2. B. 0
  3. C. -3
  4. D. -7
  5. E. -4 (Correct)
  6. F. -6
  7. G. -1
  8. H. -2
  9. I. -9
  10. J. N/A
Editor: Diophantine parameterisation. Since gcd(7, 12) = 1, the general solution is (m, n) = (22 + 12t, -11 - 7t) for integer t, so m + n = 11 + 5t. This is negative only for t ≤ -3, and the greatest negative value occurs at t = -3, giving -4.
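The result can be cross-checked by brute force. The sketch below enumerates integer solutions of 7m + 12n = 22 over a finite window of m values (the window size is an arbitrary assumption, wide enough to contain the sums nearest zero) and picks the greatest negative m + n.

```python
def greatest_negative_sum(bound: int = 200) -> int:
    """Largest negative m + n over integer solutions of 7m + 12n = 22
    with |m| <= bound."""
    sums = []
    for m in range(-bound, bound + 1):
        remainder = 22 - 7 * m
        # n is an integer only when 12 divides 22 - 7m.
        if remainder % 12 == 0:
            n = remainder // 12
            sums.append(m + n)
    return max(s for s in sums if s < 0)


result = greatest_negative_sum()
print(result)  # -4, reached at (m, n) = (-14, 10)
```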
Multiple choice · MMLU-Pro · Astronomy · ID 11
easy
physics · astronomy

Where do most short-period comets come from and how do we know?

  1. A. The Kuiper belt; short period comets tend to be in the plane of the solar system just like the Kuiper belt. (Correct)
  2. B. The asteroid belt; short period comets tend to come from random directions indicating a spherical distribution of comets called the asteroid belt.
  3. C. The asteroid belt; short period comets tend to be in the plane of the solar system just like the asteroid belt.
  4. D. The Oort cloud; short period comets tend to come from random directions indicating a spherical distribution of comets called the Oort cloud.
  5. E. The Oort cloud; short period comets tend to be in the plane of the solar system just like the Oort cloud.
  6. F. The Oort cloud; short period comets have orbital periods similar to asteroids.
  7. G. The asteroid belt; short period comets have orbital periods similar to asteroids.
  8. H. The Kuiper belt; short period comets tend to come from random directions indicating a spherical distribution of comets called the Kuiper belt.
  9. I. The Kuiper belt; short period comets have orbital periods similar to asteroids.
  10. J. The Oort cloud; short period comets tend to be in the plane of the solar system just like the asteroid belt.
Editor: The Kuiper belt is coplanar with the ecliptic; short-period comet inclinations cluster around the ecliptic, matching Kuiper-belt geometry rather than the spherical Oort cloud.
§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it and publish the score; if it takes the top spot, we will annotate the step on the progress chart with your name.

What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with frozen commit + seed
  • 03 · Declared evaluation environment (Python, deps)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies