MATH is a benchmark dataset of challenging competition-level mathematics problems introduced by Hendrycks et al. (NeurIPS Datasets & Benchmarks track; arXiv:2103.03874). It contains roughly 12,500 problems drawn from mathematics competitions, each annotated with a full step-by-step solution (written in LaTeX and natural language) and a final answer. Problems are organized by subject (e.g., algebra, counting & probability, geometry, number theory, precalculus) and by difficulty level; public conversions commonly distribute them as a ~12,000-example training set plus a 500-example test set. MATH is intended for evaluating and training models on mathematical problem solving and derivation generation, and it is widely used as a benchmark for LLM math reasoning.
Accuracy is the reported evaluation metric for MATH; higher is better. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
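The accuracy metric above can be sketched in a few lines. Note that the answer normalization here is purely illustrative: published MATH evaluations typically use more elaborate LaTeX-equivalence checks (unwrapping `\boxed{}`, normalizing fractions, spacing, and units), so treat `normalize` as a hypothetical placeholder rather than the official scoring code.

```python
# Minimal sketch of exact-match accuracy on MATH-style final answers.
# `normalize` is a simplified, assumed heuristic, not the official checker.

def normalize(ans: str) -> str:
    """Lightly normalize a final-answer string (illustrative heuristic)."""
    ans = ans.strip()
    # Unwrap a \boxed{...} final answer if present.
    if ans.startswith(r"\boxed{") and ans.endswith("}"):
        ans = ans[len(r"\boxed{"):-1]
    # Drop internal spaces and a trailing period.
    return ans.replace(" ", "").rstrip(".")

def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions whose normalized answer matches the reference."""
    correct = sum(
        normalize(p) == normalize(r)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)
```

For example, `accuracy([r"\boxed{42}", "1/2"], ["42", "2/3"])` scores the first answer correct and the second wrong, giving 0.5.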
| Rank | Model | Trust | Score | Year | Source |
|---|---|---|---|---|---|
| 01 | Qwen2.5-Plus | paper | 84.7 | N/A | Source ↗ |