SVAMP is a set of 1,000 elementary-level math word problems that test the robustness of arithmetic reasoning.
Accuracy is the reported evaluation metric for SVAMP. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
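As a rough illustration of the metric, accuracy can be read as the percentage of problems whose final numeric answer matches the reference answer. The sketch below assumes answers are compared as numbers within a small tolerance and that a missing prediction counts as wrong. The function name `accuracy`, the `tol` parameter, and the toy data are hypothetical, and the scoring rules behind the published results may differ.

```python
import math


def accuracy(predictions, references, tol=1e-6):
    """Return the percentage of predictions that match the reference answers.

    Assumption: answers are numeric and a prediction of None (no answer
    extracted) is scored as incorrect.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty problem set")

    correct = sum(
        1
        for pred, ref in zip(predictions, references)
        if pred is not None and math.isclose(pred, ref, abs_tol=tol)
    )
    return 100.0 * correct / len(references)


# Hypothetical toy data, not real SVAMP items or model outputs.
refs = [8.0, 15.0, 3.0, 42.0]
preds = [8.0, 14.0, 3.0, None]
score = accuracy(preds, refs)
print(f"accuracy: {score:.1f}")
```

On a 1,000-problem set, each additional correct answer moves this score by 0.1 points.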
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | gpt-4o | paper | 93.7 | 2025 | Source ↗ |
| 02 | claude-35-sonnet | paper | 91.2 | 2025 | Source ↗ |
| 03 | Claude 3.5 Sonnet | paper | 91.2 | 2025 | Source ↗ |
| 04 | llama-3-70b | paper | 89.5 | 2025 | Source ↗ |
| 05 | Llama 3 70B | unverified | 89.5 | 2025 | Source ↗ |