The full ARC dataset has 7,787 multiple-choice science questions that require reasoning. ARC-Challenge is its harder subset: the questions that retrieval-based methods answer incorrectly.
Accuracy is the reported evaluation metric for ARC-Challenge. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
| Rank | Model | Trust | Accuracy (%) | Year |
|---|---|---|---|---|
| 01 | o3 | verified | 98.1 | 2026 |
| 02 | Gemini 2.5 Pro | verified | 97.8 | 2026 |
| 03 | Llama-4-Maverick | verified | 97.4 | 2026 |
| 04 | o4-mini | verified | 97.3 | 2026 |
| 05 | DeepSeek R1 | verified | 97.1 | 2026 |
| 06 | Llama 3.1 405B | verified | 96.9 | 2026 |
| 07 | claude-35-sonnet | paper | 96.7 | 2025 |
| 08 | Claude 3.5 Sonnet | unverified | 96.7 | 2025 |
| 09 | gpt-4o | paper | 96.4 | 2025 |
| 10 | Gemini 1.5 Pro | unverified | 94.8 | 2025 |
| 11 | gemini-15-pro | paper | 94.8 | 2025 |
| 12 | Llama 3 70B | unverified | 93.0 | 2025 |
| 13 | llama-3-70b | paper | 93.0 | 2025 |
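
For readers who want to sanity-check a reported number, the accuracy metric used above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring code of Codesota or of any particular evaluation harness: the function name `accuracy` and the toy answer lists are hypothetical, and it assumes plain exact-match scoring with one predicted answer key per question and no partial credit.

```python
def accuracy(predictions, gold_labels):
    """Return the percentage of questions whose predicted key matches the gold key.

    Assumption: exact-match scoring, one predicted answer key per question.
    """
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have the same length")
    if not gold_labels:
        raise ValueError("cannot compute accuracy over zero questions")
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return 100.0 * correct / len(gold_labels)


# Toy data (hypothetical): four questions, three answered correctly.
gold = ["A", "C", "B", "D"]
pred = ["A", "C", "D", "D"]
score = accuracy(pred, gold)
print(f"{score:.1f}")  # prints 75.0
```

Scores produced under different prompting or scoring conventions are not strictly comparable, which is one reason the table keeps a separate Trust column.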