ARC-AGI-1 is a benchmark of 400 evaluation tasks testing abstract visual reasoning, created by François Chollet. Scores near the human average (~85%) remained out of reach for LLMs until 2024.
Accuracy is the reported evaluation metric for ARC-AGI-1. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
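As a rough illustration of how an accuracy figure like the ones listed here can be computed, the sketch below counts a task as solved only when the predicted output grid matches the target grid exactly, then reports the solved share as a percentage. The function name `accuracy_percent`, the `Grid` alias, and the toy grids are hypothetical and are not taken from Codesota or the official ARC tooling. Published scores may also allow several attempts per task, which this sketch does not model.

```python
from typing import List

# Assumption: a grid is a list of rows, each row a list of integer color codes.
Grid = List[List[int]]


def accuracy_percent(predictions: List[Grid], targets: List[Grid]) -> float:
    """Return the percentage of tasks whose predicted grid equals the target grid."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one task is required")
    # Exact match: every cell and the grid shape must agree.
    solved = sum(1 for pred, target in zip(predictions, targets) if pred == target)
    return 100.0 * solved / len(targets)


# Hypothetical toy data: three tasks, two of them solved.
targets = [[[1, 0], [0, 1]], [[2, 2]], [[3], [3]]]
predictions = [[[1, 0], [0, 1]], [[2, 0]], [[3], [3]]]
print(round(accuracy_percent(predictions, targets), 1))  # 66.7
```

Because the metric is a plain percentage of solved tasks, a larger value always means more tasks solved, which is why the table ranks in descending order.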
| Rank | Model | Trust | Score (%) | Year |
|---|---|---|---|---|
| 01 | o3 | verified | 87.5 | 2026 |
| 02 | o3 (high) | verified | 87.5 | 2026 |
| 03 | o4-mini | verified | 79 | 2026 |
| 04 | Gemini 2.5 Pro | verified | 56.1 | 2026 |
| 05 | Claude 3.7 Sonnet | verified | 30 | 2026 |