Who leads the ARC-Challenge benchmark?

o3 currently leads ARC-Challenge with an accuracy of 98.1.

What is the state-of-the-art score on ARC-Challenge?

The state-of-the-art result on ARC-Challenge is 98.1 (accuracy), achieved by o3 as of 2026.

How many models are tracked on ARC-Challenge?

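Codesota tracks 13 results on ARC-Challenge, covering 10 distinct models. Claude 3.5 Sonnet, Gemini 1.5 Pro, and Llama 3 70B each appear twice, under different trust tiers.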

When was the ARC-Challenge leaderboard last updated?

The ARC-Challenge leaderboard on Codesota includes results through 2026, with the earliest tracked result from 2025.

Codesota · Benchmark · ARC-Challenge


ARC-Challenge.

Name: ARC-Challenge Benchmark Results
Creator: Unknown
Published: 2025-01-01
License: https://creativecommons.org/licenses/by/4.0/

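The ARC dataset contains 7,787 multiple-choice, grade-school science questions that require reasoning. The Challenge set (2,590 questions) contains the harder questions: those that both a retrieval-based baseline and a word co-occurrence baseline answer incorrectly.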


§ 01 · SOTA history

Year over year.
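The year-over-year SOTA history can be derived from the leaderboard rows themselves. The sketch below is a minimal illustration, not Codesota's implementation: it assumes each row reduces to a hypothetical `Result(model, year, score)` record and treats the SOTA for a year as the best score seen in that year or any earlier one (a running maximum). `Result`, `ROWS`, and `sota_history` are names introduced here for illustration, and `ROWS` is only a subset of the table.

```python
from typing import List, NamedTuple, Optional, Tuple


class Result(NamedTuple):
    """Hypothetical record for one leaderboard row."""
    model: str
    year: int
    score: float  # accuracy; higher is better


# Illustrative subset of leaderboard rows, not the full table.
ROWS = [
    Result("o3", 2026, 98.1),
    Result("Gemini 2.5 Pro", 2026, 97.8),
    Result("Claude 3.5 Sonnet", 2025, 96.7),
    Result("GPT-4o", 2025, 96.4),
    Result("Llama 3 70B", 2025, 93.0),
]


def sota_history(rows: List[Result]) -> List[Tuple[int, str, float]]:
    """For each year, the best result seen so far (running maximum over years)."""
    history = []
    best: Optional[Result] = None
    for year in sorted({r.year for r in rows}):
        # Best result published in this particular year.
        top = max((r for r in rows if r.year == year), key=lambda r: r.score)
        # Carry the previous SOTA forward unless this year beats it.
        if best is None or top.score > best.score:
            best = top
        history.append((year, best.model, best.score))
    return history


print(sota_history(ROWS))
# [(2025, 'Claude 3.5 Sonnet', 96.7), (2026, 'o3', 98.1)]
```

Using a running maximum means a year with no improvement still reports the previous record holder, which matches how a SOTA-over-time curve is usually drawn.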

§ 02 · Leaderboard

Results by metric.


accuracy

Accuracy, the percentage of questions answered correctly, is the reported evaluation metric for ARC-Challenge. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
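As a concrete reading of the metric, here is a minimal Python sketch of plain exact-match accuracy. It assumes one predicted answer label per question (for example "A" to "D") compared against an answer key; the function name `accuracy` and the sample labels are hypothetical, and individual sources may apply their own scoring conventions.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Exact-match accuracy in percent: share of questions whose predicted label equals the key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical predicted choice labels vs. answer key for four questions.
print(accuracy(["A", "C", "B", "D"], ["A", "C", "D", "D"]))  # 75.0
```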

Trust tiers for accuracy: verified, paper, vendor, community, unverified.

Muted rows were not state of the art when published: an earlier or same-year result had already scored higher.
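The muting rule above can be expressed as a simple check over the rows. This is a sketch under stated assumptions, not the site's code: it uses a hypothetical `Result(model, year, score)` record and a hypothetical `muted_flags` helper, and it reads "scored better" as strictly higher, so tied scores do not mute each other.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    """Hypothetical record for one leaderboard row."""
    model: str
    year: int
    score: float  # accuracy; higher is better


def muted_flags(rows: List[Result]) -> List[bool]:
    """True where another result from the same or an earlier year scored strictly higher."""
    return [
        any(other.year <= row.year and other.score > row.score for other in rows)
        for row in rows
    ]


# Illustrative subset of leaderboard rows.
rows = [
    Result("o3", 2026, 98.1),
    Result("Gemini 2.5 Pro", 2026, 97.8),
    Result("Claude 3.5 Sonnet", 2025, 96.7),
    Result("GPT-4o", 2025, 96.4),
]
print(muted_flags(rows))  # [False, True, False, True]
```

The pairwise comparison is quadratic in the number of rows, which is negligible for a leaderboard of this size. Whether the site treats exact ties (such as the two 96.7 entries) the same way is an assumption here.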

Rank	Model	Notes	Trust	Score	Year
01	o3	0-shot. Source: OpenAI simple-evals (2025).	verified	98.1	2026
02	Gemini 2.5 Pro	0-shot CoT. Source: Gemini 2.5 Pro technical report (April 2025).	verified	97.8	2026
03	Llama 4 Maverick	0-shot. Source: Meta Llama 4 blog post (April 2025).	verified	97.4	2026
04	o4-mini	0-shot. Source: OpenAI simple-evals (2025).	verified	97.3	2026
05	DeepSeek R1	0-shot. Source: DeepSeek-R1 paper, Table 3, arXiv:2501.12948 (Jan 2025).	verified	97.1	2026
06	Llama 3.1 405B Instruct	Source: official Meta model card evaluation.	verified	96.9	2026
07	Claude 3.5 Sonnet		paper	96.7	2025
08	Claude 3.5 Sonnet		unverified	96.7	2025
09	GPT-4o	Grade-school science questions (challenge set).	paper	96.4	2025
10	Gemini 1.5 Pro		unverified	94.8	2025
11	Gemini 1.5 Pro		paper	94.8	2025
12	Llama 3 70B		unverified	93.0	2025
13	Llama 3 70B		paper	93.0	2025
