Codesota · Benchmark · ARC-Challenge

ARC-Challenge.

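ARC (the AI2 Reasoning Challenge) contains 7,787 grade-school science questions that require reasoning. The Challenge set holds the harder questions, the ones that retrieval-based methods fail to answer correctly.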

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.


accuracy

Accuracy is the reported evaluation metric for ARC-Challenge. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
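As a rough illustration of the metric, accuracy on a multiple-choice benchmark can be computed as the percentage of questions where the predicted answer label matches the gold label. The sketch below is not Codesota's scoring code; the function name `accuracy` and the sample labels are hypothetical, and individual sources may differ in prompting (0-shot, chain-of-thought) and answer extraction.

```python
def accuracy(predictions, gold):
    """Return the percentage of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("no questions to score")
    # Count exact matches between predicted and gold answer labels.
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


# Hypothetical answer labels for four questions (illustrative only).
gold = ["A", "C", "B", "D"]
predictions = ["A", "C", "D", "D"]
print(f"{accuracy(predictions, gold):.1f}")  # 3 of 4 correct -> 75.0
```

Scores in the table are on this 0 to 100 scale, so a value such as 96.4 means roughly 96 of every 100 questions were answered correctly.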

Trust tiers for accuracy: verified, paper, vendor, community, unverified
| Rank | Model | Trust | Score | Year | Notes |
|---|---|---|---|---|---|
| 01 | o3 | verified | 98.1 | 2026 | 0-shot. Source: OpenAI simple-evals (2025). |
| 02 | Gemini 2.5 Pro | verified | 97.8 | 2026 | 0-shot CoT. Source: Gemini 2.5 Pro technical report (April 2025). |
| 03 | Llama-4-Maverick | verified | 97.4 | 2026 | 0-shot. Source: Meta Llama 4 blog post (April 2025). |
| 04 | o4-mini | verified | 97.3 | 2026 | 0-shot. Source: OpenAI simple-evals (2025). |
| 05 | DeepSeek R1 | verified | 97.1 | 2026 | 0-shot. Source: DeepSeek-R1 paper Table 3, arxiv:2501.12948 (Jan 2025). |
| 06 | Llama 3.1 405B | verified | 96.9 | 2026 | Llama 3.1 405B Instruct. Official Meta model card evaluation. |
| 07 | claude-35-sonnet | paper | 96.7 | 2025 | |
| 08 | Claude 3.5 Sonnet | unverified | 96.7 | 2025 | |
| 09 | gpt-4o | paper | 96.4 | 2025 | Grade-school science questions (challenge set). |
| 10 | Gemini 1.5 Pro | unverified | 94.8 | 2025 | |
| 11 | gemini-15-pro | paper | 94.8 | 2025 | |
| 12 | Llama 3 70B | unverified | 93 | 2025 | |
| 13 | llama-3-70b | paper | 93 | 2025 | |