Codesota · Models · o3 · OpenAI · 19 results · 18 benchmarks

Model card

o3.

OpenAI · proprietary · 5 current SOTA

§ 02 · Benchmarks

Every benchmark o3 has a recorded score for.

#	Benchmark	Area · Task	Metric	Value	Rank	Date	Source
01	MMLU	Reasoning · Commonsense Reasoning	accuracy	92.9%	#1/64	2025-04-16	source ↗
02	RE-Bench	Agentic AI · RE-Bench	normalized-score	0.4	#1/5	2025-04-01	source ↗
03	AIME 2024	Reasoning · Mathematical Reasoning	accuracy	96.7%	#1/11	—	source ↗
04	ARC-AGI-1	Reasoning · Logical Reasoning	accuracy	87.5%	#1/5	—	source ↗
05	ARC-Challenge	Reasoning · Commonsense Reasoning	accuracy	98.1%	#1/10	—	source ↗
06	HCAST	Agentic AI · HCAST	success-rate	49.0%	#2/6	2025-04-01	source ↗
07	METR Time Horizon	Agentic AI · Time Horizon	task-horizon-minutes	30.0 min	#2/5	2025-04-01	source ↗
08	ARC-AGI-2	Reasoning · Logical Reasoning	accuracy	4.0%	#2/3	—	source ↗
09	GSM8K	Reasoning · Mathematical Reasoning	accuracy	99.0%	#4/48	—	source ↗
10	MATH	Reasoning · Mathematical Reasoning	accuracy	97.8%	#4/46	—	source ↗
11	HumanEval	Computer Code · Code Generation	pass@1	94.8%	#5/42	2025-04-01	source ↗
12	LiveCodeBench Pro	Computer Code · Code Generation	elo	1010.00	#10/10	—	source ↗
13	LiveCodeBench	Computer Code · Code Generation	pass@1	65.3%	#11/30	2024-03-12	source ↗
14	AIME 2025	Reasoning · Mathematical Reasoning	accuracy	86.7%	#12/22	—	source ↗
15	SWE-Bench Verified	Computer Code · Code Generation	resolve-rate	69.1%	#21/39	—	source ↗
16	HumanEval	Computer Code · Code Generation	pass@1	87.4%	#26/42	—	source ↗
17	GPQA Diamond	Reasoning · Multi-step Reasoning	accuracy	82.8%	#28/74	—	source ↗
18	HLE	Reasoning · Multi-step Reasoning	accuracy	20.3%	#36/74	—	source ↗
19	SWE-bench Verified	Agentic AI · SWE-bench	resolve-rate	69.1%	#44/81	—	source ↗

The Rank column shows this model’s position among all models scored on the same benchmark and metric; the number after the slash is the total number of models scored. A rank of #1 (shown in red) marks the current SOTA. Rows are sorted by rank, then by newest result.
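The rank notation and sort order described in the note above can be sketched in Python. This is only an illustration, not Codesota's actual code: the helper names `parse_rank` and `sort_key` and the row dictionaries are hypothetical, and placing undated rows after dated ones within the same rank is an assumption inferred from the table order.

```python
import re
from datetime import date

# Matches rank cells such as "#1/64": position, then number of models scored.
RANK_RE = re.compile(r"#(\d+)/(\d+)")


def parse_rank(cell):
    """Split a rank cell like '#1/64' into (position, models_scored)."""
    match = RANK_RE.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"not a rank cell: {cell!r}")
    return int(match.group(1)), int(match.group(2))


def sort_key(row):
    """Order by rank position, then newest date; undated rows go last (assumed)."""
    position, _ = parse_rank(row["rank"])
    result_date = row["date"]
    undated = result_date is None
    # Negating the ordinal makes newer dates sort first.
    recency = 0 if undated else -result_date.toordinal()
    return (position, undated, recency)


# A few rows copied from the table above (hypothetical in-memory shape).
rows = [
    {"benchmark": "HCAST", "rank": "#2/6", "date": date(2025, 4, 1)},
    {"benchmark": "ARC-AGI-1", "rank": "#1/5", "date": None},
    {"benchmark": "MMLU", "rank": "#1/64", "date": date(2025, 4, 16)},
    {"benchmark": "RE-Bench", "rank": "#1/5", "date": date(2025, 4, 1)},
]

ordered = [row["benchmark"] for row in sorted(rows, key=sort_key)]
sota = [row["benchmark"] for row in rows if parse_rank(row["rank"])[0] == 1]
```

Keeping the position and the field size as separate integers makes it easy to tell a #1 out of 5 from a #1 out of 64, which the bare rank alone hides.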

§ 03 · Strengths by area

Where o3 actually performs.

Reasoning · 10 benchmarks · avg rank #9.0 · 4 SOTA

Agentic AI · 4 benchmarks · avg rank #12.3 · 1 SOTA

Computer Code · 5 benchmarks · avg rank #14.6
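The per-area figures appear consistent with a plain mean of the rank positions in the benchmark table. The sketch below recomputes them under that assumption; the `RANKS` list is copied by hand from the table, `summarize` is a hypothetical helper, and the half-up rounding is a guess made so that 12.25 displays as 12.3 as on the page.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP

# (area, rank position) for each of the 19 rows in the benchmark table.
RANKS = [
    ("Reasoning", 1), ("Agentic AI", 1), ("Reasoning", 1), ("Reasoning", 1),
    ("Reasoning", 1), ("Agentic AI", 2), ("Agentic AI", 2), ("Reasoning", 2),
    ("Reasoning", 4), ("Reasoning", 4), ("Computer Code", 5),
    ("Computer Code", 10), ("Computer Code", 11), ("Reasoning", 12),
    ("Computer Code", 21), ("Computer Code", 26), ("Reasoning", 28),
    ("Reasoning", 36), ("Agentic AI", 44),
]


def summarize(ranks):
    """Group rank positions by area and report count, mean rank, and #1 count."""
    by_area = defaultdict(list)
    for area, position in ranks:
        by_area[area].append(position)

    summary = {}
    for area, positions in by_area.items():
        mean = Decimal(sum(positions)) / Decimal(len(positions))
        summary[area] = {
            "benchmarks": len(positions),
            # Assumed rounding mode: half-up to one decimal place.
            "avg_rank": str(mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)),
            "sota": positions.count(1),
        }
    return summary


summary = summarize(RANKS)
for area, stats in summary.items():
    print(f"{area}: {stats['benchmarks']} benchmarks, "
          f"avg rank #{stats['avg_rank']}, {stats['sota']} SOTA")
```

`Decimal` is used instead of float formatting because Python's default float rounding would print the Agentic AI mean of 12.25 as 12.2.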

§ 04 · Papers

2 papers with results for o3.

2025-04-01 · Agentic AI · 3 results
METR: Measuring Autonomy in AI Systems (2025 Update)

2024-03-12 · Computer Code · 1 result
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

§ 05 · Related models

Other OpenAI models scored on Codesota.

GPT-4o

Undisclosed params · 38 results · 9 SOTA

§ 06 · Sources & freshness

Where these numbers come from.

Multiple results: openai-simple-evals, official-leaderboard, openai-system-card, arcprize-leaderboard

Single result: arxiv, shadow-page-humaneval, livecodebench-pro-official, openai-blog, scale-hle-official, editorial

15 of 19 rows marked verified · first result 2024-03-12, latest 2025-04-16.
