Codesota · Models · o4-mini · OpenAI · 16 results · 16 benchmarks

Model card

o4-mini.

OpenAI · proprietary · 2 current SOTA

§ 02 · Benchmarks

Every benchmark o4-mini has a recorded score for.

#	Benchmark	Area · Task	Metric	Value	Rank	Date	Source
01	HumanEval	Computer Code · Code Generation	pass@1	97.3%	#1/42	—	source ↗
02	MBPP	Computer Code · Code Generation	pass@1	94.9%	#1/19	—	source ↗
03	AIME 2024	Reasoning · Mathematical Reasoning	accuracy	93.4%	#2/11	—	source ↗
04	ARC-AGI-1	Reasoning · Logical Reasoning	accuracy	79.0%	#3/5	—	source ↗
05	ARC-AGI-2	Reasoning · Logical Reasoning	accuracy	3.0%	#3/3	—	source ↗
06	ARC-Challenge	Reasoning · Commonsense Reasoning	accuracy	97.3%	#4/10	—	source ↗
07	GSM8K	Reasoning · Mathematical Reasoning	accuracy	99.0%	#4/48	—	source ↗
08	LiveCodeBench Pro	Computer Code · Code Generation	elo	2092.00	#4/10	—	source ↗
09	MATH	Reasoning · Mathematical Reasoning	accuracy	97.5%	#5/46	—	source ↗
10	LiveCodeBench	Computer Code · Code Generation	pass@1	72.8%	#7/30	2024-03-12	source ↗
11	AIME 2025	Reasoning · Mathematical Reasoning	accuracy	92.7%	#8/22	—	source ↗
12	MMLU	Reasoning · Commonsense Reasoning	accuracy	90.0%	#15/64	2025-04-16	source ↗
13	SWE-bench Verified	Computer Code · Code Generation	resolve-rate	68.1%	#22/39	—	source ↗
14	GPQA Diamond	Reasoning · Multi-step Reasoning	accuracy	77.6%	#34/74	—	source ↗
15	HLE	Reasoning · Multi-step Reasoning	accuracy	18.1%	#42/74	—	source ↗
16	SWE-bench Verified	Agentic AI · SWE-bench	resolve-rate	68.1%	#45/81	—	source ↗

The Rank column shows this model's position among all models scored on the same benchmark and metric; the number after the slash is how many models were scored. #1 means current SOTA. Rows are sorted by rank, then by newest result.
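The rank notation and sort order described above can be sketched in a few lines of Python. This is only an illustration of how a reader might parse the table, not Codesota's own code: the names `Rank`, `parse_rank`, `is_sota` and `sort_rows` are hypothetical, and placing undated rows last among ties is an assumption, since the page does not say how they are ordered.

```python
import re
from typing import NamedTuple, Optional


class Rank(NamedTuple):
    position: int  # this model's place on the benchmark
    field: int     # number of models scored on the same benchmark + metric


def parse_rank(cell: str) -> Rank:
    """Parse a Rank cell such as '#4/48' into (position, field)."""
    match = re.fullmatch(r"#(\d+)/(\d+)", cell.strip())
    if match is None:
        raise ValueError(f"unrecognised rank cell: {cell!r}")
    return Rank(int(match.group(1)), int(match.group(2)))


def is_sota(cell: str) -> bool:
    """A rank of #1 marks the current SOTA."""
    return parse_rank(cell).position == 1


Row = tuple[str, str, Optional[str]]  # (benchmark, rank cell, ISO date or None)


def sort_rows(rows: list[Row]) -> list[Row]:
    """Sort by rank, then newest result.

    Two stable passes: newest date first (assumption: undated rows go last),
    then ascending rank position.
    """
    by_date = sorted(rows, key=lambda r: r[2] or "", reverse=True)
    return sorted(by_date, key=lambda r: parse_rank(r[1]).position)


# A few rows copied from the table above ('—' dates represented as None).
rows: list[Row] = [
    ("MMLU", "#15/64", "2025-04-16"),
    ("LiveCodeBench", "#7/30", "2024-03-12"),
    ("HumanEval", "#1/42", None),
    ("ARC-AGI-2", "#3/3", None),
]

for name, rank, date in sort_rows(rows):
    flag = " (SOTA)" if is_sota(rank) else ""
    print(f"{rank:>7}  {name}{flag}  {date or '—'}")
```

Note that a rank such as #3/3 still parses cleanly: the model is last of three, which is why the field size is worth keeping next to the position.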

§ 03 · Strengths by area

Where o4-mini actually performs.

Computer Code

benchmarks

avg rank #7.0 · 2 SOTA
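One reading of the summary line that is consistent with the table: take the rank positions of the five rows whose area is Computer Code, average them, and count the #1 entries. The sketch below checks that arithmetic. The function name `area_summary` is hypothetical, and the assumption that Codesota computes a plain arithmetic mean is mine, not something the page states.

```python
# Rank positions of the rows listed under "Computer Code" in the Benchmarks table.
computer_code_ranks = {
    "HumanEval": 1,
    "MBPP": 1,
    "LiveCodeBench Pro": 4,
    "LiveCodeBench": 7,
    "SWE-bench Verified": 22,
}


def area_summary(ranks: dict[str, int]) -> tuple[float, int]:
    """Return (mean rank position, number of #1 results) for one area."""
    positions = list(ranks.values())
    avg_rank = sum(positions) / len(positions)  # plain arithmetic mean (assumed)
    sota_count = sum(1 for p in positions if p == 1)
    return round(avg_rank, 1), sota_count


avg_rank, sota_count = area_summary(computer_code_ranks)
print(f"avg rank #{avg_rank:.1f} · {sota_count} SOTA")
```

Here (1 + 1 + 4 + 7 + 22) / 5 = 7.0, with two #1 results, which matches the figures shown for this area.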

§ 04 · Papers

1 paper with results for o4-mini.

2024-03-12 · Computer Code · 1 result
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

§ 05 · Related models

Other OpenAI models scored on Codesota.

GPT-4o

Undisclosed params · 38 results · 9 SOTA

§ 06 · Sources & freshness

Where these numbers come from.

openai-simple-evals

results

openai-system-card

results

arcprize-leaderboard

results

official-model-card

result

livecodebench-pro-official

result

official-leaderboard

result

swebench-leaderboard

result

scale-hle-official

result

editorial

result

12 of 16 rows marked verified · first result 2024-03-12, latest 2025-04-16.
