o3-mini.

OpenAI · API

§ 01 · Benchmarks

Every benchmark o3-mini has a recorded score for.

#	Benchmark	Area · Task	Metric	Value	Rank	Date	Source
01	HumanEval	Computer Code · Code Generation	pass@1	96.3%	#2/42	—	source ↗
02	MBPP	Computer Code · Code Generation	pass@1	93.3%	#2/19	—	source ↗
03	MATH	Reasoning · Mathematical Reasoning	accuracy	97.9%	#3/34	—	source ↗
04	LiveCodeBench	Computer Code · Code Generation	pass@1	66.9%	#9/30	2024-03-12	source ↗
05	GPQA	Reasoning · Multi-step Reasoning	accuracy	74.9%	#13/33	—	source ↗
06	SWE-bench Verified	Computer Code · Code Generation	resolve-rate	55.8%	#30/39	—	source ↗
07	MMLU	Reasoning · Commonsense Reasoning	accuracy	85.9%	#35/41	—	source ↗
08	SWE-bench Verified	Agentic AI · SWE-bench	resolve-rate	49.3%	#66/81	—	source ↗

The Rank column shows this model’s position among all models scored on the same benchmark and metric; the number after the slash is the number of competitors. A #1 rank (shown in red) marks the current state of the art (SOTA). Rows are sorted by rank, then by newest result.
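As a rough illustration of that ordering, the sketch below parses rank strings such as `#2/42` and sorts rows by rank position, then by newest date. The `Row` type, the `parse_rank` and `sort_key` helpers, and the choice to place undated rows after dated ones on a tie are all assumptions made for this example, not a description of how the page itself is generated.

```python
import re
from datetime import date
from typing import NamedTuple, Optional


class Row(NamedTuple):
    """Hypothetical container for one table row (names are illustrative)."""
    benchmark: str
    rank: str                 # e.g. "#2/42"
    recorded: Optional[date]  # None when the table shows "—"


def parse_rank(rank: str) -> tuple[int, int]:
    """Split "#2/42" into (position, number after the slash)."""
    match = re.fullmatch(r"#(\d+)/(\d+)", rank)
    if match is None:
        raise ValueError(f"unrecognised rank string: {rank!r}")
    return int(match.group(1)), int(match.group(2))


def sort_key(row: Row) -> tuple[int, bool, int]:
    """Order by rank position, then newest date first.

    Assumption: rows without a date sort after dated rows on a rank tie.
    """
    position, _ = parse_rank(row.rank)
    recency = -row.recorded.toordinal() if row.recorded else 0
    return (position, row.recorded is None, recency)


# A few rows copied from the table above, deliberately out of order.
rows = [
    Row("MMLU", "#35/41", None),
    Row("LiveCodeBench", "#9/30", date(2024, 3, 12)),
    Row("HumanEval", "#2/42", None),
    Row("MATH", "#3/34", None),
]

ordered = [row.benchmark for row in sorted(rows, key=sort_key)]
```

Here `ordered` lists the benchmarks in the same relative order as the table: HumanEval, MATH, LiveCodeBench, MMLU.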

§ 02 · Strengths by area