Codesota · Models · GPT-4o (OpenAI) · 57 results · 46 benchmarks
Model card

GPT-4o

OpenAI · API · Undisclosed params · Multimodal LLM · Proprietary · 10 current SOTA

Flagship GPT-4-class model with omni modality. Released May 2024.

§ 01 · Benchmarks

Every benchmark GPT-4o has a recorded score for.

| # | Benchmark | Area · Task | Metric | Value | Rank | Date | Source |
|---|---|---|---|---|---|---|---|
| 01 | videodb's-ocr-benchmark-public-collection | Computer Vision · Optical Character Recognition | accuracy | 76.2% | #1/5 | 2025-02-10 | source ↗ |
| 02 | SNLI | Natural Language Processing · Natural Language Inference | accuracy | 92.6% | #1/8 | 2023-03-15 | source ↗ |
| 03 | SQuAD v2.0 | Natural Language Processing · Question Answering | f1 | 91.4% | #1/22 | 2023-03-15 | source ↗ |
| 04 | BixBench | Agentic AI · Bioinformatics Agents | accuracy | 17.0% | #1/2 | | source ↗ |
| 05 | Bugs2Fix | Computer Code · Bug Detection | accuracy | 78.6% | #1/6 | | source ↗ |
| 06 | CodeSearchNet | Computer Vision · Optical Character Recognition | bleu-4 | 25.3% | #1/7 | | source ↗ |
| 07 | CommonsenseQA | Reasoning · Commonsense Reasoning | accuracy | 85.4% | #1/3 | | source ↗ |
| 08 | HellaSwag | Reasoning · Commonsense Reasoning | accuracy | 95.3% | #1/5 | | source ↗ |
| 09 | HotpotQA | Reasoning · Multi-step Reasoning | f1 | 71.3% | #1/2 | | source ↗ |
| 10 | LogiQA | Reasoning · Logical Reasoning | accuracy | 56.3% | #1/2 | | source ↗ |
| 11 | MAWPS | Reasoning · Arithmetic Reasoning | accuracy | 97.2% | #1/3 | | source ↗ |
| 12 | OmniDocBench | Computer Vision · Document Parsing | ocr-edit-distance | 0.0% | #1/1 | | source ↗ |
| 13 | ReClor | Reasoning · Logical Reasoning | accuracy | 72.4% | #1/2 | | source ↗ |
| 14 | SVAMP | Reasoning · Arithmetic Reasoning | accuracy | 93.7% | #1/3 | | source ↗ |
| 15 | StrategyQA | Reasoning · Multi-step Reasoning | accuracy | 82.1% | #1/2 | | source ↗ |
| 16 | WinoGrande | Reasoning · Commonsense Reasoning | accuracy | 87.5% | #1/3 | | source ↗ |
| 17 | videodb's-ocr-benchmark-public-collection | Computer Vision · Optical Character Recognition | wer | 0.5% | #2/5 | 2025-02-10 | source ↗ |
| 18 | TransCoder (GeeksForGeeks) | Computer Code · Code Translation | computational-accuracy | 88.2% | #2/7 | 2024-06-17 | source ↗ |
| 19 | CNN/DailyMail | Natural Language Processing · Text Summarization | rouge-l | 43.4% | #2/6 | 2023-03-15 | source ↗ |
| 20 | CNN/DailyMail | Natural Language Processing · Text Summarization | rouge-1 | 46.3% | #2/6 | 2023-03-15 | source ↗ |
| 21 | CNN/DailyMail | Natural Language Processing · Text Summarization | rouge-2 | 22.1% | #2/3 | 2023-03-15 | source ↗ |
| 22 | SQuAD v2.0 | Natural Language Processing · Question Answering | em | 87.1% | #2/2 | 2023-03-15 | source ↗ |
| 23 | CC-OCR | Computer Vision · General OCR Capabilities | multilingual-f1 | 73.4% | #2/8 | | source ↗ |
| 24 | Defects4J | Computer Code · Program Repair | correct-patches | 82.0% | #3/5 | 2024-04-18 | source ↗ |
| 25 | CoNLL-2003 | Natural Language Processing · Named Entity Recognition | f1 | 91.7% | #3/7 | 2023-03-15 | source ↗ |
| 26 | SuperGLUE | Natural Language Processing · Text Classification | average-score | 90.3% | #3/7 | 2023-03-15 | source ↗ |
| 27 | CC-OCR | Computer Vision · General OCR Capabilities | document-parsing | 53.3% | #3/6 | | source ↗ |
| 28 | KITAB-Bench | Computer Vision · Optical Character Recognition | cer | 0.3% | #3/14 | | source ↗ |
| 29 | CrossCodeEval | Computer Code · Code Completion | exact-match | 38.2% | #4/6 | 2023-10-17 | source ↗ |
| 30 | CC-OCR | Computer Vision · General OCR Capabilities | multi-scene-f1 | 76.4% | #4/9 | | source ↗ |
| 31 | CC-OCR | Computer Vision · General OCR Capabilities | kie-f1 | 63.5% | #4/5 | | source ↗ |
| 32 | MME-VideoOCR | Computer Vision · General OCR Capabilities | total-accuracy | 66.4% | #4/6 | | source ↗ |
| 33 | videodb's-ocr-benchmark-public-collection | Computer Vision · Optical Character Recognition | cer | 0.2% | #5/5 | 2025-02-10 | source ↗ |
| 34 | MMBench | Multimodal · Visual Question Answering | accuracy | 83.4% | #5/8 | 2024-10-25 | source ↗ |
| 35 | olmOCR-Bench | Computer Vision · Document Parsing | old-scans | 40.7% | #5/5 | | source ↗ |
| 36 | VQA v2.0 | Multimodal · Visual Question Answering | accuracy | 78.5% | #6/7 | 2024-10-25 | source ↗ |
| 37 | OCRBench v2 | Computer Vision · General OCR Capabilities | overall-en-private | 55.5% | #6/27 | 2024-05-13 | source ↗ |
| 38 | TextVQA | Multimodal · Visual Question Answering | accuracy | 77.4% | #7/9 | 2024-10-25 | source ↗ |
| 39 | AIME 2024 | Reasoning · Mathematical Reasoning | accuracy | 13.4% | #8/8 | | source ↗ |
| 40 | ARC-Challenge | Reasoning · Commonsense Reasoning | accuracy | 96.4% | #8/10 | | source ↗ |
| 41 | Tau2-Bench | Agentic AI · Tool Use | pass_rate | 36.0% | #8/8 | | unverified |
| 42 | MMMU | Multimodal · Visual Question Answering | accuracy | 69.1% | #11/18 | 2024-10-25 | source ↗ |
| 43 | MBPP | Computer Code · Code Generation | pass@1 | 87.8% | #11/19 | | source ↗ |
| 44 | HLE | Reasoning · Multi-step Reasoning | accuracy | 2.7% | #13/13 | | unverified |
| 45 | HumanEval | Computer Code · Code Generation | pass@1 | 91.0% | #15/42 | | source ↗ |
| 46 | HumanEval | Computer Code · Code Generation | pass@1 | 90.2% | #17/42 | 2024-05-01 | source ↗ |
| 47 | SWE-Bench | Computer Code · Code Generation | resolve-rate-agentic | 38.4% | #19/25 | 2024-11-01 | unverified |
| 48 | MMLU-Pro | Reasoning · Commonsense Reasoning | accuracy | 72.6% | #20/20 | 2026-04-20 | source ↗ |
| 49 | GSM8K | Reasoning · Mathematical Reasoning | accuracy | 92.0% | #24/32 | | source ↗ |
| 50 | LiveCodeBench | Computer Code · Code Generation | pass@1 | 40.8% | #25/30 | 2024-03-12 | source ↗ |
| 51 | MATH | Reasoning · Mathematical Reasoning | accuracy | 76.6% | #26/34 | | source ↗ |
| 52 | MMLU | Reasoning · Commonsense Reasoning | accuracy | 87.2% | #27/41 | | source ↗ |
| 53 | SWE-Bench | Computer Code · Code Generation | resolve-rate | 19.0% | #28/32 | 2024-06-01 | source ↗ |
| 54 | GPQA | Reasoning · Multi-step Reasoning | accuracy | 49.9% | #28/33 | | source ↗ |
| 55 | OmniDocBench | Computer Vision · Document Parsing | composite | 75.0% | #29/33 | | source ↗ |
| 56 | SWE-Bench Verified | Computer Code · Code Generation | resolve-rate | 41.2% | #37/39 | | source ↗ |
| 57 | SWE-bench Verified | Agentic AI · SWE-bench | resolve-rate | 33.2% | #77/81 | | source ↗ |
The Rank column shows this model's position versus all other models scored on the same benchmark + metric (total competitors after the slash). Rank #1 means current SOTA. Rows are sorted by rank, then by newest result.
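The site does not publish its ranking code, but the rule in the legend above is easy to sketch: group results by (benchmark, metric), sort by score, and call position #1 SOTA. A minimal sketch with hypothetical field names; the set of lower-is-better metrics is an assumption inferred from the error-rate metrics (wer, cer, ocr-edit-distance) in the table:

```python
from collections import defaultdict

# Metrics where a lower score is better (assumed from the table above).
LOWER_IS_BETTER = {"wer", "cer", "ocr-edit-distance"}

def rank_results(rows):
    """rows: list of dicts with 'benchmark', 'metric', 'model', 'value'.
    Returns each row annotated with its rank, field size, and SOTA flag."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["benchmark"], r["metric"])].append(r)
    ranked = []
    for (bench, metric), rs in groups.items():
        # Higher is better unless the metric is an error rate.
        rs.sort(key=lambda r: r["value"], reverse=metric not in LOWER_IS_BETTER)
        for pos, r in enumerate(rs, start=1):
            ranked.append({**r, "rank": pos, "of": len(rs), "sota": pos == 1})
    return ranked

rows = [
    {"benchmark": "HellaSwag", "metric": "accuracy", "model": "GPT-4o", "value": 95.3},
    {"benchmark": "HellaSwag", "metric": "accuracy", "model": "other", "value": 94.0},
]
print(rank_results(rows)[0])  # GPT-4o: rank 1 of 2 -> SOTA
```

With real data the "Rank" cell is then just `f"#{r['rank']}/{r['of']}"`.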
§ 02 · Strengths by area

How GPT-4o performs across task areas.

Reasoning · 17 benchmarks · avg rank #9.6 · 9 SOTA
Computer Vision · 8 benchmarks · avg rank #5.0 · 1 SOTA
Natural Language Processing · 5 benchmarks · avg rank #2.0
Multimodal · 4 benchmarks · avg rank #7.3
Computer Code · 9 benchmarks · avg rank #14.7
Agentic AI · 3 benchmarks · avg rank #28.7
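The per-area rollup above (distinct benchmarks, average rank, SOTA count) can be sketched from the benchmark table's rows. Field names are hypothetical; the aggregation rule is assumed from the figures shown:

```python
from collections import defaultdict

def area_summary(results):
    """results: list of dicts with 'area', 'benchmark', 'rank' (int position).
    Returns per-area benchmark count, mean rank, and number of #1 results."""
    by_area = defaultdict(list)
    for r in results:
        by_area[r["area"]].append(r)
    summary = {}
    for area, rs in by_area.items():
        summary[area] = {
            "benchmarks": len({r["benchmark"] for r in rs}),  # distinct benchmarks
            "avg_rank": round(sum(r["rank"] for r in rs) / len(rs), 1),
            "sota": sum(1 for r in rs if r["rank"] == 1),
        }
    return summary

results = [
    {"area": "Reasoning", "benchmark": "HellaSwag", "rank": 1},
    {"area": "Reasoning", "benchmark": "GSM8K", "rank": 24},
]
print(area_summary(results))
# {'Reasoning': {'benchmarks': 2, 'avg_rank': 12.5, 'sota': 1}}
```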
§ 03 · Papers

9 papers with results for GPT-4o.

  1. 2025-02-28 · Agentic AI · 1 result

     BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology

     Ludovico Mitchener, Jon M Laurent, Alex Andonian, Benjamin Tenmann et al.
  2. 2025-02-10 · Computer Vision · 3 results

     Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments

  3. 2024-10-25 · Multimodal · 4 results

     SWE-bench Verified

  4. 2024-06-17 · Computer Code · 1 result

     DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

  5. 2024-04-18 · Computer Code · 1 result

     SRepair: Utilizing Multiple LLM Agents for Automated Program Repair

  6. 2024-03-12 · Computer Code · 1 result

     LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

  7. 2023-10-17 · Computer Code · 1 result

     CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion

  8. 2023-10-10 · Computer Code · 1 result

     SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

     Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao et al.
  9. 2023-03-15 · Natural Language Processing · 8 results

     GPT-4 Technical Report

§ 04 · Related models

Other OpenAI models scored on Codesota.

o3 · 16 results · 5 SOTA
o4-mini · 13 results · 3 SOTA
o3 (high) · 2 results · 1 SOTA
o4-mini (high) · 1 result · 1 SOTA
o1 · 11 results
GPT-5 · 8 results
o1-preview · Undisclosed params · 8 results
GPT-4.1 · 7 results
§ 05 · Sources & freshness

Where these numbers come from.

arxiv · 17 results
openai-blog · 7 results
alphaxiv-leaderboard · 7 results
arxiv-paper · 6 results
openai-simple-evals · 4 results
papers-with-code · 3 results
editorial · 3 results
research-paper · 1 result
cc-ocr-paper · 1 result
github-readme · 1 result
shadow-page-humaneval · 1 result
agentless · 1 result
artificial-analysis · 1 result
official-leaderboard · 1 result
sota-timeline · 1 result
OmniDocBench GitHub · 1 result
swebench-leaderboard · 1 result
25 of 57 rows are marked verified · first result 2023-03-15, latest 2026-04-20.
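The freshness line above is a simple reduction over the result rows: count the verified ones and take the earliest and latest dates. A sketch with hypothetical field names (ISO date strings sort correctly as plain strings):

```python
def freshness(rows):
    """rows: list of dicts with 'verified' (bool) and optional 'date' (ISO string).
    Returns the verified count and the first/latest result dates."""
    verified = sum(1 for r in rows if r["verified"])
    dates = sorted(r["date"] for r in rows if r.get("date"))
    return {"verified": verified, "total": len(rows),
            "first": dates[0], "latest": dates[-1]}

rows = [
    {"verified": True, "date": "2023-03-15"},
    {"verified": False, "date": "2026-04-20"},
]
print(freshness(rows))
# {'verified': 1, 'total': 2, 'first': '2023-03-15', 'latest': '2026-04-20'}
```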