GPT-4o · OpenAI · 57 results · 45 benchmarks
Model card

GPT-4o.

OpenAI · API · Undisclosed params · Multimodal LLM · Proprietary · 10 current SOTA

Flagship GPT-4 class model with omni modality. Released May 2024.

§ 02 · Benchmarks

Every benchmark GPT-4o has a recorded score for.

| # | Benchmark | Area · Task | Metric | Value | Rank | Date | Source |
|---|---|---|---|---|---|---|---|
| 01 | videodb's-ocr-benchmark-public-collection | Computer Vision · Optical Character Recognition | accuracy | 76.2% | #1/5 | 2025-02-10 | source |
| 02 | SNLI | Natural Language Processing · Natural Language Inference | accuracy | 92.6% | #1/8 | 2023-03-15 | source |
| 03 | BixBench | Agentic AI · Bioinformatics Agents | accuracy | 17.0% | #1/2 | | source |
| 04 | Bugs2Fix | Computer Code · Bug Detection | accuracy | 78.6% | #1/6 | | source |
| 05 | CodeSearchNet | Computer Vision · Optical Character Recognition | bleu-4 | 25.3% | #1/7 | | source |
| 06 | CommonsenseQA | Reasoning · Commonsense Reasoning | accuracy | 85.4% | #1/5 | | source |
| 07 | HellaSwag | Reasoning · Commonsense Reasoning | accuracy | 95.3% | #1/17 | | source |
| 08 | HotpotQA | Natural Language Processing · Question Answering | f1 | 71.3% | #1/2 | | source |
| 09 | LogiQA | Reasoning · Logical Reasoning | accuracy | 56.3% | #1/2 | | source |
| 10 | MAWPS | Reasoning · Arithmetic Reasoning | accuracy | 97.2% | #1/3 | | source |
| 11 | OmniDocBench | Computer Vision · Document Parsing | ocr-edit-distance | 0.0% | #1/1 | | source |
| 12 | ReClor | Reasoning · Logical Reasoning | accuracy | 72.4% | #1/2 | | source |
| 13 | SVAMP | Reasoning · Arithmetic Reasoning | accuracy | 93.7% | #1/3 | | source |
| 14 | StrategyQA | Reasoning · Multi-step Reasoning | accuracy | 82.1% | #1/2 | | source |
| 15 | WinoGrande | Reasoning · Commonsense Reasoning | accuracy | 87.5% | #1/13 | | source |
| 16 | videodb's-ocr-benchmark-public-collection | Computer Vision · Optical Character Recognition | wer | 0.5% | #2/5 | 2025-02-10 | source |
| 17 | TransCoder (GeeksForGeeks) | Computer Code · Code Translation | computational-accuracy | 88.2% | #2/7 | 2024-06-17 | source |
| 18 | CNN/DailyMail | Natural Language Processing · Text Summarization | rouge-l | 43.4% | #2/7 | 2023-03-15 | source |
| 19 | CNN/DailyMail | Natural Language Processing · Text Summarization | rouge-2 | 22.1% | #2/3 | 2023-03-15 | source |
| 20 | CNN/DailyMail | Natural Language Processing · Text Summarization | rouge-1 | 46.3% | #2/6 | 2023-03-15 | source |
| 21 | SQuAD v2.0 | Natural Language Processing · Question Answering | em | 87.1% | #2/2 | 2023-03-15 | source |
| 22 | SQuAD v2.0 | Natural Language Processing · Question Answering | f1 | 91.4% | #2/26 | 2023-03-15 | source |
| 23 | CC-OCR | Computer Vision · General OCR Capabilities | multilingual-f1 | 73.4% | #2/8 | | source |
| 24 | Defects4J | Computer Code · Program Repair | correct-patches | 82.0% | #3/5 | 2024-04-18 | source |
| 25 | CoNLL-2003 | Natural Language Processing · Named Entity Recognition | f1 | 91.7% | #3/7 | 2023-03-15 | source |
| 26 | SuperGLUE | Natural Language Processing · Text classification | average-score | 90.3% | #3/7 | 2023-03-15 | source |
| 27 | CC-OCR | Computer Vision · General OCR Capabilities | document-parsing | 53.3% | #3/6 | | source |
| 28 | KITAB-Bench | Computer Vision · Optical Character Recognition | cer | 0.3% | #3/14 | | source |
| 29 | CrossCodeEval | Computer Code · Code Completion | exact-match | 38.2% | #4/6 | 2023-10-17 | source |
| 30 | CC-OCR | Computer Vision · General OCR Capabilities | kie-f1 | 63.5% | #4/5 | | source |
| 31 | CC-OCR | Computer Vision · General OCR Capabilities | multi-scene-f1 | 76.4% | #4/9 | | source |
| 32 | MME-VideoOCR | Computer Vision · General OCR Capabilities | total-accuracy | 66.4% | #4/6 | | source |
| 33 | videodb's-ocr-benchmark-public-collection | Computer Vision · Optical Character Recognition | cer | 0.2% | #5/5 | 2025-02-10 | source |
| 34 | olmOCR-Bench | Computer Vision · Document Parsing | old-scans | 40.7% | #5/5 | | source |
| 35 | OCRBench v2 | Computer Vision · General OCR Capabilities | overall-en-private | 55.5% | #6/27 | 2024-05-13 | source |
| 36 | ARC-Challenge | Reasoning · Commonsense Reasoning | accuracy | 96.4% | #8/10 | | source |
| 37 | Tau2-Bench | Agentic AI · Tool Use | pass_rate | 36.0% | #8/8 | | unverified |
| 38 | VQA v2.0 | Multimodal · Visual Question Answering | accuracy | 78.5% | #11/16 | 2024-10-25 | source |
| 39 | AIME 2024 | Reasoning · Mathematical Reasoning | accuracy | 13.4% | #11/11 | | source |
| 40 | MBPP | Computer Code · Code Generation | pass@1 | 87.8% | #11/19 | | source |
| 41 | MMBench | Multimodal · Visual Question Answering | accuracy | 83.4% | #13/20 | 2024-10-25 | source |
| 42 | MMMU | Multimodal · Visual Question Answering | accuracy | 69.1% | #14/30 | 2024-10-25 | source |
| 43 | HumanEval | Computer Code · Code Generation | pass@1 | 91.0% | #15/42 | | source |
| 44 | TextVQA | Multimodal · Visual Question Answering | accuracy | 77.4% | #17/23 | 2024-10-25 | source |
| 45 | HumanEval | Computer Code · Code Generation | pass@1 | 90.2% | #17/42 | 2024-05-01 | source |
| 46 | SWE-bench | Computer Code · Code Generation | resolve-rate-agentic | 38.4% | #19/25 | 2024-11-01 | unverified |
| 47 | LiveCodeBench | Computer Code · Code Generation | pass@1 | 40.8% | #25/30 | 2024-03-12 | source |
| 48 | SWE-bench | Computer Code · Code Generation | resolve-rate | 19.0% | #28/32 | 2024-06-01 | source |
| 49 | MATH | Reasoning · Mathematical Reasoning | accuracy | 76.6% | #29/46 | | source |
| 50 | OmniDocBench | Computer Vision · Document Parsing | composite | 75.0% | #30/34 | | source |
| 51 | GSM8K | Reasoning · Mathematical Reasoning | accuracy | 92.0% | #31/48 | | source |
| 52 | MMLU | Reasoning · Commonsense Reasoning | accuracy | 87.2% | #33/64 | | source |
| 53 | SWE-Bench Verified | Computer Code · Code Generation | resolve-rate | 41.2% | #37/39 | | source |
| 54 | GPQA Diamond | Reasoning · Multi-step Reasoning | accuracy | 49.9% | #64/74 | | source |
| 55 | HLE | Reasoning · Multi-step Reasoning | accuracy | 2.7% | #73/74 | | source |
| 56 | HLE | Reasoning · Multi-step Reasoning | accuracy | 2.7% | #74/74 | | unverified |
| 57 | SWE-bench Verified | Agentic AI · SWE-bench | resolve-rate | 33.2% | #77/81 | | source |

The Rank column shows this model's position among all models scored on the same benchmark and metric; the number after the slash is the total number of models scored, including this one. #1 in red means current SOTA. Rows are sorted by rank, then by newest result.

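The rank notation and sort order described above can be sketched in a few lines of Python. This is an illustration only, not Codesota's implementation: the `Row` type, `parse_rank`, and `sort_key` are hypothetical names, and placing undated rows last within each rank group is an assumption inferred from the order of the table, not something the page states.

```python
from datetime import date
from typing import NamedTuple, Optional


class Row(NamedTuple):
    benchmark: str
    metric: str
    rank: str             # e.g. "#2/26"
    day: Optional[date]   # None when the table shows no date


def parse_rank(rank: str) -> tuple[int, int]:
    """Split a rank string such as '#2/26' into (position, models_scored)."""
    position, total = rank.lstrip("#").split("/")
    return int(position), int(total)


def sort_key(row: Row):
    position, _ = parse_rank(row.rank)
    # Assumption: within a rank group, dated rows come first (newest first)
    # and undated rows trail the group.
    undated = row.day is None
    recency = -row.day.toordinal() if row.day else 0
    return (position, undated, recency)


# A handful of rows copied from the table above.
rows = [
    Row("SQuAD v2.0", "f1", "#2/26", date(2023, 3, 15)),
    Row("CC-OCR", "multilingual-f1", "#2/8", None),
    Row("SNLI", "accuracy", "#1/8", date(2023, 3, 15)),
    Row("TransCoder (GeeksForGeeks)", "computational-accuracy", "#2/7", date(2024, 6, 17)),
    Row("BixBench", "accuracy", "#1/2", None),
]

ordered = sorted(rows, key=sort_key)
for row in ordered:
    print(row.rank, row.benchmark, row.day or "")
```

For these five rows the sketch reproduces the relative order in which they appear in the table: SNLI, BixBench, TransCoder, SQuAD v2.0, CC-OCR.
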
§ 03 · Strengths by area

Where GPT-4o actually performs.

Reasoning · 15 benchmarks · avg rank #20.7 · 8 SOTA
Natural Language Processing · 5 benchmarks · avg rank #1.9 · 1 SOTA
Computer Vision · 8 benchmarks · avg rank #5.1 · 1 SOTA
Natural Language Processing · 1 benchmark · avg rank #3.0
Multimodal · 4 benchmarks · avg rank #13.8
Computer Code · 9 benchmarks · avg rank #14.7
Agentic AI · 3 benchmarks · avg rank #28.7

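
The per-area averages appear to be plain arithmetic means of the rank positions in the benchmark table. The sketch below checks this for two areas using ranks copied from that table: Agentic AI (#1, #8, #77) and Multimodal (#11, #13, #14, #17). The `average_rank` helper is a hypothetical name, and the averaging rule is an inference from the numbers, not a documented Codesota formula.

```python
from statistics import mean

# Rank positions taken from the benchmark table, grouped by area.
ranks_by_area = {
    "Agentic AI": [1, 8, 77],          # BixBench, Tau2-Bench, SWE-bench Verified
    "Multimodal": [11, 13, 14, 17],    # VQA v2.0, MMBench, MMMU, TextVQA
}


def average_rank(positions):
    """Arithmetic mean of rank positions (assumed aggregation rule)."""
    return mean(positions)


for area, positions in ranks_by_area.items():
    print(f"{area}: avg rank #{average_rank(positions):.1f}")
```

Both results round to the figures listed for those areas (#28.7 and #13.8).
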
§ 04 · Papers

9 papers with results for GPT-4o.

  1. 2025-02-28· Agentic AI· 1 result

    BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology

    Ludovico Mitchener, Jon M Laurent, Alex Andonian, Benjamin Tenmann et al.
  2. 2025-02-10· Computer Vision· 3 results

    Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments

  3. 2024-10-25· Multimodal· 4 results

    SWE-bench Verified

  4. 2024-06-17· Computer Code· 1 result

    DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

  5. 2024-04-18· Computer Code· 1 result

    SRepair: Utilizing Multiple LLM Agents for Automated Program Repair

  6. 2024-03-12· Computer Code· 1 result

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

  7. 2023-10-17· Computer Code· 1 result

    CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion

  8. 2023-10-10· Computer Code· 1 result

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao et al.
  9. 2023-03-15· Natural Language Processing· 8 results

    GPT-4 Technical Report

§ 05 · Related models

Other OpenAI models scored on Codesota.

o3 · 17 results · 5 SOTA
o4-mini · 14 results · 2 SOTA
o3 (high) · 2 results · 1 SOTA
Codex / GPT-5.5 · 1 result · 1 SOTA
Codex CLI (GPT-5.5) · 1 result · 1 SOTA
o4-mini (high) · 1 result · 1 SOTA
o1 · 12 results
GPT-4.1 · 8 results

§ 06 · Sources & freshness

Where these numbers come from.

arxiv · 17 results
openai-blog · 7 results
alphaxiv-leaderboard · 7 results
arxiv-paper · 6 results
openai-simple-evals · 4 results
papers-with-code · 3 results
editorial · 3 results
research-paper · 1 result
cc-ocr-paper · 1 result
github-readme · 1 result
shadow-page-humaneval · 1 result
agentless · 1 result
official-leaderboard · 1 result
sota-timeline · 1 result
OmniDocBench GitHub · 1 result
swebench-leaderboard · 1 result
scale-hle-official · 1 result

26 of 57 rows marked verified · first result 2023-03-15, latest 2025-02-10.