Codesota · Models · GPT-4o (OpenAI) · 57 results · 46 benchmarks
Model card

GPT-4o

OpenAI · API · Undisclosed params · Multimodal LLM · Proprietary · 10 current SOTA

Flagship GPT-4-class model with omni modality. Released May 2024.

§ 01 · Benchmarks

Every benchmark GPT-4o has a recorded score for.

| # | Benchmark | Area · Task | Metric | Value | Rank | Date | Source |
|---|---|---|---|---|---|---|---|
| 01 | videodb's-ocr-benchmark-public-collection | Computer Vision · Optical Character Recognition | accuracy | 76.2% | #1/5 | 2025-02-10 | source ↗ |
| 02 | SNLI | Natural Language Processing · Natural Language Inference | accuracy | 92.6% | #1/8 | 2023-03-15 | source ↗ |
| 03 | SQuAD v2.0 | Natural Language Processing · Question Answering | f1 | 91.4% | #1/22 | 2023-03-15 | source ↗ |
| 04 | BixBench | Agentic AI · Bioinformatics Agents | accuracy | 17.0% | #1/2 | | source ↗ |
| 05 | Bugs2Fix | Computer Code · Bug Detection | accuracy | 78.6% | #1/6 | | source ↗ |
| 06 | CodeSearchNet | Computer Vision · Optical Character Recognition | bleu-4 | 25.3% | #1/7 | | source ↗ |
| 07 | CommonsenseQA | Reasoning · Commonsense Reasoning | accuracy | 85.4% | #1/3 | | source ↗ |
| 08 | HellaSwag | Reasoning · Commonsense Reasoning | accuracy | 95.3% | #1/5 | | source ↗ |
| 09 | HotpotQA | Reasoning · Multi-step Reasoning | f1 | 71.3% | #1/2 | | source ↗ |
| 10 | LogiQA | Reasoning · Logical Reasoning | accuracy | 56.3% | #1/2 | | source ↗ |
| 11 | MAWPS | Reasoning · Arithmetic Reasoning | accuracy | 97.2% | #1/3 | | source ↗ |
| 12 | OmniDocBench | Computer Vision · Document Parsing | ocr-edit-distance | 0.0% | #1/1 | | source ↗ |
| 13 | ReClor | Reasoning · Logical Reasoning | accuracy | 72.4% | #1/2 | | source ↗ |
| 14 | SVAMP | Reasoning · Arithmetic Reasoning | accuracy | 93.7% | #1/3 | | source ↗ |
| 15 | StrategyQA | Reasoning · Multi-step Reasoning | accuracy | 82.1% | #1/2 | | source ↗ |
| 16 | WinoGrande | Reasoning · Commonsense Reasoning | accuracy | 87.5% | #1/3 | | source ↗ |
| 17 | videodb's-ocr-benchmark-public-collection | Computer Vision · Optical Character Recognition | wer | 0.5% | #2/5 | 2025-02-10 | source ↗ |
| 18 | TransCoder (GeeksForGeeks) | Computer Code · Code Translation | computational-accuracy | 88.2% | #2/7 | 2024-06-17 | source ↗ |
| 19 | CNN/DailyMail | Natural Language Processing · Text Summarization | rouge-l | 43.4% | #2/6 | 2023-03-15 | source ↗ |
| 20 | CNN/DailyMail | Natural Language Processing · Text Summarization | rouge-1 | 46.3% | #2/6 | 2023-03-15 | source ↗ |
| 21 | CNN/DailyMail | Natural Language Processing · Text Summarization | rouge-2 | 22.1% | #2/3 | 2023-03-15 | source ↗ |
| 22 | SQuAD v2.0 | Natural Language Processing · Question Answering | em | 87.1% | #2/2 | 2023-03-15 | source ↗ |
| 23 | CC-OCR | Computer Vision · General OCR Capabilities | multilingual-f1 | 73.4% | #2/8 | | source ↗ |
| 24 | Defects4J | Computer Code · Program Repair | correct-patches | 82.0% | #3/5 | 2024-04-18 | source ↗ |
| 25 | CoNLL-2003 | Natural Language Processing · Named Entity Recognition | f1 | 91.7% | #3/7 | 2023-03-15 | source ↗ |
| 26 | SuperGLUE | Natural Language Processing · Text Classification | average-score | 90.3% | #3/7 | 2023-03-15 | source ↗ |
| 27 | CC-OCR | Computer Vision · General OCR Capabilities | document-parsing | 53.3% | #3/6 | | source ↗ |
| 28 | KITAB-Bench | Computer Vision · Optical Character Recognition | cer | 0.3% | #3/14 | | source ↗ |
| 29 | CrossCodeEval | Computer Code · Code Completion | exact-match | 38.2% | #4/6 | 2023-10-17 | source ↗ |
| 30 | CC-OCR | Computer Vision · General OCR Capabilities | multi-scene-f1 | 76.4% | #4/9 | | source ↗ |
| 31 | CC-OCR | Computer Vision · General OCR Capabilities | kie-f1 | 63.5% | #4/5 | | source ↗ |
| 32 | MME-VideoOCR | Computer Vision · General OCR Capabilities | total-accuracy | 66.4% | #4/6 | | source ↗ |
| 33 | videodb's-ocr-benchmark-public-collection | Computer Vision · Optical Character Recognition | cer | 0.2% | #5/5 | 2025-02-10 | source ↗ |
| 34 | MMBench | Multimodal · Visual Question Answering | accuracy | 83.4% | #5/8 | 2024-10-25 | source ↗ |
| 35 | olmOCR-Bench | Computer Vision · Document Parsing | old-scans | 40.7% | #5/5 | | source ↗ |
| 36 | VQA v2.0 | Multimodal · Visual Question Answering | accuracy | 78.5% | #6/7 | 2024-10-25 | source ↗ |
| 37 | OCRBench v2 | Computer Vision · General OCR Capabilities | overall-en-private | 55.5% | #6/27 | 2024-05-13 | source ↗ |
| 38 | TextVQA | Multimodal · Visual Question Answering | accuracy | 77.4% | #7/9 | 2024-10-25 | source ↗ |
| 39 | AIME 2024 | Reasoning · Mathematical Reasoning | accuracy | 13.4% | #8/8 | | source ↗ |
| 40 | ARC-Challenge | Reasoning · Commonsense Reasoning | accuracy | 96.4% | #8/10 | | source ↗ |
| 41 | Tau2-Bench | Agentic AI · Tool Use | pass_rate | 36.0% | #8/8 | | unverified |
| 42 | MMMU | Multimodal · Visual Question Answering | accuracy | 69.1% | #11/18 | 2024-10-25 | source ↗ |
| 43 | MBPP | Computer Code · Code Generation | pass@1 | 87.8% | #11/19 | | source ↗ |
| 44 | HLE | Reasoning · Multi-step Reasoning | accuracy | 2.7% | #13/13 | | unverified |
| 45 | HumanEval | Computer Code · Code Generation | pass@1 | 91.0% | #15/42 | | source ↗ |
| 46 | HumanEval | Computer Code · Code Generation | pass@1 | 90.2% | #17/42 | 2024-05-01 | source ↗ |
| 47 | SWE-Bench | Computer Code · Code Generation | resolve-rate-agentic | 38.4% | #19/25 | 2024-11-01 | unverified |
| 48 | MMLU-Pro | Reasoning · Commonsense Reasoning | accuracy | 72.6% | #20/20 | 2026-04-20 | source ↗ |
| 49 | GSM8K | Reasoning · Mathematical Reasoning | accuracy | 92.0% | #24/32 | | source ↗ |
| 50 | LiveCodeBench | Computer Code · Code Generation | pass@1 | 40.8% | #25/30 | 2024-03-12 | source ↗ |
| 51 | MATH | Reasoning · Mathematical Reasoning | accuracy | 76.6% | #26/34 | | source ↗ |
| 52 | MMLU | Reasoning · Commonsense Reasoning | accuracy | 87.2% | #27/41 | | source ↗ |
| 53 | SWE-Bench | Computer Code · Code Generation | resolve-rate | 19.0% | #28/32 | 2024-06-01 | source ↗ |
| 54 | GPQA | Reasoning · Multi-step Reasoning | accuracy | 49.9% | #28/33 | | source ↗ |
| 55 | OmniDocBench | Computer Vision · Document Parsing | composite | 75.0% | #29/33 | | source ↗ |
| 56 | SWE-Bench Verified | Computer Code · Code Generation | resolve-rate | 41.2% | #37/39 | | source ↗ |
| 57 | SWE-bench Verified | Agentic AI · SWE-bench | resolve-rate | 33.2% | #77/81 | | source ↗ |
The Rank column shows this model's position versus all other models scored on the same benchmark + metric (total competitors after the slash). Rank #1 means current SOTA. Rows are sorted by rank, then by newest result.
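The site does not publish its ranking code, but the rule in the legend above is easy to sketch: group results by (benchmark, metric), sort by score, and call position #1 SOTA. A minimal sketch with hypothetical field names; the set of lower-is-better metrics is an assumption inferred from the error-rate metrics (wer, cer, ocr-edit-distance) in the table:

```python
from collections import defaultdict

# Metrics where a lower score is better (assumed from the table above).
LOWER_IS_BETTER = {"wer", "cer", "ocr-edit-distance"}

def rank_results(rows):
    """rows: list of dicts with 'benchmark', 'metric', 'model', 'value'.
    Returns each row annotated with its rank, field size, and SOTA flag."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["benchmark"], r["metric"])].append(r)
    ranked = []
    for (bench, metric), rs in groups.items():
        # Higher is better unless the metric is an error rate.
        rs.sort(key=lambda r: r["value"], reverse=metric not in LOWER_IS_BETTER)
        for pos, r in enumerate(rs, start=1):
            ranked.append({**r, "rank": pos, "of": len(rs), "sota": pos == 1})
    return ranked

rows = [
    {"benchmark": "HellaSwag", "metric": "accuracy", "model": "GPT-4o", "value": 95.3},
    {"benchmark": "HellaSwag", "metric": "accuracy", "model": "other", "value": 94.0},
]
print(rank_results(rows)[0])  # GPT-4o: rank 1 of 2 -> SOTA
```

With real data the "Rank" cell is then just `f"#{r['rank']}/{r['of']}"`.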
§ 02 · Strengths by area

How GPT-4o performs across task areas.

Reasoning · 17 benchmarks · avg rank #9.6 · 9 SOTA
Computer Vision · 8 benchmarks · avg rank #5.0 · 1 SOTA
Natural Language Processing · 5 benchmarks · avg rank #2.0
Multimodal · 4 benchmarks · avg rank #7.3
Computer Code · 9 benchmarks · avg rank #14.7
Agentic AI · 3 benchmarks · avg rank #28.7
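The per-area rollup above (distinct benchmarks, average rank, SOTA count) can be sketched from the benchmark table's rows. Field names are hypothetical; the aggregation rule is assumed from the figures shown:

```python
from collections import defaultdict

def area_summary(results):
    """results: list of dicts with 'area', 'benchmark', 'rank' (int position).
    Returns per-area benchmark count, mean rank, and number of #1 results."""
    by_area = defaultdict(list)
    for r in results:
        by_area[r["area"]].append(r)
    summary = {}
    for area, rs in by_area.items():
        summary[area] = {
            "benchmarks": len({r["benchmark"] for r in rs}),  # distinct benchmarks
            "avg_rank": round(sum(r["rank"] for r in rs) / len(rs), 1),
            "sota": sum(1 for r in rs if r["rank"] == 1),
        }
    return summary

results = [
    {"area": "Reasoning", "benchmark": "HellaSwag", "rank": 1},
    {"area": "Reasoning", "benchmark": "GSM8K", "rank": 24},
]
print(area_summary(results))
# {'Reasoning': {'benchmarks': 2, 'avg_rank': 12.5, 'sota': 1}}
```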
§ 03 · Papers

9 papers with results for GPT-4o.

  1. 2025-02-28 · Agentic AI · 1 result

     BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology

     Ludovico Mitchener, Jon M Laurent, Alex Andonian, Benjamin Tenmann et al.
  2. 2025-02-10 · Computer Vision · 3 results

     Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments

  3. 2024-10-25 · Multimodal · 4 results

     SWE-bench Verified

  4. 2024-06-17 · Computer Code · 1 result

     DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

  5. 2024-04-18 · Computer Code · 1 result

     SRepair: Utilizing Multiple LLM Agents for Automated Program Repair

  6. 2024-03-12 · Computer Code · 1 result

     LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

  7. 2023-10-17 · Computer Code · 1 result

     CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion

  8. 2023-10-10 · Computer Code · 1 result

     SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

     Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao et al.
  9. 2023-03-15 · Natural Language Processing · 8 results

     GPT-4 Technical Report

§ 04 · Related models

Other OpenAI models scored on Codesota.

o3 · 16 results · 5 SOTA
o4-mini · 13 results · 3 SOTA
o3 (high) · 2 results · 1 SOTA
o4-mini (high) · 1 result · 1 SOTA
o1 · 11 results
GPT-5 · 8 results
o1-preview · Undisclosed params · 8 results
GPT-4.1 · 7 results
§ 05 · Sources & freshness

Where these numbers come from.

arxiv · 17 results
openai-blog · 7 results
alphaxiv-leaderboard · 7 results
arxiv-paper · 6 results
openai-simple-evals · 4 results
papers-with-code · 3 results
editorial · 3 results
research-paper · 1 result
cc-ocr-paper · 1 result
github-readme · 1 result
shadow-page-humaneval · 1 result
agentless · 1 result
artificial-analysis · 1 result
official-leaderboard · 1 result
sota-timeline · 1 result
OmniDocBench GitHub · 1 result
swebench-leaderboard · 1 result
25 of 57 rows are marked verified · first result 2023-03-15, latest 2026-04-20.
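The freshness line above is a simple reduction over the result rows: count the verified ones and take the earliest and latest dates. A sketch with hypothetical field names (ISO date strings sort correctly as plain strings):

```python
def freshness(rows):
    """rows: list of dicts with 'verified' (bool) and optional 'date' (ISO string).
    Returns the verified count and the first/latest result dates."""
    verified = sum(1 for r in rows if r["verified"])
    dates = sorted(r["date"] for r in rows if r.get("date"))
    return {"verified": verified, "total": len(rows),
            "first": dates[0], "latest": dates[-1]}

rows = [
    {"verified": True, "date": "2023-03-15"},
    {"verified": False, "date": "2026-04-20"},
]
print(freshness(rows))
# {'verified': 1, 'total': 2, 'first': '2023-03-15', 'latest': '2026-04-20'}
```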