Model card
GPT-4o
OpenAI · API · Undisclosed params · Multimodal LLM · Proprietary · 10 current SOTA results
Flagship GPT-4-class model with omni modality. Released May 2024.
§ 01 · Benchmarks
Every benchmark with a recorded GPT-4o score.
| # | Benchmark | Area · Task | Metric | Value | Rank | Date | Source |
|---|---|---|---|---|---|---|---|
| 01 | videodb's-ocr-benchmark-public-collection | Computer Vision · Optical Character Recognition | accuracy | 76.2% | #1 | 2025-02-10 | source ↗ |
| 02 | SNLI | Natural Language Processing · Natural Language Inference | accuracy | 92.6% | #1 | 2023-03-15 | source ↗ |
| 03 | SQuAD v2.0 | Natural Language Processing · Question Answering | f1 | 91.4% | #1 | 2023-03-15 | source ↗ |
| 04 | BixBench | Agentic AI · Bioinformatics Agents | accuracy | 17.0% | #1 | — | source ↗ |
| 05 | Bugs2Fix | Computer Code · Bug Detection | accuracy | 78.6% | #1 | — | source ↗ |
| 06 | CodeSearchNet | Computer Code · Code Documentation Generation | bleu-4 | 25.3% | #1 | — | source ↗ |
| 07 | CommonsenseQA | Reasoning · Commonsense Reasoning | accuracy | 85.4% | #1 | — | source ↗ |
| 08 | HellaSwag | Reasoning · Commonsense Reasoning | accuracy | 95.3% | #1 | — | source ↗ |
| 09 | HotpotQA | Reasoning · Multi-step Reasoning | f1 | 71.3% | #1 | — | source ↗ |
| 10 | LogiQA | Reasoning · Logical Reasoning | accuracy | 56.3% | #1 | — | source ↗ |
| 11 | MAWPS | Reasoning · Arithmetic Reasoning | accuracy | 97.2% | #1 | — | source ↗ |
| 12 | OmniDocBench | Computer Vision · Document Parsing | ocr-edit-distance | 0.0% | #1 | — | source ↗ |
| 13 | ReClor | Reasoning · Logical Reasoning | accuracy | 72.4% | #1 | — | source ↗ |
| 14 | SVAMP | Reasoning · Arithmetic Reasoning | accuracy | 93.7% | #1 | — | source ↗ |
| 15 | StrategyQA | Reasoning · Multi-step Reasoning | accuracy | 82.1% | #1 | — | source ↗ |
| 16 | WinoGrande | Reasoning · Commonsense Reasoning | accuracy | 87.5% | #1 | — | source ↗ |
| 17 | videodb's-ocr-benchmark-public-collection | Computer Vision · Optical Character Recognition | wer | 0.5% | #2 | 2025-02-10 | source ↗ |
| 18 | TransCoder (GeeksForGeeks) | Computer Code · Code Translation | computational-accuracy | 88.2% | #2 | 2024-06-17 | source ↗ |
| 19 | CNN/DailyMail | Natural Language Processing · Text Summarization | rouge-l | 43.4% | #2 | 2023-03-15 | source ↗ |
| 20 | CNN/DailyMail | Natural Language Processing · Text Summarization | rouge-1 | 46.3% | #2 | 2023-03-15 | source ↗ |
| 21 | CNN/DailyMail | Natural Language Processing · Text Summarization | rouge-2 | 22.1% | #2 | 2023-03-15 | source ↗ |
| 22 | SQuAD v2.0 | Natural Language Processing · Question Answering | em | 87.1% | #2 | 2023-03-15 | source ↗ |
| 23 | CC-OCR | Computer Vision · General OCR Capabilities | multilingual-f1 | 73.4% | #2 | — | source ↗ |
| 24 | Defects4J | Computer Code · Program Repair | correct-patches | 82.0% | #3 | 2024-04-18 | source ↗ |
| 25 | CoNLL-2003 | Natural Language Processing · Named Entity Recognition | f1 | 91.7% | #3 | 2023-03-15 | source ↗ |
| 26 | SuperGLUE | Natural Language Processing · Text Classification | average-score | 90.3% | #3 | 2023-03-15 | source ↗ |
| 27 | CC-OCR | Computer Vision · General OCR Capabilities | document-parsing | 53.3% | #3 | — | source ↗ |
| 28 | KITAB-Bench | Computer Vision · Optical Character Recognition | cer | 0.3% | #3 | — | source ↗ |
| 29 | CrossCodeEval | Computer Code · Code Completion | exact-match | 38.2% | #4 | 2023-10-17 | source ↗ |
| 30 | CC-OCR | Computer Vision · General OCR Capabilities | multi-scene-f1 | 76.4% | #4 | — | source ↗ |
| 31 | CC-OCR | Computer Vision · General OCR Capabilities | kie-f1 | 63.5% | #4 | — | source ↗ |
| 32 | MME-VideoOCR | Computer Vision · General OCR Capabilities | total-accuracy | 66.4% | #4 | — | source ↗ |
| 33 | videodb's-ocr-benchmark-public-collection | Computer Vision · Optical Character Recognition | cer | 0.2% | #5 | 2025-02-10 | source ↗ |
| 34 | MMBench | Multimodal · Visual Question Answering | accuracy | 83.4% | #5 | 2024-10-25 | source ↗ |
| 35 | olmOCR-Bench | Computer Vision · Document Parsing | old-scans | 40.7% | #5 | — | source ↗ |
| 36 | VQA v2.0 | Multimodal · Visual Question Answering | accuracy | 78.5% | #6 | 2024-10-25 | source ↗ |
| 37 | OCRBench v2 | Computer Vision · General OCR Capabilities | overall-en-private | 55.5% | #6 | 2024-05-13 | source ↗ |
| 38 | TextVQA | Multimodal · Visual Question Answering | accuracy | 77.4% | #7 | 2024-10-25 | source ↗ |
| 39 | AIME 2024 | Reasoning · Mathematical Reasoning | accuracy | 13.4% | #8 | — | source ↗ |
| 40 | ARC-Challenge | Reasoning · Commonsense Reasoning | accuracy | 96.4% | #8 | — | source ↗ |
| 41 | Tau2-Bench | Agentic AI · Tool Use | pass_rate | 36.0% | #8 | — | |
| 42 | MMMU | Multimodal · Visual Question Answering | accuracy | 69.1% | #11 | 2024-10-25 | source ↗ |
| 43 | MBPP | Computer Code · Code Generation | pass@1 | 87.8% | #11 | — | source ↗ |
| 44 | HLE | Reasoning · Multi-step Reasoning | accuracy | 2.7% | #13 | — | |
| 45 | HumanEval | Computer Code · Code Generation | pass@1 | 91.0% | #15 | — | source ↗ |
| 46 | HumanEval | Computer Code · Code Generation | pass@1 | 90.2% | #17 | 2024-05-01 | source ↗ |
| 47 | SWE-Bench | Computer Code · Code Generation | resolve-rate-agentic | 38.4% | #19 | 2024-11-01 | |
| 48 | MMLU-Pro | Reasoning · Commonsense Reasoning | accuracy | 72.6% | #20 | 2026-04-20 | source ↗ |
| 49 | GSM8K | Reasoning · Mathematical Reasoning | accuracy | 92.0% | #24 | — | source ↗ |
| 50 | LiveCodeBench | Computer Code · Code Generation | pass@1 | 40.8% | #25 | 2024-03-12 | source ↗ |
| 51 | MATH | Reasoning · Mathematical Reasoning | accuracy | 76.6% | #26 | — | source ↗ |
| 52 | MMLU | Reasoning · Commonsense Reasoning | accuracy | 87.2% | #27 | — | source ↗ |
| 53 | SWE-Bench | Computer Code · Code Generation | resolve-rate | 19.0% | #28 | 2024-06-01 | source ↗ |
| 54 | GPQA | Reasoning · Multi-step Reasoning | accuracy | 49.9% | #28 | — | source ↗ |
| 55 | OmniDocBench | Computer Vision · Document Parsing | composite | 75.0% | #29 | — | source ↗ |
| 56 | SWE-Bench Verified | Computer Code · Code Generation | resolve-rate | 41.2% | #37 | — | source ↗ |
| 57 | SWE-bench Verified | Agentic AI · SWE-bench | resolve-rate | 33.2% | #77 | — | source ↗ |
The Rank column shows this model's position among all models scored on the same benchmark + metric. #1 marks current SOTA. Rows are sorted by rank, then by newest result.
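The ranking and sort order described above can be sketched in a few lines. This is a minimal illustration of the stated rule, not Codesota's actual implementation; the field names are assumptions, and it treats every metric as higher-is-better (error-rate metrics such as wer, cer, and ocr-edit-distance would invert the comparison).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Result:
    model: str
    benchmark: str
    metric: str
    value: float                      # assumed higher-is-better
    result_date: Optional[date] = None

def rank_of(target: Result, all_results: list) -> int:
    """Rank the target among all models scored on the same
    benchmark + metric pair (1 = current SOTA)."""
    peers = [r for r in all_results
             if r.benchmark == target.benchmark and r.metric == target.metric]
    better = sum(1 for r in peers if r.value > target.value)
    return better + 1

def table_order(rows: list) -> list:
    """Sort (rank, result) rows by rank ascending, then newest result first;
    undated rows sort after dated ones within the same rank."""
    return sorted(rows, key=lambda t: (
        t[0],
        -(t[1].result_date.toordinal() if t[1].result_date else 0),
    ))
```

For example, a model scoring 87.2% on a benchmark where one peer scores 90.0% and another 85.0% would rank #2 under this rule.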
§ 02 · Strengths by area
Where GPT-4o performs best, by task area.
§ 03 · Papers
9 papers with results for GPT-4o.
- 2025-02-28 · Agentic AI · 1 result
  BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology
  Ludovico Mitchener, Jon M Laurent, Alex Andonian, Benjamin Tenmann et al.
- 2025-02-10 · Computer Vision · 3 results
  Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments
- 2024-10-25 · Multimodal · 4 results
  SWE-bench Verified
- 2024-06-17 · Computer Code · 1 result
  DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
- 2024-04-18 · Computer Code · 1 result
  SRepair: Utilizing Multiple LLM Agents for Automated Program Repair
- 2024-03-12 · Computer Code · 1 result
  LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
- 2023-10-17 · Computer Code · 1 result
  CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion
- 2023-10-10 · Computer Code · 1 result
  SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
  Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao et al.
- 2023-03-15 · Natural Language Processing · 8 results
  GPT-4 Technical Report
§ 04 · Related models
Other OpenAI models scored on Codesota.
§ 05 · Sources & freshness
Where these numbers come from.
- arxiv · 17 results
- openai-blog · 7 results
- alphaxiv-leaderboard · 7 results
- arxiv-paper · 6 results
- openai-simple-evals · 4 results
- papers-with-code · 3 results
- editorial · 3 results
- research-paper · 1 result
- cc-ocr-paper · 1 result
- github-readme · 1 result
- shadow-page-humaneval · 1 result
- agentless · 1 result
- artificial-analysis · 1 result
- official-leaderboard · 1 result
- sota-timeline · 1 result
- OmniDocBench GitHub · 1 result
- swebench-leaderboard · 1 result
25 of 57 rows are marked verified. First result 2023-03-15; latest 2026-04-20.
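The freshness line above is a simple aggregate over the results table. A sketch of how such a summary could be derived (the row keys `verified` and `date` are hypothetical, not Codesota's schema):

```python
from datetime import date

def freshness_summary(rows: list) -> str:
    """Summarize verification coverage and the date range of a results
    table. Each row is a dict that may carry 'verified' (bool) and
    'date' (datetime.date); undated rows are skipped for the range."""
    verified = sum(1 for r in rows if r.get("verified"))
    dates = [r["date"] for r in rows if r.get("date")]
    span = (f"first result {min(dates)}, latest {max(dates)}"
            if dates else "no dated results")
    return f"{verified} of {len(rows)} rows marked verified. {span}"
```

With three rows, two verified and two dated, this yields e.g. "2 of 3 rows marked verified. first result 2023-03-15, latest 2026-04-20".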