Claude 3.5 Sonnet · Anthropic · 32 results · 27 benchmarks
Model card

Claude 3.5 Sonnet.

Anthropic · API · Undisclosed params · Multimodal LLM · Proprietary

Anthropic Claude 3.5 Sonnet, released June 2024.

§ 01 · Benchmarks

Every benchmark Claude 3.5 Sonnet has a recorded score for.

| # | Benchmark | Area · Task | Metric | Value | Rank | Date | Source |
|---|---|---|---|---|---|---|---|
| 01 | BIG-Bench Hard | Reasoning · Multi-step Reasoning | accuracy | 93.1% | #1/5 | | source ↗ |
| 02 | BixBench | Agentic AI · Bioinformatics Agents | accuracy | 17.0% | #1/2 | | source ↗ |
| 03 | CommonsenseQA | Reasoning · Commonsense Reasoning | accuracy | 83.2% | #2/3 | | source ↗ |
| 04 | HotpotQA | Reasoning · Multi-step Reasoning | f1 | 68.5% | #2/2 | | source ↗ |
| 05 | LogiQA | Reasoning · Logical Reasoning | accuracy | 53.8% | #2/2 | | source ↗ |
| 06 | MAWPS | Reasoning · Arithmetic Reasoning | accuracy | 95.8% | #2/3 | | source ↗ |
| 07 | ReClor | Reasoning · Logical Reasoning | accuracy | 68.9% | #2/2 | | source ↗ |
| 08 | SVAMP | Reasoning · Arithmetic Reasoning | accuracy | 91.2% | #2/3 | | source ↗ |
| 09 | StrategyQA | Reasoning · Multi-step Reasoning | accuracy | 79.8% | #2/2 | | source ↗ |
| 10 | WinoGrande | Reasoning · Commonsense Reasoning | accuracy | 85.4% | #2/3 | | source ↗ |
| 11 | CC-OCR | Computer Vision · General OCR Capabilities | kie-f1 | 64.6% | #3/5 | | source ↗ |
| 12 | HellaSwag | Reasoning · Commonsense Reasoning | accuracy | 89.0% | #3/5 | | source ↗ |
| 13 | RE-Bench | Agentic AI · RE-Bench | normalized-score | 0.1% | #4/5 | 2024-11-22 | source ↗ |
| 14 | SNLI | Natural Language Processing · Natural Language Inference | accuracy | 91.8% | #4/8 | 2024-06-20 | source ↗ |
| 15 | SQuAD v2.0 | Natural Language Processing · Question Answering | f1 | 90.2% | #4/22 | 2024-06-20 | source ↗ |
| 16 | CC-OCR | Computer Vision · General OCR Capabilities | document-parsing | 47.8% | #4/6 | | source ↗ |
| 17 | CC-OCR | Computer Vision · General OCR Capabilities | multilingual-f1 | 65.7% | #4/8 | | source ↗ |
| 18 | HCAST | Agentic AI · HCAST | success-rate | 18.0% | #5/6 | 2025-04-01 | source ↗ |
| 19 | CC-OCR | Computer Vision · General OCR Capabilities | multi-scene-f1 | 72.9% | #5/9 | | source ↗ |
| 20 | ARC-Challenge | Reasoning · Commonsense Reasoning | accuracy | 96.7% | #7/10 | | source ↗ |
| 21 | MBPP | Computer Code · Code Generation | pass@1 | 89.2% | #10/19 | | source ↗ |
| 22 | MMMU | Multimodal · Visual Question Answering | accuracy | 68.3% | #12/18 | 2024-10-22 | source ↗ |
| 23 | HumanEval | Computer Code · Code Generation | pass@1 | 92.0% | #14/42 | | source ↗ |
| 24 | GSM8K | Reasoning · Mathematical Reasoning | accuracy | 96.4% | #17/32 | | source ↗ |
| 25 | SWE-Bench | Computer Code · Code Generation | resolve-rate-agentic | 49.0% | #18/25 | 2024-12-01 | unverified |
| 26 | GSM8K | Reasoning · Mathematical Reasoning | accuracy | 95.0% | #20/32 | 2024-07-01 | source ↗ |
| 27 | MMLU | Reasoning · Commonsense Reasoning | accuracy | 88.3% | #23/41 | | source ↗ |
| 28 | GPQA | Reasoning · Multi-step Reasoning | accuracy | 59.4% | #24/33 | | source ↗ |
| 29 | SWE-Bench | Computer Code · Code Generation | resolve-rate | 27.0% | #27/32 | 2024-08-01 | source ↗ |
| 30 | MATH | Reasoning · Mathematical Reasoning | accuracy | 71.1% | #30/34 | | source ↗ |
| 31 | SWE-Bench Verified | Computer Code · Code Generation | resolve-rate | 50.8% | #32/39 | | source ↗ |
| 32 | SWE-bench Verified | Agentic AI · SWE-bench | resolve-rate | 49.0% | #67/81 | | source ↗ |

Rank column shows this model’s position vs all other models scored on the same benchmark + metric (competitors after the slash). #1 in red means current SOTA. Sorted by rank, then newest result.
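
The ordering rule in the note above ("sorted by rank, then newest result") can be sketched in a few lines of Python. This is an illustrative sketch, not Codesota's actual code: the `Row` type, `parse_rank`, and `sort_key` are hypothetical names, and placing undated rows after dated ones within the same rank is an assumption inferred from the row order in the table.

```python
from datetime import date
from typing import NamedTuple, Optional


class Row(NamedTuple):
    benchmark: str
    rank: str                    # as printed in the table, e.g. "#4/22"
    result_date: Optional[date]  # None when the Date column is empty


def parse_rank(cell: str) -> tuple[int, int]:
    """Split a rank cell such as '#4/22' into (position, number after the slash)."""
    position, total = cell.lstrip("#").split("/")
    return int(position), int(total)


def sort_key(row: Row) -> tuple[int, bool, int]:
    position, _ = parse_rank(row.rank)
    undated = row.result_date is None
    # Negate the ordinal so newer dates sort first within the same rank.
    # Assumption: undated rows go last within a rank, as the table suggests.
    recency = 0 if undated else -row.result_date.toordinal()
    return (position, undated, recency)


# A few rows taken from the table, deliberately out of order.
rows = [
    Row("CC-OCR (document-parsing)", "#4/6", None),
    Row("SNLI", "#4/8", date(2024, 6, 20)),
    Row("BIG-Bench Hard", "#1/5", None),
    Row("RE-Bench", "#4/5", date(2024, 11, 22)),
]

ordered = [r.benchmark for r in sorted(rows, key=sort_key)]
print(ordered)
```

Splitting the rank cell on the slash matters for rows such as `#4/22`, where the position and the number after the slash would otherwise be easy to misread once the date is printed next to it.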
§ 02 · Strengths by area

Where Claude 3.5 Sonnet actually performs.

Computer Vision · 1 benchmark · avg rank #4.0
Natural Language Processing · 2 benchmarks · avg rank #4.0
Reasoning · 15 benchmarks · avg rank #8.8
Multimodal · 1 benchmark · avg rank #12.0
Agentic AI · 4 benchmarks · avg rank #19.3
Computer Code · 4 benchmarks · avg rank #20.2
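
The per-area averages above can be reproduced from the rank positions in the § 01 table. The sketch below is a hedged reconstruction, not the site's implementation: it assumes the mean is taken over every table row in an area (so a benchmark with several metrics, such as CC-OCR, counts once per row) and that values are rounded half up to one decimal. `POSITIONS` and `average_rank_by_area` are hypothetical names; the numbers are the rank positions as they appear in the table.

```python
from decimal import Decimal, ROUND_HALF_UP

# Rank positions per area, copied from the Rank column of the § 01 table
# (the number before the slash), one entry per table row.
POSITIONS = {
    "Computer Vision": [3, 4, 4, 5],
    "Natural Language Processing": [4, 4],
    "Reasoning": [1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 7, 17, 20, 23, 24, 30],
    "Multimodal": [12],
    "Agentic AI": [1, 4, 5, 67],
    "Computer Code": [10, 14, 18, 27, 32],
}


def average_rank_by_area(positions: dict[str, list[int]]) -> dict[str, Decimal]:
    """Mean rank position per area, rounded half up to one decimal place."""
    averages = {}
    for area, ranks in positions.items():
        mean = Decimal(sum(ranks)) / Decimal(len(ranks))
        # Assumption: half-up rounding, chosen because it matches the page.
        averages[area] = mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
    return averages


averages = average_rank_by_area(POSITIONS)
for area, value in averages.items():
    print(f"{area}: avg rank #{value}")
```

`Decimal` with explicit half-up rounding is used here because Python's built-in `round` rounds ties to even, which would turn the Agentic AI mean of 19.25 into 19.2 rather than the 19.3 shown on the page.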
§ 03 · Papers

6 papers with results for Claude 3.5 Sonnet.

  1. 2025-04-01 · Agentic AI · 1 result

    METR: Measuring Autonomy in AI Systems (2025 Update)

  2. 2025-02-28 · Agentic AI · 1 result

    BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology

    Ludovico Mitchener, Jon M Laurent, Alex Andonian, Benjamin Tenmann et al.

  3. 2024-11-22 · Agentic AI · 1 result

    RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents Against Human Experts

  4. 2024-10-22 · Multimodal · 1 result

    Claude 3.5 Sonnet Model Card

  5. 2024-06-20 · Natural Language Processing · 2 results

    Claude 3.5 Sonnet Model Card

  6. 2023-10-10 · Computer Code · 1 result

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao et al.

§ 04 · Related models

Other Anthropic models scored on Codesota.

Claude Opus 4 · Undisclosed params · 13 results · 2 SOTA
Claude Opus 4.5 · 3 results · 2 SOTA
Claude Sonnet 5 · Undisclosed params · 2 results · 2 SOTA
Claude Sonnet 4 · 10 results · 1 SOTA
Claude Mythos Preview · 1 result · 1 SOTA
Claude Opus 4.5 · Undisclosed params · 13 results
Claude 3.7 Sonnet · 10 results
Claude 3 Opus · 5 results
§ 05 · Sources & freshness

Where these numbers come from.

anthropic-blog · 7 results
arxiv-paper · 6 results
arxiv · 4 results
openai-simple-evals · 4 results
alphaxiv-leaderboard · 2 results
cc-ocr-paper · 2 results
llm-stats-bbh · 1 result
research-paper · 1 result
official-leaderboard · 1 result
anthropic-internal · 1 result
gsm8k-shadow-page · 1 result
sota-timeline · 1 result
editorial · 1 result

10 of 32 rows marked verified · first result 2024-06-20, latest 2025-04-01.