DeepSeek R1 · DeepSeek · 19 results · 13 benchmarks
Model card

DeepSeek R1.

DeepSeek · open-source · 671B MoE params
§ 02 · Benchmarks

Every benchmark DeepSeek R1 has a recorded score for.

| # | Benchmark | Area · Task | Metric | Value | Rank | Date | Source |
|---|---|---|---|---|---|---|---|
| 01 | ARC-Challenge | Reasoning · Commonsense Reasoning | accuracy | 97.1% | #5/10 | | source ↗ |
| 02 | MATH | Reasoning · Mathematical Reasoning | accuracy | 97.3% | #6/46 | | source ↗ |
| 03 | MMLU | Reasoning · Commonsense Reasoning | accuracy | 90.8% | #8/64 | 2025-01-22 | source ↗ |
| 04 | AIME 2024 | Reasoning · Mathematical Reasoning | accuracy | 79.8% | #9/11 | | source ↗ |
| 05 | LiveCodeBench Pro | Computer Code · Code Generation | elo | 1161.00 | #9/10 | | source ↗ |
| 06 | LiveCodeBench | Computer Code · Code Generation | pass@1 | 65.9% | #10/30 | | source ↗ |
| 07 | SWE-bench | Computer Code · Code Generation | resolve-rate | 76.3% | #13/32 | 2025-12-01 | source ↗ |
| 08 | GSM8K | Reasoning · Mathematical Reasoning | accuracy | 97.3% | #17/48 | | source ↗ |
| 09 | AIME 2025 | Reasoning · Mathematical Reasoning | accuracy | 72.0% | #19/22 | | source ↗ |
| 10 | SWE-Bench Verified | Computer Code · Code Generation | resolve-rate | 49.2% | #33/39 | | source ↗ |
| 11 | PLCC | Natural Language Processing · Polish Cultural Competency | grammar | 74.0% | #34/165 | | source ↗ |
| 12 | PLCC | Natural Language Processing · Polish Cultural Competency | vocabulary | 72.0% | #39/165 | | source ↗ |
| 13 | PLCC | Natural Language Processing · Polish Cultural Competency | history | 85.0% | #40/165 | | source ↗ |
| 14 | PLCC | Natural Language Processing · Polish Cultural Competency | average | 76.0% | #45/165 | | source ↗ |
| 15 | PLCC | Natural Language Processing · Polish Cultural Competency | geography | 84.0% | #45/165 | | source ↗ |
| 16 | GPQA Diamond | Reasoning · Multi-step Reasoning | accuracy | 71.5% | #46/74 | | source ↗ |
| 17 | PLCC | Natural Language Processing · Polish Cultural Competency | art-and-entertainment | 66.0% | #47/165 | | source ↗ |
| 18 | PLCC | Natural Language Processing · Polish Cultural Competency | culture-and-tradition | 75.0% | #53/165 | | source ↗ |
| 19 | HLE | Reasoning · Multi-step Reasoning | accuracy | 8.5% | #55/74 | | unverified |

The Rank column shows this model's position against all other models scored on the same benchmark and metric (competitors after the slash). A #1, shown in red, marks the current SOTA. Rows are sorted by rank, then by newest result.
§ 03 · Strengths by area

Where DeepSeek R1 actually performs.

Computer Code · 4 benchmarks · avg rank #16.3
Reasoning · 8 benchmarks · avg rank #20.6
Natural Language Processing · 1 benchmark · avg rank #43.3
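
The page does not say how the per-area average rank is computed. The figures are consistent with a plain mean of the rank positions in the benchmark table, taken over every result row in an area (so all seven PLCC rows count, even though PLCC is one benchmark). The sketch below reproduces that arithmetic under this assumption; the variable names are illustrative and not part of Codesota.

```python
from collections import defaultdict

# (area, rank) pairs copied from the Rank column of the benchmark table.
# Assumption: the per-area figure is an unweighted mean over result rows.
ranks = [
    ("Reasoning", 5), ("Reasoning", 6), ("Reasoning", 8), ("Reasoning", 9),
    ("Computer Code", 9), ("Computer Code", 10), ("Computer Code", 13),
    ("Reasoning", 17), ("Reasoning", 19), ("Computer Code", 33),
    ("Natural Language Processing", 34), ("Natural Language Processing", 39),
    ("Natural Language Processing", 40), ("Natural Language Processing", 45),
    ("Natural Language Processing", 45), ("Reasoning", 46),
    ("Natural Language Processing", 47), ("Natural Language Processing", 53),
    ("Reasoning", 55),
]

# Group the rank positions by area.
by_area = defaultdict(list)
for area, rank in ranks:
    by_area[area].append(rank)

# Mean rank per area; lower is better.
avg_rank = {area: sum(r) / len(r) for area, r in by_area.items()}

for area, value in sorted(avg_rank.items(), key=lambda kv: kv[1]):
    print(f"{area}: mean rank {value:.3f} over {len(by_area[area])} results")
```

The means come out at 16.25, 20.625, and roughly 43.29, which agree with the displayed #16.3, #20.6, and #43.3 to one decimal place. The average ignores field size, so a #9 out of 10 and a #9 out of 46 weigh the same.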
§ 04 · Papers

1 paper with results for DeepSeek R1.

  1. 2023-10-10 · Computer Code · 1 result

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao et al.
§ 05 · Related models

Other DeepSeek models scored on Codesota.

DeepSeek-V4-Pro Max · 4 results · 1 SOTA
DeepSeek-V3 · 7 results
DeepSeek-V3.2 · 6 results
DeepSeek-Coder-V2-Instruct · Unknown params · 4 results
DeepSeek-OCR · 4 results
DeepSeek-V4-Flash Max · 4 results
DeepSeek-V3.2-Speciale · 3 results
DeepSeek V3.5 · 685B MoE params · 2 results
§ 06 · Sources & freshness

Where these numbers come from.

sdadas/PLCC · 7 results
arxiv · 6 results
swebench-leaderboard · 2 results
deepseek-paper · 1 result
livecodebench-pro-official · 1 result
arxiv-2501.12948 · 1 result
editorial · 1 result

16 of 19 rows marked verified · first result 2025-01-22, latest 2025-12-01.