Code Generation · benchmark dataset · 2024 · EN

LiveCodeBench.

A contamination-free coding benchmark that collects problems published on LeetCode, AtCoder, and Codeforces after a model's knowledge cutoff. The problem set is updated continuously with fresh problems. The primary metric is pass@1 on the full test set.
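As a rough illustration of the primary metric, the sketch below computes pass@k with the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k), which reduces to c/n for k = 1. The function names and the per-problem `(n_samples, n_correct)` input format are assumptions made for this example; the page does not state how many samples per problem are drawn or how the published scores were produced.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated solutions of which c are correct, passes all tests."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Average pass@1 over problems, reported on a 0-100 scale.

    `results` holds one (n_samples, n_correct) pair per problem
    (a hypothetical input format, not the benchmark's own).
    """
    return 100.0 * sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


if __name__ == "__main__":
    # Four made-up problems with 10 samples each.
    print(benchmark_pass_at_1([(10, 10), (10, 5), (10, 0), (10, 5)]))
```

With a single sample per problem, pass@1 is simply the share of problems solved on the first attempt.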

§ 01 · Leaderboard

Best published scores.

30 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.


Primary metric: pass@1 · higher is better · 30 rows
| # | Model | Type | Org | Submitted | Paper / code | pass@1 |
|---|---|---|---|---|---|---|
| 01 | Gemini 3 Pro Preview | | | Mar 2026 | vendor | 91.70 |
| 02 | Gemini 3 Flash | API | Google | Mar 2026 | vendor | 90.80 |
| 03 | GPT-5 | API | OpenAI | Apr 2026 | artificial-analysis | 85 |
| 04 | Grok 4 | API | xAI | Apr 2026 | xai-grok-4-announcement | 79 |
| 05 | Gemini 2.5 Pro | | Google | Apr 2026 | google-io-2025 | 75.60 |
| 06 | DeepSeek-R1-0528 | OSS | DeepSeek | May 2025 | deepseek-model-card | 73.30 |
| 07 | o4-mini | API | OpenAI | Mar 2024 | LiveCodeBench: Holistic and Contamination Free Evaluatio… · code | 72.80 |
| 08 | Qwen3-235B-A22B | | Alibaba | May 2025 | arxiv-2505.09388 | 70.70 |
| 09 | o3-mini | API | OpenAI | Mar 2024 | LiveCodeBench: Holistic and Contamination Free Evaluatio… · code | 66.90 |
| 10 | DeepSeek R1 | OSS | DeepSeek | Jan 2025 | arxiv-2501.12948 | 65.90 |
| 11 | o3 | API | OpenAI | Mar 2024 | LiveCodeBench: Holistic and Contamination Free Evaluatio… · code | 65.30 |
| 12 | DeepSeek-R1-Distill-Llama-70B | OSS | DeepSeek | Jan 2025 | arxiv-2501.12948 | 65.20 |
| 13 | Gemini 2.5 Flash | | Google | Apr 2026 | llm-stats | 63.90 |
| 14 | Kimi k1.5 | API | Moonshot AI | Jan 2025 | arxiv-2501.12599 | 62.50 |
| 15 | DeepSeek-R1-Distill-Qwen-32B | OSS | DeepSeek | Jan 2025 | arxiv-2501.12948 | 62.10 |
| 16 | Claude Opus 4 | API | Anthropic | Mar 2024 | LiveCodeBench: Holistic and Contamination Free Evaluatio… · code | 57.80 |
| 17 | GPT-4.1 | API | OpenAI | Mar 2024 | LiveCodeBench: Holistic and Contamination Free Evaluatio… · code | 54.40 |
| 18 | Claude Sonnet 4 | API | Anthropic | Mar 2024 | LiveCodeBench: Holistic and Contamination Free Evaluatio… · code | 52.80 |
| 19 | DeepSeek-v3-0324 | OSS | DeepSeek | Mar 2025 | deepseek-model-card | 49.20 |
| 20 | DeepSeek-V3 | OSS | DeepSeek | Mar 2024 | LiveCodeBench: Holistic and Contamination Free Evaluatio… · code | 49.20 |
| 21 | GPT-4.1 mini | API | OpenAI | Apr 2026 | pricepertoken-leaderboard | 48.30 |
| 22 | Qwen2.5-Coder 32B | OSS | Alibaba | Mar 2024 | LiveCodeBench: Holistic and Contamination Free Evaluatio… · code | 47.80 |
| 23 | Llama-4-Maverick | OSS | Meta | Apr 2025 | meta-model-card | 43.40 |
| 24 | DeepSeek-Coder-V2-Instruct | OSS | DeepSeek | Mar 2024 | LiveCodeBench: Holistic and Contamination Free Evaluatio… · code | 43.40 |
| 25 | GPT-4o | API | OpenAI | Mar 2024 | LiveCodeBench: Holistic and Contamination Free Evaluatio… · code | 40.80 |
| 26 | Gemma-3-27b | | Google | Mar 2025 | arxiv-2503.19786 | 39 |
| 27 | Llama-4-Scout | OSS | Meta | Apr 2025 | meta-model-card | 32.80 |
| 28 | Gemma 3 12B IT | | Google DeepMind | Mar 2025 | arxiv-2503.19786 | 32 |
| 29 | Codestral 22B | | Mistral | Mar 2024 | LiveCodeBench: Holistic and Contamination Free Evaluatio… · code | 29.50 |
| 30 | Gemma 3 4B IT | | Google DeepMind | Mar 2025 | arxiv-2503.19786 | 23 |
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
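The ordering described above (score descending, ties broken by submission date) can be sketched as a two-part sort key. This is an illustration rather than the site's actual code: the tuple layout and the `rank` helper are hypothetical, the day of month is set to 1 because the table only lists month and year, and the newer-first tie-break is inferred from the tied rows at 49.20 and 43.40.

```python
from datetime import date

# (model, submitted, pass@1) for a few rows taken from the leaderboard.
# Day of month is an assumption; the table only gives month and year.
rows = [
    ("DeepSeek-V3", date(2024, 3, 1), 49.20),
    ("Llama-4-Maverick", date(2025, 4, 1), 43.40),
    ("DeepSeek-v3-0324", date(2025, 3, 1), 49.20),
    ("DeepSeek-Coder-V2-Instruct", date(2024, 3, 1), 43.40),
    ("GPT-4.1 mini", date(2026, 4, 1), 48.30),
]


def rank(entries):
    """Sort by score (highest first); on equal scores, newer submission first."""
    return sorted(entries, key=lambda r: (-r[2], -r[1].toordinal()))


if __name__ == "__main__":
    for position, (model, submitted, score) in enumerate(rank(rows), start=1):
        print(f"{position:02d} {model} {submitted:%b %Y} {score:.2f}")
```

Negating both key components keeps the sort stable and avoids a separate `reverse=True` pass, which would also flip the tie-break direction.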
§ 03 · Progress

3 steps of state of the art.

Each row below marks a model that broke the previous record on pass@1. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · pass@1
  1. Mar 12, 2024 · o4-mini · OpenAI · 72.80
  2. May 28, 2025 · DeepSeek-R1-0528 · DeepSeek · 73.30
  3. Mar 15, 2026 · Gemini 3 Pro Preview · 91.70
Fig 3 · SOTA-setting models only. 3 entries span Mar 2024 to Mar 2026.
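The record-setting entries can be recovered from a leaderboard by walking the rows in date order and keeping only those that strictly beat the running best. The sketch below assumes month-level dates and a hypothetical `Entry` record; it is not the site's own implementation, and it uses only a handful of rows from the leaderboard above.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    year: int
    month: int
    score: float  # pass@1, higher is better


def sota_steps(entries: list[Entry]) -> list[Entry]:
    """Return the entries that strictly improved on every earlier score."""
    # Chronological order; within a month the highest score comes first,
    # so a lower score from the same month never counts as a record.
    ordered = sorted(entries, key=lambda e: (e.year, e.month, -e.score))
    best = float("-inf")
    steps = []
    for entry in ordered:
        if entry.score > best:
            steps.append(entry)
            best = entry.score
    return steps


rows = [
    Entry("o4-mini", 2024, 3, 72.80),
    Entry("GPT-4o", 2024, 3, 40.80),
    Entry("DeepSeek R1", 2025, 1, 65.90),
    Entry("DeepSeek-R1-0528", 2025, 5, 73.30),
    Entry("Gemini 3 Pro Preview", 2026, 3, 91.70),
    Entry("GPT-5", 2026, 4, 85.0),
]

if __name__ == "__main__":
    for step in sota_steps(rows):
        print(f"{step.year}-{step.month:02d} {step.model} {step.score:.2f}")
```

On these sample rows the function keeps o4-mini, DeepSeek-R1-0528, and Gemini 3 Pro Preview, matching the three steps listed above.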
§ 04 · Literature

1 paper tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs
  • 01 A public checkpoint or API endpoint
  • 02 A reproduction script with frozen commit + seed
  • 03 Declared evaluation environment (Python, deps)
  • 04 One row per metric declared by this dataset
  • 05 A contact so we can follow up on discrepancies
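One way to check a submission against the five items above before sending it is a small pre-flight validator. Everything in this sketch is hypothetical: the `Submission` fields, the `REQUIRED_METRICS` set, and the 40-character commit check are assumptions chosen for illustration, not the site's actual submission format, which is defined by the submission guide.

```python
import re
from dataclasses import dataclass

# Assumption: this dataset declares a single metric, pass@1.
REQUIRED_METRICS = {"pass@1"}


@dataclass
class Submission:
    model_ref: str                 # 01: public checkpoint URL or API endpoint
    repro_commit: str              # 02: frozen commit of the reproduction script
    seed: int                      # 02: random seed used for the run
    python_version: str            # 03: evaluation environment
    dependencies: dict[str, str]   # 03: package name -> pinned version
    scores: dict[str, float]       # 04: one value per declared metric
    contact: str                   # 05: where to follow up on discrepancies


def find_issues(sub: Submission) -> list[str]:
    """Return human-readable problems; an empty list means all five items are present."""
    issues = []
    if not sub.model_ref.strip():
        issues.append("missing checkpoint or API endpoint")
    if not re.fullmatch(r"[0-9a-f]{40}", sub.repro_commit):
        issues.append("commit is not a full 40-character hash")
    if not sub.python_version.strip() or not sub.dependencies:
        issues.append("evaluation environment not declared")
    missing = REQUIRED_METRICS - set(sub.scores)
    if missing:
        issues.append(f"missing metrics: {sorted(missing)}")
    if not sub.contact.strip():
        issues.append("missing contact")
    return issues


# Placeholder values only; none of these refer to a real model or result.
example = Submission(
    model_ref="https://example.com/my-model",
    repro_commit="0123456789abcdef0123456789abcdef01234567",
    seed=0,
    python_version="3.11",
    dependencies={"numpy": "1.26.4"},
    scores={"pass@1": 50.0},
    contact="me@example.com",
)

if __name__ == "__main__":
    print(find_issues(example))
```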