Code Generation · benchmark dataset · 2023 · PYTHON

HumanEval+ Extended Version.

An extension of HumanEval with 80x more test cases, designed to test code robustness and edge-case handling.
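To illustrate why a larger test suite matters, here is a minimal sketch. The task, the buggy solution, and both test lists are hypothetical and are not taken from HumanEval+; they only show how a solution can pass a small base suite and then fail once edge cases are added.

```python
def median(values):
    """Hypothetical buggy solution: wrong for even-length inputs."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def passes(solution, tests):
    """Return True only if the solution matches the expected output on every test."""
    for args, expected in tests:
        try:
            if solution(*args) != expected:
                return False
        except Exception:
            # Any crash counts as a failed test.
            return False
    return True


# Small base suite (hypothetical): only odd-length inputs.
base_tests = [
    (([3, 1, 2],), 2),
    (([5],), 5),
]

# Extended suite (hypothetical): adds even-length edge cases.
extended_tests = base_tests + [
    (([1, 2, 3, 4],), 2.5),
    (([4, 2],), 3.0),
]

print(passes(median, base_tests))      # True
print(passes(median, extended_tests))  # False
```

The same solution counts as correct under the base suite and incorrect under the extended one, which is why scores on an extended benchmark tend to be lower than on the original.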

§ 01 · Leaderboard

Best published scores.

12 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.


Primary metric: pass@1 (higher is better) · 12 rows
| #  | Model | Org | Submitted | Paper / code | pass@1 |
|----|-------|-----|-----------|--------------|--------|
| 01 | Llama 3 (405B, Instruct) | Meta | Jul 2024 | The Llama 3 Herd of Models · code | 89 |
| 02 | Qwen2.5-Plus | | Dec 2024 | Qwen2.5 Technical Report · code | 87.80 |
| 03 | Qwen2.5-VL-72B | | Feb 2025 | Qwen2.5-VL Technical Report · code | 87.80 |
| 04 | MiniCPM-o 4.5-Instruct | | Apr 2026 | MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal … · code | 86.60 |
| 05 | Step-3.5-Flash Base | | Feb 2026 | Step 3.5 Flash: Open Frontier-Level Intelligence with 11… · code | 81.10 |
| 06 | Aria | | Oct 2024 | Aria: An Open Multimodal Native Mixture-of-Experts Model · code | 73.20 |
| 07 | Code Llama - Instruct 70B | | Aug 2023 | Code Llama: Open Foundation Models for Code · code | 67.80 |
| 08 | BLT-Entropy 8B | | Dec 2024 | Byte Latent Transformer: Patches Scale Better Than Token… · code | 35.40 |
| 09 | Llama 2 70B (5-shot) | | Jul 2023 | Llama 2: Open Foundation and Fine-Tuned Chat Models · code | 29.90 |
| 10 | LLaMA-65B | | Feb 2023 | LLaMA: Open and Efficient Foundation Language Models · code | 23.70 |
| 11 | SmolLM2 (1.7B) | | Feb 2025 | SmolLM2: When Smol Goes Big -- Data-Centric Training of … · code | 22.60 |
| 12 | BLOOM-176B | | Nov 2022 | BLOOM: A 176B-Parameter Open-Access Multilingual Languag… · code | 15.52 |
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
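For readers unfamiliar with the metric, the sketch below shows one common way pass@k is computed: the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass all tests. This page does not state how each listed score was produced (for example, greedy decoding versus sampling), so the function names and the sample counts below are illustrative assumptions, not the procedure used for any row.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples has no correct one.
    # math.comb returns 0 when n - c < k, which gives pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems, as a percentage.

    results: list of (n, c) pairs, one per problem (hypothetical input format).
    """
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem outcomes: (samples drawn, samples passing all tests).
example_results = [(10, 10), (10, 5), (10, 0), (10, 1)]
print(round(benchmark_pass_at_k(example_results, k=1), 2))  # 40.0
```

For k = 1 the estimator reduces to c / n per problem, so pass@1 is simply the average fraction of samples that pass every test.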
§ 04 · Literature

12 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs
  • 01 A public checkpoint or API endpoint
  • 02 A reproduction script with frozen commit + seed
  • 03 Declared evaluation environment (Python, deps)
  • 04 One row per metric declared by this dataset
  • 05 A contact so we can follow up on discrepancies
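The requirements above could be captured in a small manifest written alongside the reproduction script. The sketch below is only an illustration: the submission format is not specified on this page, so the function name `build_manifest`, every field name, and all example values are hypothetical.

```python
import json
import platform
from importlib import metadata


def build_manifest(model, commit, seed, scores, contact, packages=()):
    """Collect the submission details into a JSON-serializable dict.

    All field names are hypothetical; adapt them to the actual submission guide.
    """
    deps = {}
    for name in packages:
        try:
            deps[name] = metadata.version(name)  # installed version string
        except metadata.PackageNotFoundError:
            deps[name] = None  # declared but not installed in this environment
    return {
        "model": model,          # public checkpoint or API endpoint
        "commit": commit,        # frozen commit of the reproduction script
        "seed": seed,            # random seed used for the run
        "environment": {
            "python": platform.python_version(),
            "platform": platform.platform(),
            "dependencies": deps,
        },
        "scores": scores,        # one entry per metric declared by the dataset
        "contact": contact,      # where to follow up on discrepancies
    }


# Hypothetical example values, for illustration only.
manifest = build_manifest(
    model="example-org/example-model",
    commit="0000000",
    seed=1234,
    scores={"pass@1": 0.0},
    contact="maintainer@example.com",
    packages=("pip",),
)
print(json.dumps(manifest, indent=2))
```

Recording the interpreter version and dependency versions at run time, rather than typing them by hand, keeps the declared environment in step with what was actually used.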