How many models are tracked on HumanEval+?

Codesota tracks 12 models on HumanEval+.

When was the HumanEval+ leaderboard last updated?

The HumanEval+ leaderboard on Codesota includes results through April 2026; the earliest tracked result dates from November 2022.

Code Generation · benchmark dataset · 2023 · PYTHON

HumanEval+ Extended Version.

Name: HumanEval+ Extended Version Benchmark Results
Creator: Codesota
Published: 2022-01-01
License: https://creativecommons.org/licenses/by/4.0/

An extended version of HumanEval with 80x more test cases, built to test code robustness and edge-case handling.
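To illustrate why extra test cases matter, here is a minimal sketch in which a plausible-looking solution passes a small base test set but fails once edge cases are added. The task (a median function), the test inputs, and every name are hypothetical and are not taken from HumanEval or HumanEval+.

```python
def buggy_median(xs):
    """Plausible-looking solution that mishandles even-length inputs."""
    s = sorted(xs)
    return s[len(s) // 2]


def reference_median(xs):
    """Reference solution used as the ground truth for every test input."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2


def passes(candidate, inputs):
    """True if the candidate agrees with the reference on every input."""
    return all(candidate(x) == reference_median(x) for x in inputs)


# Hypothetical base suite: only odd-length lists, so the bug stays hidden.
base_tests = [[3, 1, 2], [5], [9, 7, 8, 1, 4]]

# Hypothetical extended suite: adds even-length edge cases.
extended_tests = base_tests + [[1, 2], [4, 1, 3, 2], [0, 0, 1, 1]]

base_ok = passes(buggy_median, base_tests)          # True
extended_ok = passes(buggy_median, extended_tests)  # False
```

The same candidate counts as solved under the base suite and as failed under the extended one, which is why scores on an extended test set tend to be lower than on the original.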

§ 01 · Leaderboard

Best published scores.

12 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.

Primary: pass@1 · higher is better
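The page does not say how each source computed its pass@1 score. As a rough sketch, assuming the unbiased pass@k estimator introduced with the original HumanEval benchmark, the metric could be computed as below. The function names and the sample data are illustrative, not part of Codesota.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int) -> float:
    """Average per-problem pass@k over (n, c) pairs, as a percentage."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical run: four problems, one greedy sample each, three solved.
samples = [(1, 1), (1, 0), (1, 1), (1, 1)]
score = benchmark_pass_at_k(samples, k=1)  # 75.0
```

With k = 1 the estimator reduces to the fraction of correct samples per problem, so a single-sample run yields the share of problems solved.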

#	Model	Org	Submitted	Paper / code	pass@1
01	Llama 3 (405B, Instruct)	Meta	Jul 2024	The Llama 3 Herd of Models · code	89
02	Qwen2.5-Plus	—	Dec 2024	Qwen2.5 Technical Report · code	87.80
03	Qwen2.5-VL-72B	—	Feb 2025	Qwen2.5-VL Technical Report · code	87.80
04	MiniCPM-o 4.5-Instruct	—	Apr 2026	MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal … · code	86.60
05	Step-3.5-Flash Base	—	Feb 2026	Step 3.5 Flash: Open Frontier-Level Intelligence with 11… · code	81.10
06	Aria	—	Oct 2024	Aria: An Open Multimodal Native Mixture-of-Experts Model · code	73.20
07	Code Llama - Instruct 70B	—	Aug 2023	Code Llama: Open Foundation Models for Code · code	67.80
08	BLT-Entropy 8B	—	Dec 2024	Byte Latent Transformer: Patches Scale Better Than Token… · code	35.40
09	Llama 2 70B (5-shot)	—	Jul 2023	Llama 2: Open Foundation and Fine-Tuned Chat Models · code	29.90
10	LLaMA-65B	—	Feb 2023	LLaMA: Open and Efficient Foundation Language Models · code	23.70
11	SmolLM2 (1.7B)	—	Feb 2025	SmolLM2: When Smol Goes Big -- Data-Centric Training of … · code	22.60
12	BLOOM-176B	—	Nov 2022	BLOOM: A 176B-Parameter Open-Access Multilingual Languag… · code	15.52

Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
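The ordering rule (score descending, ties broken by submission date) can be sketched as follows. The four rows come from the table above; the sketch assumes that the earlier submission wins a tie, which matches rows 02 and 03, and it pins each month to its first day because the page lists only month and year.

```python
from datetime import date

# (model, pass@1, submitted) taken from the table; day-of-month is assumed.
rows = [
    ("Qwen2.5-VL-72B", 87.80, date(2025, 2, 1)),
    ("Llama 3 (405B, Instruct)", 89.0, date(2024, 7, 1)),
    ("Aria", 73.20, date(2024, 10, 1)),
    ("Qwen2.5-Plus", 87.80, date(2024, 12, 1)),
]


def rank(entries):
    """Sort by score (highest first), then by submission date (earliest first)."""
    return sorted(entries, key=lambda r: (-r[1], r[2]))


ranked = rank(rows)
sota = ranked[0]  # the row that would be shaded as current SOTA

for position, (model, score, submitted) in enumerate(ranked, start=1):
    print(f"{position:02d}  {model:<26} {submitted:%b %Y}  {score:.2f}")
```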

§ 04 · Literature

12 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
Apr 2026 · MiniCPM-o 4.5-Instruct

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Feb 2026 · Step-3.5-Flash Base

Qwen2.5-VL Technical Report
Feb 2025 · Qwen2.5-VL-72B

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
Feb 2025 · SmolLM2 (1.7B)

Qwen2.5 Technical Report
Dec 2024 · Qwen2.5-Plus

Byte Latent Transformer: Patches Scale Better Than Tokens
Dec 2024 · BLT-Entropy 8B

Aria: An Open Multimodal Native Mixture-of-Experts Model
Oct 2024 · Aria

The Llama 3 Herd of Models
Jul 2024 · Llama 3 (405B, Instruct)

Code Llama: Open Foundation Models for Code
Aug 2023 · Code Llama - Instruct 70B

Llama 2: Open Foundation and Fine-Tuned Chat Models
Jul 2023 · Llama 2 70B (5-shot)

LLaMA: Open and Efficient Foundation Language Models
Feb 2023 · LLaMA-65B

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Nov 2022 · BLOOM-176B

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs

01 A public checkpoint or API endpoint
02 A reproduction script with frozen commit + seed
03 Declared evaluation environment (Python, deps)
04 One row per metric declared by this dataset
05 A contact so we can follow up on discrepancies
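One way a reproduction script might record its seed, commit, and evaluation environment is sketched below. The page does not specify a manifest format, so the field names, the `describe_environment` helper, and the placeholder commit string are all hypothetical; only standard-library calls are used.

```python
import json
import platform
import random
from importlib import metadata


def describe_environment(seed: int, commit: str) -> dict:
    """Collect the Python version, installed packages, commit, and seed."""
    packages = sorted(
        {
            f"{dist.metadata.get('Name')}=={dist.version}"
            for dist in metadata.distributions()
            if dist.metadata.get("Name")
        }
    )
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "commit": commit,  # hypothetical field: the frozen commit of the script
        "seed": seed,      # hypothetical field: the seed used for the run
        "packages": packages,
    }


seed = 1234
random.seed(seed)  # fix the seed before any sampling happens

# The commit value is a placeholder; a real script would fill in its own hash.
env = describe_environment(seed=seed, commit="<frozen-commit-hash>")
manifest = json.dumps(env, indent=2)
```

Writing `manifest` next to the reported scores gives a reviewer the declared Python version and dependency list needed to rerun the evaluation.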
