HumanEval+ extends HumanEval with 80x more test cases to test code robustness and edge-case handling.
Pass@1 is the reported evaluation metric for HumanEval+. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
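As a rough illustration of what a Pass@1 score means, the sketch below uses the widely used unbiased pass@k estimator, 1 - C(n - c, k) / C(n, k), which reduces to c / n when k = 1. The page does not say how each source computed its number, so this is an assumption: many reports use a single greedy sample per problem, which corresponds to n = 1. The function names `pass_at_k` and `benchmark_pass_at_1` are hypothetical and chosen only for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem: n samples drawn, c of them pass every test."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no passing sample at all.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Mean pass@1 over all problems, as a percentage.

    `results` holds one (n, c) pair per problem (assumed input shape).
    """
    return 100.0 * sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


if __name__ == "__main__":
    # Four problems, one greedy sample each, three of them solved.
    print(benchmark_pass_at_1([(1, 1), (1, 0), (1, 1), (1, 1)]))
```

Under these assumptions, a score of 89 in the table would mean that about 89% of the benchmark problems were solved on the first attempt.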
Muted rows were not state of the art when published: an earlier or same-year result had already scored higher.
| Rank | Model | Trust | Pass@1 (%) | Year | Links |
|---|---|---|---|---|---|
| 01 | Llama 3 (405B, Instruct) | unverified | 89 | 2024 | Paper ↗ Code ↗ |
| 02 | Qwen2.5-Plus | unverified | 87.8 | 2024 | Paper ↗ Code ↗ |
| 03 | Qwen2.5-VL-72B | unverified | 87.8 | 2025 | Paper ↗ Code ↗ |
| 04 | MiniCPM-o 4.5-Instruct | unverified | 86.6 | 2026 | Paper ↗ Code ↗ |
| 05 | Step-3.5-Flash Base | unverified | 81.1 | 2026 | Paper ↗ Code ↗ |
| 06 | Aria | unverified | 73.2 | 2024 | Paper ↗ Code ↗ |
| 07 | Code Llama - Instruct 70B | unverified | 67.8 | 2023 | Paper ↗ Code ↗ |
| 08 | BLT-Entropy 8B | unverified | 35.4 | 2024 | Paper ↗ Code ↗ |
| 09 | Llama 2 70B (5-shot) | unverified | 29.9 | 2023 | Paper ↗ Code ↗ |
| 10 | LLaMA-65B | unverified | 23.7 | 2023 | Paper ↗ Code ↗ |
| 11 | SmoLM2 (1.7B) | unverified | 22.6 | 2025 | Paper ↗ Code ↗ |
| 12 | BLOOM-176B | unverified | 15.52 | 2022 | Paper ↗ Code ↗ |
Verified Pass@1 results for HumanEval+ (higher is better):
| Rank | Model | Trust | Pass@1 (%) | Year | Links |
|---|---|---|---|---|---|
| 01 | Qwen2.5-Coder-32B | verified | 87.2 | 2024 | Source ↗ |
| 02 | DeepSeek-V3 | verified | 86.6 | 2025 | Source ↗ |
| 03 | GPT-4o | verified | 86 | 2024 | Source ↗ |
| 04 | DeepSeek-Coder-V2 | verified | 82.3 | 2024 | Source ↗ |
| 05 | DeepSeek-Coder-33B | verified | 75 | 2024 | Source ↗ |
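The muted-row rule stated earlier on this page can be sketched in a few lines. This is one possible reading, not Codesota's actual implementation: it assumes year-level granularity, treats a row as muted when any other row from the same or an earlier year has a strictly higher score, and lets ties stay unmuted. The `Entry` class and `muted` function are hypothetical names; the data is copied from the verified table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str
    score: float
    year: int


def muted(entries: list[Entry]) -> set[str]:
    """Return the models that were not state of the art when published (assumed rule)."""
    result = set()
    for e in entries:
        # Assumption: a strictly higher score from the same or an earlier year mutes e.
        if any(o.year <= e.year and o.score > e.score for o in entries if o is not e):
            result.add(e.model)
    return result


# Rows from the verified table above.
verified = [
    Entry("Qwen2.5-Coder-32B", 87.2, 2024),
    Entry("DeepSeek-V3", 86.6, 2025),
    Entry("GPT-4o", 86.0, 2024),
    Entry("DeepSeek-Coder-V2", 82.3, 2024),
    Entry("DeepSeek-Coder-33B", 75.0, 2024),
]

if __name__ == "__main__":
    print(sorted(muted(verified)))
```

Under this reading, only Qwen2.5-Coder-32B would remain unmuted in the verified table, because every other row is matched or preceded by a higher 2024 score.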