HumanEval.

HumanEval-X is a benchmark for evaluating the multilingual ability of code generation models. It consists of 820 human-crafted data samples, each with test cases, in Python, C++, Java, JavaScript, and Go, and can be used for tasks such as code generation and translation. Each record includes a task_id field, which identifies the target language and problem ID, and a prompt field, which holds the function declaration and docstring used for code generation.
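As a rough illustration of the two fields described above, the sketch below splits a task_id into its language and problem number. It assumes the identifier has the form `<Language>/<number>` (for example `Python/0`), which this page does not spell out. The `TaskRef` type, the `parse_task_id` helper, and the sample record are hypothetical and are not part of any dataset tooling.

```python
from typing import NamedTuple


class TaskRef(NamedTuple):
    """Hypothetical container for the two parts of a task_id."""
    language: str
    problem_id: int


def parse_task_id(task_id: str) -> TaskRef:
    """Split an identifier such as 'Python/0' into (language, problem_id).

    Assumption: the ID is '<Language>/<number>'; anything else is rejected.
    """
    language, sep, number = task_id.partition("/")
    if not (sep and language and number.isdecimal()):
        raise ValueError(f"unexpected task_id format: {task_id!r}")
    return TaskRef(language, int(number))


# Made-up record that mirrors the two fields named in the description.
sample = {
    "task_id": "Python/0",
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
}

ref = parse_task_id(sample["task_id"])
print(ref.language, ref.problem_id)
```

Grouping records by the parsed language is one way to evaluate each of the five languages separately.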

§ 01 · Leaderboard

Results by metric.

Only 1 model on this benchmark

Pass@1

Pass@1 is the reported evaluation metric for HumanEval. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
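This page does not say how the listed Pass@1 score was computed or how many samples were drawn per problem. As a sketch under that caveat, the snippet below uses the commonly cited unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated for a problem and c is the number that pass the tests. For k = 1 it reduces to c / n. The function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(per_problem: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)


# Made-up counts for two problems: 3 of 10 samples pass, then 10 of 10.
print(benchmark_pass_at_k([(10, 3), (10, 10)], k=1))
```

Leaderboard scores are usually this average expressed as a percentage, so a value of 0.878 would be shown as 87.8.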

Trust tiers for Pass@1: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published: an earlier or same-year result already scored better.
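The muting rule can be sketched as follows, assuming each row carries a numeric score and a publication year and that higher scores are better. The `Result` type, the `muted_models` function, and the sample rows are made up for illustration and do not reflect how Codesota implements the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """Hypothetical leaderboard row."""
    model: str
    score: float
    year: int


def muted_models(results: list[Result]) -> set[str]:
    """Return models beaten by an earlier or same-year result."""
    muted = set()
    for row in results:
        # A row never mutes itself because its own score is not strictly higher.
        if any(o.year <= row.year and o.score > row.score for o in results):
            muted.add(row.model)
    return muted


# Invented rows, not real benchmark results.
rows = [
    Result("model-a", 60.0, 2021),
    Result("model-b", 55.0, 2022),
    Result("model-c", 70.0, 2022),
]
print(muted_models(rows))
```

Here only model-b would be muted, because model-a had already scored higher a year earlier.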

| Rank | Model | Trust | Score | Year |
|------|-------|-------|-------|------|
| 01 | Qwen2.5-Plus (dataset: HumanEval; task: 15) | paper | 87.8 | N/A |