MultiPL-E is a multi-programming-language benchmark for evaluating natural-language-to-code generation by large language models. It translates the unit-test-driven Python benchmarks OpenAI HumanEval and MBPP into parallel problems in multiple programming languages, preserving prompts and test harnesses so that models can be evaluated with execution-based metrics. The released dataset provides per-language configurations (e.g., humaneval-<lang>, mbpp-<lang>) containing prompts, tests, doctests, stop tokens, and related metadata; the original project translated the Python benchmarks into 18 languages, and the Hugging Face distribution exposes the corresponding language-specific configs. Source code and dataset tooling are available from the NuPRL project on GitHub, and the authors published a paper describing the benchmark and methodology (arXiv:2208.08227 / IEEE TSE publication). License: MIT.
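The per-language configurations can be browsed directly with the Hugging Face `datasets` library. A minimal sketch is shown below; the repository id `nuprl/MultiPL-E` and the field names (`name`, `prompt`, `tests`, `stop_tokens`) reflect common releases and may vary between dataset versions.

```python
# Minimal sketch: inspect one MultiPL-E configuration via Hugging Face `datasets`.
# Assumes the "nuprl/MultiPL-E" repository and its "humaneval-rs" (Rust) config;
# field names may differ across dataset versions.
from datasets import load_dataset

problems = load_dataset("nuprl/MultiPL-E", "humaneval-rs", split="test")

example = problems[0]
print(example["name"])         # problem identifier
print(example["prompt"])       # docstring plus translated function signature
print(example["tests"])        # unit-test harness appended after a completion
print(example["stop_tokens"])  # strings marking the end of a completion
```

To evaluate a model, a completion is generated for each `prompt`, truncated at the first stop token, concatenated with `tests`, and executed in a sandbox for the target language.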
Pass@1 is the reported evaluation metric for MultiPL-E: the fraction of problems for which a generated solution passes all unit tests. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
| Rank | Model | Trust | Score (pass@1) | Year | Source |
|---|---|---|---|---|---|
| 01 | Qwen2.5-Plus | paper | 77 | N/A | Source ↗ |
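When more than one completion per problem is sampled, pass@1 is usually computed with the unbiased pass@k estimator of Chen et al. (2021), which execution-based benchmarks such as MultiPL-E commonly report. The sketch below illustrates the calculation; the per-problem sample counts are illustrative, not real results.

```python
# Unbiased pass@k estimator (Chen et al., 2021); pass@1 is the k=1 case.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k drawn samples passes,
    given n generated samples of which c pass all unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Per-problem (n, c): n completions sampled, c of them pass the tests.
samples = [(20, 15), (20, 0), (20, 20), (20, 7)]
score = sum(pass_at_k(n, c, 1) for n, c in samples) / len(samples)
print(f"pass@1 = {score:.3f}")  # mean over problems
```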