Codesota · Benchmark · MultiPL-E

MultiPL-E.

MultiPL-E is a multi-programming-language benchmark for evaluating natural-language-to-code generation by large language models. It translates the unit-test-driven Python benchmarks OpenAI HumanEval and MBPP into parallel problems in many programming languages, preserving prompts and test harnesses so that models can be evaluated with execution-based metrics. The released dataset provides per-language configurations (e.g., humaneval-<lang>, mbpp-<lang>) containing prompts, tests, doctests, stop tokens, and related metadata; the original project translated the Python benchmarks into 18 languages, and the Hugging Face distribution exposes a config per language. Source code and dataset tooling are available from the NuPRL project on GitHub, and the authors describe the benchmark and translation methodology in a paper (arXiv:2208.08227; also published in IEEE TSE). License: MIT.
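To make the per-language configurations concrete, here is a minimal sketch of loading one of them with the Hugging Face datasets library. The nuprl/MultiPL-E dataset name and the humaneval-<lang> config pattern come from the description above; the specific fields accessed ("name", "prompt", "stop_tokens", "tests") are assumptions based on the dataset card and may vary between releases.

```python
# Minimal sketch: load one per-language MultiPL-E config with the
# Hugging Face `datasets` library. Field names below are assumptions
# from the dataset card and may differ between releases.
from datasets import load_dataset

# "humaneval-lua" = HumanEval problems translated to Lua.
problems = load_dataset("nuprl/MultiPL-E", "humaneval-lua", split="test")

for problem in problems.select(range(3)):
    print(problem["name"])         # problem identifier
    print(problem["prompt"])       # function signature + docstring in Lua
    print(problem["stop_tokens"])  # strings that terminate generation
    print(problem["tests"])        # executable unit-test harness
```

Because every config shares this schema, the same generation-and-execution loop can be reused across languages by swapping the config name.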

§ 01 · SOTA history

Year over year.

Not enough data to show trend.
§ 02 · Leaderboard

Results by metric.

Only one model is currently listed on this benchmark.
Help build the community leaderboard by submitting your model results.

Pass@1

Pass@1 is the reported evaluation metric for MultiPL-E: the fraction of problems for which a single generated sample passes the problem's unit tests. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families (see the estimator sketch below).

Higher is better
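
In practice, pass@1 is usually computed with the unbiased pass@k estimator of Chen et al. (2021): generate n samples per problem, count the c samples that pass the tests, estimate pass@k = 1 - C(n-c, k) / C(n, k) per problem, and average over problems. A minimal sketch, assuming per-problem (n, c) counts are already available from an execution harness:

```python
# Minimal sketch of the standard unbiased pass@k estimator
# (Chen et al., 2021), commonly used for execution-based benchmarks
# like MultiPL-E. Inputs are assumptions: (n, c) pairs, where
# n = samples generated per problem and c = samples that passed.
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn (without
    replacement) from n passes, given c of the n samples pass."""
    if n - c < k:
        return 1.0  # every size-k subset contains a passing sample
    # 1 - C(n-c, k) / C(n, k), computed stably as a running product
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Example: 3 problems, 20 samples each, with 5, 0, and 20 passing.
results = [(20, 5), (20, 0), (20, 20)]
score = sum(pass_at_k(n, c, k=1) for n, c in results) / len(results)
print(f"pass@1 = {score:.3f}")  # (5/20 + 0/20 + 20/20) / 3 ≈ 0.417
```

For k = 1 the estimator reduces to c / n, so pass@1 is simply the average per-problem pass rate; the product form matters for the larger k values some papers also report.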

Trust tiers for Pass@1: verified · paper · vendor · community · unverified
Rank | Model | Trust | Score | Year | Source
01 | Qwen2.5-Plus (dataset: MultiPL-E; task: 15) | paper | 77 | N/A | Source ↗
§ 04 · Submit a result

Add to the leaderboard.
