Codesota · Benchmark · MBPP

MBPP.

Mostly Basic Python Problems (MBPP) is a benchmark for function-level Python code generation. It consists of short, entry-level programming problems, each pairing a natural-language task description with a reference solution and automated unit tests. The public Hugging Face versions contain 974 problems, with a sanitized subset of 427 examples also available; they cover basic numeric, list, and string manipulation and common standard-library usage. MBPP was introduced to evaluate how well neural models synthesize short Python programs from natural-language prompts, in both few-shot and fine-tuning settings. Results on the dataset are commonly reported as pass@k or exact-match test metrics for code generation models. License: CC BY 4.0.
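To make the "task description plus unit tests" structure concrete, here is a minimal sketch of how a generated solution might be checked against a problem's tests. The record below is a made-up illustration, not an actual MBPP entry, and the field names `text` and `test_list` as well as the helper `passes_all_tests` are assumptions chosen for this example. Real evaluation harnesses run candidates in a sandbox with timeouts, which this sketch omits.

```python
def passes_all_tests(candidate_code: str, tests: list[str]) -> bool:
    """Return True if the candidate defines code that satisfies every test.

    Assumption: each test is a Python `assert` statement given as a string.
    WARNING: exec() runs arbitrary code; only use this on trusted input.
    """
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)      # define the candidate function
        for test in tests:
            exec(test, namespace)            # raises AssertionError on failure
    except Exception:
        return False                         # any error counts as a failed problem
    return True


# Hypothetical problem record (illustrative only, not taken from the dataset).
problem = {
    "text": "Write a function to return the sum of the even numbers in a list.",
    "test_list": [
        "assert sum_even([1, 2, 3, 4]) == 6",
        "assert sum_even([]) == 0",
    ],
}

# Two hypothetical model outputs: one correct, one that ignores the "even" condition.
good = "def sum_even(xs):\n    return sum(x for x in xs if x % 2 == 0)\n"
bad = "def sum_even(xs):\n    return sum(xs)\n"

print(passes_all_tests(good, problem["test_list"]))
print(passes_all_tests(bad, problem["test_list"]))
```

A problem counts as solved only when every assertion passes, which is why a partially correct function such as `bad` above is scored as a failure.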

§ 01 · Leaderboard

Results by metric.

Only 1 model on this benchmark

Pass@1

Pass@1 is the reported evaluation metric for MBPP. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
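For readers unfamiliar with the metric, the sketch below shows one common way pass@k is computed in code-generation evaluations: the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass all tests. Whether the score listed on this page was produced with this estimator is not stated in the source, so treat this as an illustration of the metric rather than of any specific result. The function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    Returns the probability that at least one of k samples drawn without
    replacement from the n generated samples is correct.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill k draws, so one must pass
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-problem pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# With one sample per problem (n = 1), Pass@1 reduces to the fraction solved.
single_sample_results = [(1, 1), (1, 0), (1, 1), (1, 1)]
print(benchmark_pass_at_k(single_sample_results, k=1))
```

With a single sample per problem the estimator is simply the share of problems whose sample passes every test, which is the usual reading of a Pass@1 score.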

Trust tiers for Pass@1: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

Rank | Model | Trust | Score | Year
01 | Qwen2.5-72B-Instruct (dataset: MBPP; task: 15) | paper | 88.2 | N/A
