Coding Agents
Coding agents are autonomous, AI-powered software development tools that understand natural language prompts and execute multi-step tasks to automate coding, bug fixing, and entire software workflows. They act as intelligent assistants within the software development lifecycle, capable of understanding code, generating new code, optimizing existing code, debugging, and handling tasks like documentation and feature scaffolding with minimal user supervision. Examples of coding agents include Claude Code and Cursor Agent.
Coding Agents is a key task in the General category. Below you will find the standard benchmarks used to evaluate models, along with current state-of-the-art results.
Benchmarks & SOTA
MultiPL-E
MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation
MultiPL-E is a multi-programming-language benchmark for evaluating natural-language-to-code generation by large language models. It translates unit-test-driven Python benchmarks (OpenAI HumanEval and MBPP) into parallel problems in multiple programming languages, preserving prompts and test harnesses so models can be evaluated via execution-based metrics. The released dataset provides per-language configurations (e.g., humaneval-<lang>, mbpp-<lang>) containing prompts, tests, doctests, stop tokens and related metadata; the original project translated the Python benchmarks into 18 languages (and Hugging Face distributions expose many language-specific configs). Source code and dataset tooling are available from the NuPRL project (GitHub) and the authors published a paper describing the benchmark and methodology (arXiv:2208.08227 / IEEE TSE publication). License: MIT.
State of the Art: Qwen2.5-Plus, 77 Pass@1
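The per-language configurations are distributed on the Hugging Face Hub, so a single config can be pulled with the `datasets` library. A minimal sketch, assuming the `nuprl/MultiPL-E` dataset id and the field names used in the released configs (`name`, `prompt`, `tests`, `stop_tokens`):

```python
# Minimal sketch: load one MultiPL-E language config from the Hugging Face Hub.
# The dataset id "nuprl/MultiPL-E" and the field names below are assumptions
# based on the released distribution.
from datasets import load_dataset

ds = load_dataset("nuprl/MultiPL-E", "humaneval-cpp", split="test")

example = ds[0]
print(example["name"])         # problem identifier
print(example["prompt"])       # translated signature + docstring shown to the model
print(example["tests"])        # execution-based test harness in the target language
print(example["stop_tokens"])  # stop sequences used when sampling completions
```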
MBPP
Mostly Basic Python Problems (MBPP)
Mostly Basic Python Problems (MBPP) is a benchmark for function-level Python code generation consisting of short, entry-level programming problems paired with natural language task descriptions, reference solutions, and automated unit tests. The public Hugging Face versions contain 974 problems (with a sanitized subset of 427 examples available) covering basic numeric, list, and string manipulations and common standard-library usage. MBPP was introduced to evaluate the ability of neural models to synthesize short Python programs from natural language prompts (used in few-shot and fine-tuning evaluations); the dataset is commonly used to report pass@k or exact-match test metrics for code generation models. License: CC BY 4.0.
State of the Art: Qwen2.5-72B-Instruct, 88.2 Pass@1
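MBPP (like HumanEval) is typically scored with the unbiased pass@k estimator introduced in the Codex paper: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k drawn samples is correct. A short sketch of that estimator:

```python
# Unbiased pass@k estimator (Chen et al., 2021): n samples per problem,
# c of which pass the unit tests.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples passes) from n samples with c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 20 samples per problem, 11 of them passing, reported as pass@1
print(pass_at_k(n=20, c=11, k=1))  # -> 0.55
```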
HumanEval
HumanEval is a benchmark of 164 hand-written Python programming problems introduced alongside OpenAI's Codex in "Evaluating Large Language Models Trained on Code" (arXiv:2107.03374). Each problem provides a function signature and docstring as a prompt together with unit tests; model completions are scored by executing the tests and results are reported as pass@k. A multilingual extension, HumanEval-X, consists of 820 human-crafted samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for tasks such as code generation and code translation; its data fields include task_id (indicating the target language and problem ID) and prompt (the function declaration and docstring for code generation).
State of the Art: Qwen2.5-Plus, 87.8 Pass@1
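Scoring is execution-based: a model completion is appended to the prompt and the problem's unit tests are run against it. A minimal sketch, assuming the `openai_humaneval` mirror on the Hugging Face Hub and skipping the sandboxing that the official harness applies:

```python
# Minimal sketch of an execution-based HumanEval check (no sandboxing; the
# official harness isolates execution in a restricted environment).
from datasets import load_dataset

problems = load_dataset("openai_humaneval", split="test")
problem = problems[0]

completion = "    return []  # placeholder standing in for a model completion"

# Assemble prompt + completion + unit tests, then call the test entry point.
program = (
    problem["prompt"]
    + completion + "\n"
    + problem["test"] + "\n"
    + f"check({problem['entry_point']})\n"
)

try:
    exec(program, {})        # the tests raise AssertionError on a wrong answer
    passed = True
except Exception:
    passed = False
print(passed)
```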
LiveCodeBench
LiveCodeBench is a holistic and contamination-free benchmark for evaluating Large Language Models (LLMs) for code. It continuously collects new problems from contests on platforms such as LeetCode, AtCoder, and Codeforces, and evaluates Code LLMs across several code-related scenarios, including code generation, code execution, and test output prediction.
State of the Art: Qwen2.5-72B-Instruct, 55.5 Pass@1
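The contamination control comes from problem release dates: because problems are collected continuously, evaluation can be restricted to those published after a given model's training cutoff. A minimal sketch of that filtering step, with an assumed record layout (the real release provides dated problem metadata, but the field names here are illustrative):

```python
# Minimal sketch of date-window filtering in the spirit of LiveCodeBench's
# contamination control; the Problem record and its fields are assumptions.
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    title: str
    platform: str        # e.g. "leetcode", "atcoder", "codeforces"
    contest_date: date   # date the problem was first published

def uncontaminated(problems: list[Problem], model_cutoff: date) -> list[Problem]:
    """Keep only problems released after the model's training-data cutoff."""
    return [p for p in problems if p.contest_date > model_cutoff]

pool = [
    Problem("two-pointer-window", "leetcode", date(2023, 6, 10)),
    Problem("abc-312-d", "atcoder", date(2024, 2, 3)),
]
print([p.title for p in uncontaminated(pool, model_cutoff=date(2023, 9, 1))])
# -> ['abc-312-d']
```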
SciCode
SciCode: A Research Coding Benchmark Curated by Scientists
SciCode is a scientist-curated benchmark for evaluating language models on real-world scientific programming problems. The benchmark contains a collection of research-level Python coding problems (reported as 65 main problems spanning Chemistry, Materials Science, Biology, Math, and Physics), where each main problem is further divided into sub-problems that all must be solved correctly to obtain the main result. The benchmark provides gold-standard reference implementations and datasets for verifying calculations, example problems, and a leaderboard; it supports evaluation with optional detailed "background" context to give models the relevant theory and mathematical setup. SciCode was introduced in the paper "SciCode: A Research Coding Benchmark Curated by Scientists" (arXiv:2407.13168) and is maintained on GitHub and the project website; an official Hugging Face dataset mirror is also available.
No results tracked yet
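Because a main problem only counts as solved when every one of its sub-problems is solved, the headline score is an all-or-nothing aggregation over sub-problem results. A minimal sketch of that scoring rule (the record layout is an illustrative assumption, not the official harness):

```python
# Minimal sketch of SciCode-style scoring: a main problem is solved only when
# all of its sub-problems pass. Field names are illustrative assumptions.
from collections import defaultdict

def main_problem_accuracy(subproblem_results: list[dict]) -> float:
    """subproblem_results: [{"main_id": str, "passed": bool}, ...]"""
    by_main = defaultdict(list)
    for result in subproblem_results:
        by_main[result["main_id"]].append(result["passed"])
    solved = sum(all(passes) for passes in by_main.values())
    return solved / len(by_main)

results = [
    {"main_id": "physics.1", "passed": True},
    {"main_id": "physics.1", "passed": False},  # one failing sub-problem
    {"main_id": "chem.4", "passed": True},
    {"main_id": "chem.4", "passed": True},
]
print(main_problem_accuracy(results))  # -> 0.5 (only chem.4 fully solved)
```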
CRUX-O
CRUXEval-O (CRUXEval output-prediction subset)
CRUXEval-O (also referenced as CRUX-O) is the output-prediction task/subset of CRUXEval, a benchmark for code reasoning, understanding, and execution. CRUXEval contains 800 Python functions (each 3–13 lines) sampled/generated as described in the paper; each function is paired with an input and its correct output. The CRUXEval-O task requires models to predict the correct output given a Python function and its input(s), measuring a model’s ability to reason about program execution and produce correct outputs (complementary to CRUXEval-I which targets input prediction). The benchmark is intended to evaluate execution-level reasoning beyond standard code-generation benchmarks (e.g., HumanEval, MBPP). The dataset and evaluation code are released under the MIT license.
No results tracked yet
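Checking an output prediction is straightforward: execute the function on the given input and compare against the model's predicted value. A minimal sketch with an illustrative function written in the benchmark's style (short, pure Python), not taken from the dataset:

```python
# Minimal sketch of CRUXEval-O-style output prediction. The function below is
# illustrative, not an actual benchmark item.
def f(s: str) -> str:
    parts = s.split("-")
    return "".join(p[0].upper() + p[1:] for p in parts if p)

test_input = "red-green-blue"
model_prediction = "RedGreenBlue"   # hypothetical model output

# Scoring executes the function on the input and compares exact outputs.
print(f(test_input) == model_prediction)  # -> True
```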
SWE-Bench Verified
SWE-Bench Verified is a human-validated subset of 500 samples from the SWE-bench test set. It evaluates AI models' ability to resolve real-world software issues; every problem has been verified by experts as solvable.
No results tracked yet
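Each instance pairs a real GitHub issue with the repository state to patch and the tests a fix must make pass. A minimal sketch of inspecting the split, assuming the `princeton-nlp/SWE-bench_Verified` dataset id on the Hugging Face Hub:

```python
# Minimal sketch: inspect SWE-bench Verified instances. The dataset id
# "princeton-nlp/SWE-bench_Verified" is the public Hugging Face distribution.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))                              # 500 human-validated instances

instance = ds[0]
print(instance["repo"])                     # e.g. "astropy/astropy"
print(instance["base_commit"])              # repository state the agent starts from
print(instance["problem_statement"][:200])  # GitHub issue text to resolve
```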
Related Tasks
General
World Models
World models are internal, learned representations in AI that function like a "computational snow globe," allowing an agent to understand its environment, predict future states, and simulate the outcomes of actions before acting in the real world. They are essential for building sophisticated AI systems that can reason, make decisions, and interact with complex environments by simulating dynamics like physics, motion, and spatial relationships.
Omni models
Omni models are AI models that take multiple modalities (language, vision, audio) as input and produce multiple modalities as output. Some examples of the first omni models include [Qwen2.5 Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) and [BAGEL](https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT).
Video-Language Models
Video Language Models (Video LLMs) are advanced AI systems that combine large language models with video processing capabilities to understand and generate descriptive content from videos. They bridge the gap between visual and textual information by using encoders to convert video data into a format that a text-based LLM can process, enabling tasks such as video analysis, content generation, and question answering about video content.