Coding Agents
Coding agents are autonomous, AI-powered software development tools that understand natural language prompts and execute multi-step tasks to automate coding, bug fixing, and entire software workflows. They act as intelligent assistants within the software development lifecycle, capable of understanding code, generating new code, optimizing existing code, debugging, and handling tasks like documentation and feature scaffolding with minimal user supervision. Examples of coding agents include Claude Code and Cursor Agent.
Coding Agents is a key task in the General category. Below you will find the standard benchmarks used to evaluate models, along with current state-of-the-art results.
Benchmarks & SOTA
MultiPL-E
MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation
MultiPL-E is a multi-programming-language benchmark for evaluating natural-language-to-code generation by large language models. It translates unit-test-driven Python benchmarks (OpenAI HumanEval and MBPP) into parallel problems in multiple programming languages, preserving prompts and test harnesses so models can be evaluated via execution-based metrics. The released dataset provides per-language configurations (e.g., humaneval-<lang>, mbpp-<lang>) containing prompts, tests, doctests, stop tokens and related metadata; the original project translated the Python benchmarks into 18 languages (and Hugging Face distributions expose many language-specific configs). Source code and dataset tooling are available from the NuPRL project (GitHub) and the authors published a paper describing the benchmark and methodology (arXiv:2208.08227 / IEEE TSE publication). License: MIT.
State of the Art: Qwen2.5-Plus, 77 Pass@1
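The per-language configurations are distributed on the Hugging Face Hub, so a single config can be pulled with the `datasets` library. A minimal sketch, assuming the `nuprl/MultiPL-E` dataset id and the field names used in the released configs (`name`, `prompt`, `tests`, `stop_tokens`):

```python
# Minimal sketch: load one MultiPL-E language config from the Hugging Face Hub.
# The dataset id "nuprl/MultiPL-E" and the field names below are assumptions
# based on the released distribution.
from datasets import load_dataset

ds = load_dataset("nuprl/MultiPL-E", "humaneval-cpp", split="test")

example = ds[0]
print(example["name"])         # problem identifier
print(example["prompt"])       # translated signature + docstring shown to the model
print(example["tests"])        # execution-based test harness in the target language
print(example["stop_tokens"])  # stop sequences used when sampling completions
```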
MBPP
Mostly Basic Python Problems (MBPP)
Mostly Basic Python Problems (MBPP) is a benchmark for function-level Python code generation consisting of short, entry-level programming problems paired with natural language task descriptions, reference solutions, and automated unit tests. The public Hugging Face versions contain 974 problems (with a sanitized subset of 427 examples available) covering basic numeric, list, and string manipulations and common standard-library usage. MBPP was introduced to evaluate the ability of neural models to synthesize short Python programs from natural language prompts (used in few-shot and fine-tuning evaluations); the dataset is commonly used to report pass@k or exact-match test metrics for code generation models. License: CC BY 4.0.
State of the Art: Qwen2.5-72B-Instruct, 88.2 Pass@1
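MBPP (like HumanEval) is typically scored with the unbiased pass@k estimator introduced in the Codex paper: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k drawn samples is correct. A short sketch of that estimator:

```python
# Unbiased pass@k estimator (Chen et al., 2021): n samples per problem,
# c of which pass the unit tests.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples passes) from n samples with c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 20 samples per problem, 11 of them passing, reported as pass@1
print(pass_at_k(n=20, c=11, k=1))  # -> 0.55
```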
HumanEval
HumanEval is a benchmark of 164 hand-written Python programming problems introduced alongside OpenAI's Codex in "Evaluating Large Language Models Trained on Code" (arXiv:2107.03374). Each problem provides a function signature and docstring as a prompt together with unit tests; model completions are scored by executing the tests and results are reported as pass@k. A multilingual extension, HumanEval-X, consists of 820 human-crafted samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for tasks such as code generation and code translation; its data fields include task_id (indicating the target language and problem ID) and prompt (the function declaration and docstring for code generation).
State of the Art: Qwen2.5-Plus, 87.8 Pass@1
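Scoring is execution-based: a model completion is appended to the prompt and the problem's unit tests are run against it. A minimal sketch, assuming the `openai_humaneval` mirror on the Hugging Face Hub and skipping the sandboxing that the official harness applies:

```python
# Minimal sketch of an execution-based HumanEval check (no sandboxing; the
# official harness isolates execution in a restricted environment).
from datasets import load_dataset

problems = load_dataset("openai_humaneval", split="test")
problem = problems[0]

completion = "    return []  # placeholder standing in for a model completion"

# Assemble prompt + completion + unit tests, then call the test entry point.
program = (
    problem["prompt"]
    + completion + "\n"
    + problem["test"] + "\n"
    + f"check({problem['entry_point']})\n"
)

try:
    exec(program, {})        # the tests raise AssertionError on a wrong answer
    passed = True
except Exception:
    passed = False
print(passed)
```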
LiveCodeBench
LiveCodeBench is a holistic and contamination-free benchmark for evaluating Large Language Models (LLMs) for code. It continuously collects new problems from contests on platforms such as LeetCode, AtCoder, and Codeforces, and evaluates Code LLMs across several code-related scenarios, including code generation, code execution, and test output prediction.
State of the Art: Qwen2.5-72B-Instruct, 55.5 Pass@1
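The contamination control comes from problem release dates: because problems are collected continuously, evaluation can be restricted to those published after a given model's training cutoff. A minimal sketch of that filtering step, with an assumed record layout (the real release provides dated problem metadata, but the field names here are illustrative):

```python
# Minimal sketch of date-window filtering in the spirit of LiveCodeBench's
# contamination control; the Problem record and its fields are assumptions.
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    title: str
    platform: str        # e.g. "leetcode", "atcoder", "codeforces"
    contest_date: date   # date the problem was first published

def uncontaminated(problems: list[Problem], model_cutoff: date) -> list[Problem]:
    """Keep only problems released after the model's training-data cutoff."""
    return [p for p in problems if p.contest_date > model_cutoff]

pool = [
    Problem("two-pointer-window", "leetcode", date(2023, 6, 10)),
    Problem("abc-312-d", "atcoder", date(2024, 2, 3)),
]
print([p.title for p in uncontaminated(pool, model_cutoff=date(2023, 9, 1))])
# -> ['abc-312-d']
```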
SciCode
SciCode: A Research Coding Benchmark Curated by Scientists
SciCode is a scientist-curated benchmark for evaluating language models on real-world scientific programming problems. The benchmark contains a collection of research-level Python coding problems (reported as 65 main problems spanning Chemistry, Materials Science, Biology, Math, and Physics), where each main problem is further divided into sub-problems that all must be solved correctly to obtain the main result. The benchmark provides gold-standard reference implementations and datasets for verifying calculations, example problems, and a leaderboard; it supports evaluation with optional detailed "background" context to give models the relevant theory and mathematical setup. SciCode was introduced in the paper "SciCode: A Research Coding Benchmark Curated by Scientists" (arXiv:2407.13168) and is maintained on GitHub and the project website; an official Hugging Face dataset mirror is also available.
No results tracked yet
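Because a main problem only counts as solved when every one of its sub-problems is solved, the headline score is an all-or-nothing aggregation over sub-problem results. A minimal sketch of that scoring rule (the record layout is an illustrative assumption, not the official harness):

```python
# Minimal sketch of SciCode-style scoring: a main problem is solved only when
# all of its sub-problems pass. Field names are illustrative assumptions.
from collections import defaultdict

def main_problem_accuracy(subproblem_results: list[dict]) -> float:
    """subproblem_results: [{"main_id": str, "passed": bool}, ...]"""
    by_main = defaultdict(list)
    for result in subproblem_results:
        by_main[result["main_id"]].append(result["passed"])
    solved = sum(all(passes) for passes in by_main.values())
    return solved / len(by_main)

results = [
    {"main_id": "physics.1", "passed": True},
    {"main_id": "physics.1", "passed": False},  # one failing sub-problem
    {"main_id": "chem.4", "passed": True},
    {"main_id": "chem.4", "passed": True},
]
print(main_problem_accuracy(results))  # -> 0.5 (only chem.4 fully solved)
```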
CRUX-O
CRUXEval-O (CRUXEval output-prediction subset)
CRUXEval-O (also referenced as CRUX-O) is the output-prediction task/subset of CRUXEval, a benchmark for code reasoning, understanding, and execution. CRUXEval contains 800 Python functions (each 3–13 lines) sampled/generated as described in the paper; each function is paired with an input and its correct output. The CRUXEval-O task requires models to predict the correct output given a Python function and its input(s), measuring a model’s ability to reason about program execution and produce correct outputs (complementary to CRUXEval-I which targets input prediction). The benchmark is intended to evaluate execution-level reasoning beyond standard code-generation benchmarks (e.g., HumanEval, MBPP). The dataset and evaluation code are released under the MIT license.
No results tracked yet
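Checking an output prediction is straightforward: execute the function on the given input and compare against the model's predicted value. A minimal sketch with an illustrative function written in the benchmark's style (short, pure Python), not taken from the dataset:

```python
# Minimal sketch of CRUXEval-O-style output prediction. The function below is
# illustrative, not an actual benchmark item.
def f(s: str) -> str:
    parts = s.split("-")
    return "".join(p[0].upper() + p[1:] for p in parts if p)

test_input = "red-green-blue"
model_prediction = "RedGreenBlue"   # hypothetical model output

# Scoring executes the function on the input and compares exact outputs.
print(f(test_input) == model_prediction)  # -> True
```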
SWE-Bench Verified
SWE-Bench Verified is a human-validated subset of 500 samples from the SWE-bench test set. It evaluates AI models' ability to resolve real-world software issues; every problem has been verified by experts as solvable.
No results tracked yet
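Each instance pairs a real GitHub issue with the repository state to patch and the tests a fix must make pass. A minimal sketch of inspecting the split, assuming the `princeton-nlp/SWE-bench_Verified` dataset id on the Hugging Face Hub:

```python
# Minimal sketch: inspect SWE-bench Verified instances. The dataset id
# "princeton-nlp/SWE-bench_Verified" is the public Hugging Face distribution.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))                              # 500 human-validated instances

instance = ds[0]
print(instance["repo"])                     # e.g. "astropy/astropy"
print(instance["base_commit"])              # repository state the agent starts from
print(instance["problem_statement"][:200])  # GitHub issue text to resolve
```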
Related Tasks
General
World Models
World models are internal, learned representations in AI that function like a "computational snow globe," allowing an agent to understand its environment, predict future states, and simulate the outcomes of actions before acting in the real world. They are essential for building sophisticated AI systems that can reason, make decisions, and interact with complex environments by simulating dynamics like physics, motion, and spatial relationships.
Omni models
Omni models are AI models that take multiple modalities (language, vision, audio) as input and produce multiple modalities as output. Some examples of the first omni models include [Qwen2.5 Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) and [BAGEL](https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT).
Video-Language Models
Video Language Models (Video LLMs) are advanced AI systems that combine large language models with video processing capabilities to understand and generate descriptive content from videos. They bridge the gap between visual and textual information by using encoders to convert video data into a format that a text-based LLM can process, enabling tasks such as video analysis, content generation, and question answering about video content.