Computer Vision

Image editing

Image editing is the process of altering images, whether digital or traditional, to improve their quality, appearance, or function. It ranges from simple operations such as cropping and color correction to complex techniques such as layering, retouching to remove blemishes, and compositing new images. The goal is to make images more aesthetically pleasing, correct flaws, or achieve a desired artistic effect.
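
The simple operations mentioned above are one-liners in common imaging libraries. A minimal sketch using Pillow (file names and crop coordinates are placeholders):

```python
from PIL import Image, ImageEnhance

img = Image.open("photo.jpg")  # placeholder input path

# Crop: keep the region given as (left, upper, right, lower) in pixels.
cropped = img.crop((100, 50, 900, 650))

# Simple color correction: nudge saturation and contrast.
cropped = ImageEnhance.Color(cropped).enhance(1.2)     # +20% saturation
cropped = ImageEnhance.Contrast(cropped).enhance(1.1)  # +10% contrast

cropped.save("photo_edited.jpg")
```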


Image editing is a key task in computer vision. Below you will find the standard benchmarks used to evaluate models, along with current state-of-the-art results.

Benchmarks & SOTA

PICABench

PICABench: A Comprehensive Benchmark for Physically Realistic Image Editing


PICABench is a comprehensive, fine-grained benchmark for physically realistic image editing. It evaluates physical realism beyond semantic fidelity, categorizing physical consistency into three dimensions: Optics (light propagation, reflection, refraction), Mechanics (deformation and causality), and State Transition (global and local state changes).
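
For illustration only, that taxonomy could be encoded as a simple lookup; the sub-phenomenon names below paraphrase the summary above and are not the benchmark's official category identifiers.

```python
# PICABench's three physical-realism dimensions, sketched as a lookup.
# Sub-phenomena paraphrase the summary above, not official identifiers.
PICA_DIMENSIONS = {
    "Optics": ["light propagation", "reflection", "refraction"],
    "Mechanics": ["deformation", "causality"],
    "State Transition": ["global state change", "local state change"],
}

def dimension_of(phenomenon: str) -> str | None:
    """Map a physical phenomenon to its PICABench dimension, if known."""
    for dim, items in PICA_DIMENSIONS.items():
        if phenomenon in items:
            return dim
    return None
```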

No results tracked yet

ImgEdit

ImgEdit: A Unified Image Editing Dataset and Benchmark


ImgEdit is a large-scale, high-quality image-editing dataset and benchmark introduced to close the performance gap of open-source image-editing models. The dataset contains approximately 1.2 million curated edit pairs covering novel and complex single-turn edits as well as challenging multi-turn editing tasks. The data were produced via a multi-stage pipeline that combines a vision-language model, object detection, segmentation, task-specific in-painting, and strict post-processing to ensure quality. The release also includes ImgEdit-Bench, a benchmark suite that evaluates instruction adherence, editing quality, and detail preservation across basic, challenging single-turn, and multi-turn test suites, and the authors train an editing model, ImgEdit-E1, that demonstrates gains over prior open-source editors. Code and data pointers are provided in the project repository.
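
To make the pipeline stages concrete, here is a structural sketch; every function and interface below is a hypothetical placeholder, not the authors' actual implementation.

```python
def passes_quality_checks(source, edited, instruction) -> bool:
    """Placeholder for the strict post-processing filters (e.g. automatic
    quality scoring, artifact checks); always accepts in this sketch."""
    return True

def build_edit_pair(image, instruction_seed, vlm, detector, segmenter, inpainter):
    # 1. A vision-language model drafts/refines the editing instruction.
    instruction = vlm.caption_and_instruct(image, instruction_seed)
    # 2. Object detection localizes the region the instruction refers to.
    boxes = detector.detect(image, instruction)
    # 3. Segmentation turns the boxes into precise edit masks.
    masks = segmenter.segment(image, boxes)
    # 4. Task-specific in-painting produces the edited target image.
    edited = inpainter.inpaint(image, masks, instruction)
    # 5. Strict post-processing filters out low-quality pairs.
    if passes_quality_checks(image, edited, instruction):
        return {"source": image, "target": edited, "instruction": instruction}
    return None
```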

No results tracked yet

GEdit-Bench



GEdit-Bench is a real-world image-editing evaluation benchmark released by the StepFun / Step1X-Edit team to assess image-editing models on authentic user instructions. The Hugging Face dataset contains ~1.21k examples (a single train split) of image and editing-instruction pairs with metadata. The schema includes fields such as task_type (11 edit categories), key, instruction, instruction_language (en/zh), input_image / input_image_raw, and Intersection_exist. The benchmark was designed for automatic, LLM-based evaluation: the Step1X-Edit paper and project page report model scores computed by GPT-4.1, with comparisons to other graders such as Qwen2.5-VL. The dataset is hosted on Hugging Face (MIT license) and was introduced alongside the Step1X-Edit paper (arXiv:2504.17761).
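
Reading the benchmark with the Hugging Face datasets library might look like the sketch below; the Hub dataset ID is assumed from the StepFun naming and should be verified on the Hub.

```python
from datasets import load_dataset

# Assumed Hub ID; check the StepFun organization page for the exact path.
ds = load_dataset("stepfun-ai/GEdit-Bench", split="train")

# Keep only English instructions, e.g. to feed an English-only grader.
en = ds.filter(lambda ex: ex["instruction_language"] == "en")

for ex in en.select(range(3)):
    print(ex["task_type"], "|", ex["instruction"])
    img = ex["input_image"]  # decoded to a PIL image by the Image feature
```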

No results tracked yet

KRIS-Bench

KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark)


KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark) is a diagnostic benchmark for instruction-driven image editing that focuses on models' knowledge-based reasoning rather than visual fidelity alone. It organizes editing tasks into a cognitively informed taxonomy of knowledge types (Factual, Conceptual, Procedural) and defines 22 representative editing tasks designed to probe different forms of knowledge reasoning. KRIS-Bench provides per-task sub-metrics and a composite Knowledge Plausibility metric; the authors also report an overall score and several sub-scores, evaluated automatically with a large multimodal model as grader (GPT-4o in the paper). The dataset is released in Parquet format and combines image inputs with natural-language editing instructions, knowledge-based explanations, and ground-truth edited images. Typical fields include category (task category), id (sample id), instruction (editing instruction text), explanation (knowledge-based explanation), image (input image), and gt_image (ground-truth edited image). The publicly available dataset on Hugging Face contains ~1.27k samples (single split) and is released under a permissive license (replicas on the Hub list CC-BY-4.0 / Apache-2.0 variants).
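
Since the release is plain Parquet, a quick inspection with pandas could look like this (the file path is a placeholder; field names follow the description above):

```python
import pandas as pd

# Placeholder path: the benchmark ships as Parquet files.
df = pd.read_parquet("kris_bench.parquet")

# Distribution of samples over the task categories.
print(df["category"].value_counts())

sample = df.iloc[0]
print(sample["instruction"])  # editing instruction text
print(sample["explanation"])  # knowledge-based explanation
# sample["image"] and sample["gt_image"] hold the input and ground-truth
# edited images (typically stored as encoded bytes in the Parquet files).
```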

No results tracked yet

RISEBench

RISEBench (Reasoning-Informed viSual Editing Benchmark)


RISEBench (Reasoning-Informed viSual Editing Benchmark) is a benchmark and dataset for evaluating multimodal models on instruction-driven image editing tasks that require deeper reasoning. Introduced in the paper "Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing" (arXiv:2504.02826), RISEBench focuses on four reasoning categories (Temporal, Causal, Spatial, and Logical) and provides 360 high-quality, human-expert curated test cases covering them. The benchmark pairs input images with complex editing instructions that require understanding scene context and reasoning beyond low-level appearance changes. The authors propose an evaluation framework measuring Instruction Reasoning, Appearance Consistency, and Visual Plausibility, scored by both human judges and an "LMM-as-a-judge" protocol, and they evaluate a range of open-source and proprietary LMMs (reporting results for systems such as GPT-4o / GPT-4o-Image). The official GitHub repository (PhoenixZ810/RISEBench) and Hugging Face dataset release (PhoenixZ/RISEBench) include the data, evaluation scripts, and example runs.
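
A hedged sketch of what an "LMM-as-a-judge" scoring loop over the three axes might look like; the judge interface and prompt below are placeholders, and the official evaluation scripts live in the project repository.

```python
import json

# Scoring axes from the evaluation framework described above.
AXES = ("instruction_reasoning", "appearance_consistency", "visual_plausibility")

JUDGE_PROMPT = (
    "Given the input image, the instruction, and the edited output, rate "
    "each axis from 1 (worst) to 5 (best): {axes}. Reply as JSON only."
)

def judge_case(lmm, case):
    """Score one test case. `lmm` is any multimodal client exposing a
    query(images, prompt) -> str method (an assumed interface)."""
    reply = lmm.query(
        images=[case["input_image"], case["edited_image"]],
        prompt=JUDGE_PROMPT.format(axes=", ".join(AXES)),
    )
    return json.loads(reply)

def aggregate(scores):
    """Mean score per axis across all judged cases."""
    return {a: sum(s[a] for s in scores) / len(scores) for a in AXES}
```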

No results tracked yet


