Computer Vision

Image generation

AI image generation uses generative models to create new visual content from text prompts or existing images by learning patterns from massive datasets of images and text. These trained models, typically neural networks, produce novel images that are statistically likely to match the provided prompt, mimicking the styles, shapes, and colors they have learned. Examples of such models include Midjourney, Stable Diffusion, and DALL-E.

Image generation is a key task in computer vision. Below you will find the standard benchmarks used to evaluate models, along with current state-of-the-art results.

Benchmarks & SOTA

GenEval

GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment

GenEval is an object-focused evaluation framework for text-to-image alignment that enables fine-grained, instance-level evaluation of compositional generation. Instead of holistic metrics like FID or CLIPScore, GenEval evaluates object-level properties such as object co-occurrence, spatial relations/position, object count, and attribute binding (e.g., color). The framework leverages off-the-shelf object detectors and other discriminative vision models to build automated, verifiable evaluators that correlate well with human judgments. The authors provide code, evaluation scripts, and benchmark prompts/tasks (repository: https://github.com/djghosh13/geneval, MIT license) to run GenEval evaluators against text-to-image models and to report per-task scores for multi-object composition and related evaluations.
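
To illustrate the detector-based approach, here is a minimal sketch of a GenEval-style object-count check, not the official pipeline: it uses torchvision's pretrained Faster R-CNN as a stand-in detector, and the 0.5 confidence threshold and the example prompt are assumptions; the released repository implements the actual evaluators and per-task scoring.

```python
# Minimal sketch of a GenEval-style check (not the official pipeline):
# verify that a generated image contains the object count the prompt asked
# for, using an off-the-shelf detector. Threshold and example are assumptions.
import torch
from PIL import Image
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn,
    FasterRCNN_ResNet50_FPN_Weights,
)
from torchvision.transforms.functional import to_tensor

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
detector = fasterrcnn_resnet50_fpn(weights=weights).eval()
categories = weights.meta["categories"]  # COCO class names

def count_objects(image_path: str, class_name: str, score_thresh: float = 0.5) -> int:
    """Count detections of `class_name` above the confidence threshold."""
    img = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        pred = detector([img])[0]
    return sum(
        1
        for label, score in zip(pred["labels"].tolist(), pred["scores"].tolist())
        if categories[label] == class_name and score >= score_thresh
    )

# e.g. prompt "a photo of three dogs": pass if exactly three dogs are detected
passed = count_objects("generated.png", "dog") == 3
```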

No results tracked yet

ImageNet 256x256

ImageNet (ILSVRC2012) 256x256

ImageNet 256x256 is a commonly used resized variant of the ImageNet ILSVRC2012 (ImageNet-1k) image classification dataset in which images have been resized and center-cropped to 256x256 pixels. The underlying dataset (ILSVRC2012) contains 1.28M training images, 50K validation images and 100K test images across 1000 classes. The 256x256 variant is provided as a convenience for faster downloads and for workflows that take random crops (e.g., 224x224 crops from 256). It is widely used in image-generation and generative-model evaluation papers, for example for computing FID and Inception Score over 50K generated images (50 samples per class for the 1000 classes). The dataset is not a new collection but a transformed, resized version of the standard ImageNet (ILSVRC2012) split. Representative Hugging Face repacks include evanarlian/imagenet_1k_resized_256 and benjamin-paine/imagenet-1k-256x256, which describe the exact resize/center-crop procedures and point to the original ImageNet/ILSVRC references.
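
As a rough illustration, the typical preprocessing behind such a variant looks like the following torchvision sketch (resize the shorter side to 256, then center-crop); the exact procedure used by each repack is documented on its dataset card.

```python
# Sketch of the usual "ImageNet 256x256" preprocessing: resize the shorter
# side to 256 and center-crop to 256x256. Training pipelines then often take
# random 224x224 crops. The repacks linked above document their exact steps.
from torchvision import transforms

to_256 = transforms.Compose([
    transforms.Resize(256),      # shorter side -> 256, aspect ratio preserved
    transforms.CenterCrop(256),  # final 256x256 image
])

train_aug = transforms.Compose([
    transforms.RandomCrop(224),  # e.g. 224x224 random crops for training
    transforms.ToTensor(),
])
```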

No results tracked yet

ImageNet 1024x1024

ImageNet (ILSVRC2012) — 1024×1024 resized variant

ImageNet (ILSVRC2012) is a large-scale image classification dataset organized according to the WordNet hierarchy. The standard ImageNet-1k (ILSVRC) split contains 1,000 classes with roughly 1.2–1.3M training images and 50K validation images. In the image generation literature, “ImageNet 1024x1024” typically denotes the ImageNet-1k images resized (or center-cropped/resampled) to 1024×1024 resolution for high-resolution synthesis and evaluation. Common evaluation practice (used by many generative-model papers) is to generate 50,000 images (50 samples for each of the 1000 classes) and compute FID against the ImageNet training set at the target resolution.
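
A minimal sketch of that 50-per-class sampling protocol is below; `sample_fn` stands in for whatever class-conditional sampler a given model exposes (a hypothetical interface, not a specific library API).

```python
# Sketch of the common 50K-sample protocol: 50 images for each of the 1000
# classes at the target resolution. `sample_fn` is a hypothetical stand-in for
# a model's class-conditional sampler; FID is then computed between these
# samples and ImageNet reference images at the same resolution.
from pathlib import Path
from typing import Callable
from PIL import Image

def generate_eval_set(
    sample_fn: Callable[[int, int], Image.Image],
    out_dir: str = "samples_1024",
    num_classes: int = 1000,
    per_class: int = 50,
    resolution: int = 1024,
) -> None:
    root = Path(out_dir)
    root.mkdir(parents=True, exist_ok=True)
    for class_id in range(num_classes):
        for i in range(per_class):
            img = sample_fn(class_id, resolution)
            img.save(root / f"{class_id:04d}_{i:02d}.png")
```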

No results tracked yet

ImageNet 512x512

ImageNet (ILSVRC2012 / ImageNet-1k)

“ImageNet 512x512” refers to the ImageNet ILSVRC-2012 (ImageNet-1k) image classification dataset commonly used for research, where the original images are redistributed or preprocessed (resized / center-cropped) to 512×512 resolution for training or evaluation of generative and conditional models. ImageNet ILSVRC-2012 contains 1,281,167 training images across 1000 classes and 50,000 validation images (50 images per class), which is why many generative-model papers report FID / Inception Score using 50,000 generated samples (often 50 samples per class) and compute scores against the ImageNet training/validation sets. Core references: the dataset introduction (Deng et al., CVPR 2009) and the ImageNet Large Scale Visual Recognition Challenge paper (Russakovsky et al., arXiv:1409.0575). Official dataset info and download/size counts are listed on the ImageNet site; a Hugging Face repack of the ILSVRC/imagenet-1k dataset is available at the linked HF repo.
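
For reference, a minimal FID sketch using torchmetrics is shown below; this is one common implementation choice, and the reference statistics (training vs. validation images, and which Inception weights) vary between papers.

```python
# Minimal FID sketch with torchmetrics (one common choice; many papers use
# the ADM or clean-fid reference statistics instead). Both arguments are
# iterables of uint8 tensors shaped (N, 3, 512, 512).
from torchmetrics.image.fid import FrechetInceptionDistance

def compute_fid(real_batches, generated_batches) -> float:
    fid = FrechetInceptionDistance(feature=2048)
    for batch in real_batches:
        fid.update(batch, real=True)       # ImageNet reference images
    for batch in generated_batches:
        fid.update(batch, real=False)      # e.g. 50,000 generated samples
    return float(fid.compute())
```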

No results tracked yet

LongText-Bench

LongText-Bench

LongText-Bench is a small benchmark dataset for evaluating models’ ability to render long textual content in generated images. Released by the X-Omni team, it provides English and Chinese tracks and is intended for text-to-image evaluation focused on longer textual content (paragraphs, multi-word strings) and different content categories. The dataset on Hugging Face contains a single split (train) with 320 examples and fields such as category (8 classes), length (short/long), prompt (text prompt), text (the target textual content to render), text_length, and prompt_id. Metadata on the HF page lists the dataset language as English and Chinese, license Apache-2.0, and tags it with task_categories:text-to-image.
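
A minimal loading sketch with the Hugging Face datasets library is below; the exact Hub repository id used here is an assumption (the dataset page gives the canonical name), while the field names follow the description above.

```python
# Loading sketch with the `datasets` library. The repo id
# "X-Omni/LongText-Bench" is an assumption; check the dataset page for the
# canonical name. Field names follow the dataset card: category, length,
# prompt, text, text_length, prompt_id.
from datasets import load_dataset

ds = load_dataset("X-Omni/LongText-Bench", split="train")  # 320 examples

for example in ds.select(range(3)):
    print(example["category"], example["length"], example["prompt"][:80])
```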

No results tracked yet

ICE-Bench (Task1-31 Overall)

ICE-Bench: A Unified and Comprehensive Benchmark for Image Creating and Editing

ICE-Bench (ICE = Image Creating and Editing) is a unified, multi-task benchmark for evaluating image generation and image editing models. Introduced in the paper “ICE-Bench: A Unified and Comprehensive Benchmark for Image Creating and Editing” (arXiv:2503.14482), it decomposes image creation/editing into four coarse categories (no-reference / reference × creating / editing) and further into 31 fine-grained tasks (Task 1–31). The benchmark uses a multi-dimensional evaluation protocol spanning 6 evaluation dimensions and 11 automatic metrics that measure imaging quality, prompt following, source consistency, reference consistency, controllability, and aesthetics. The authors provide benchmark code to compute per-task scores and an overall “Task1-31” aggregate score; the dataset and automated evaluation code are released (MIT license) on Hugging Face (ali-vilab/ICE-Bench).
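
As a sketch of the aggregate, the snippet below averages per-task scores into an overall Task1-31 number, assuming an unweighted mean; the released evaluation code defines the exact per-metric and per-task aggregation.

```python
# Sketch of aggregating per-task results into an overall "Task1-31" score,
# assuming a simple unweighted mean over the 31 tasks; the released ICE-Bench
# code defines the exact aggregation across its 6 dimensions / 11 metrics.
from statistics import mean

def overall_score(per_task_scores: dict[int, float]) -> float:
    """per_task_scores maps task id (1..31) to that task's score."""
    missing = set(range(1, 32)) - per_task_scores.keys()
    if missing:
        raise ValueError(f"missing task scores: {sorted(missing)}")
    return mean(per_task_scores[t] for t in range(1, 32))
```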

No results tracked yet

OneIG-ZH

OneIG-Bench — Chinese track (OneIG-Bench-ZH / OneIG-ZH)

OneIG-Bench — Chinese track (commonly referenced as OneIG-ZH or OneIG-Bench-ZH) is the Chinese subset of OneIG-Bench, an omni-dimensional benchmark for evaluating text-to-image (T2I) models. OneIG-Bench was designed to provide fine-grained evaluation across multiple dimensions including subject-element alignment, text rendering precision, reasoning, stylization, and diversity. The Chinese track uses Chinese prompts (same benchmark design as the English track / OneIG-EN) and reports per-dimension scores as well as an overall score to assess T2I model performance on Chinese-language prompts (the Hugging Face dataset lists the OneIG-Bench-ZH subset at ~1.32k rows). License: CC BY-NC 4.0. Sources: OneIG-Bench paper and Hugging Face dataset page.

No results tracked yet

CVTG-2K

CVTG-2K

CVTG-2K is a benchmark for Complex Visual Text Generation (CVTG) containing 2,000 prompts designed to evaluate text rendering in generated images. According to the dataset card on Hugging Face and the TextCrafter (arXiv:2503.23461) paper that introduces it, prompts were generated via OpenAI's O1-mini API (using chain-of-thought techniques) and cover diverse scenes such as street views, advertisements, and book covers. The dataset emphasizes longer visual texts (mean ~8.10 words / ~39.47 characters) and multiple text regions (2–5 regions per prompt). About half the prompts include stylistic attributes (size, color, font). CVTG-2K provides fine-grained, decoupled prompt structures and carrier words to express text–position relationships, making it suitable for evaluating multi-region text rendering and stylization in text-conditioned image generation. Evaluation metrics reported for CVTG tasks include Word Accuracy and Normalized Edit Distance (NED). (Sources: Hugging Face dataset card for dnkdnk/CVTG-2K and TextCrafter paper, arXiv:2503.23461.)
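
A minimal sketch of those two metrics, as they are commonly defined, is given below; the TextCrafter paper specifies the exact OCR model, text normalization, and whether NED is reported as a distance or as 1 minus the distance.

```python
# Sketch of the two text-rendering metrics named above, as commonly defined:
# Word Accuracy = fraction of target words recovered by OCR, and Normalized
# Edit Distance = Levenshtein distance / max string length (lower is better;
# some papers report 1 minus this value). OCR text is assumed to come from an
# external OCR model applied to the generated image.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def word_accuracy(ocr_text: str, target_text: str) -> float:
    target_words = target_text.lower().split()
    ocr_words = set(ocr_text.lower().split())
    return sum(w in ocr_words for w in target_words) / max(len(target_words), 1)

def normalized_edit_distance(ocr_text: str, target_text: str) -> float:
    return levenshtein(ocr_text, target_text) / max(len(ocr_text), len(target_text), 1)
```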

No results tracked yet

TIIF-Bench mini

TIIF-Bench (Text-to-Image Instruction Following Benchmark) — mini (compact evaluation subset)

TIIF-Bench (Text-to-Image Instruction Following Benchmark) is a benchmark introduced to systematically evaluate how well text-to-image (T2I) models interpret and follow detailed user instructions. The full TIIF-Bench (paper arXiv:2506.02161) organizes prompts across multiple concept pools and six compositional prompt dimensions (including new dedicated dimensions for text rendering and style control), provides concise and extended prompt variants, and proposes fine-grained evaluation metrics for alignment between textual instructions and generated images. The authors also publish the images generated by the evaluated (proprietary) models on the Hugging Face Hub as A113NW3I/TIIF-Bench-Data. "TIIF-Bench mini" refers to the compact subset the authors use for faster text-to-image instruction-following evaluation in their experiments (the full benchmark and released HF dataset are linked below).

No results tracked yet

OmniContext

OmniContext

OmniContext is a small subject-driven any-to-image / image-to-image benchmark (400 examples) released as part of the OmniGen2 project to evaluate in-context image generation. The benchmark contains diverse input images and natural-language instructions and is organized into per-setting categories (reported in evaluations as SINGLE / MULTIPLE / SCENE) with per-setting and average scores. Evaluation is automated using an LLM-based, interpretable metric pipeline (the dataset page cites GPT-4.1 for metric-driven assessment). The Hugging Face dataset provides a single split (train, 400 rows) and fields such as task_type, instruction, and input_images; license: Apache-2.0. Project resources and code are available from the OmniGen2 project (GitHub) and the dataset page on Hugging Face.

No results tracked yet

OneIG-EN

OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation (English track — OneIG-EN)

OneIG-Bench (English track, often referred to as OneIG-EN) is a benchmark dataset and evaluation suite for text-to-image (T2I) generation models that provides fine-grained, omni-dimensional human/evaluator judgments. It evaluates T2I outputs along five dimensions — Alignment (prompt-image semantic alignment), Text (text rendering and fidelity), Reasoning (knowledge- or logic-based correctness), Style (stylistic fidelity and diversity of styles), and Diversity — and reports both per-dimension scores and an Overall score. The Hugging Face dataset release contains two subsets: an English subset (OneIG-Bench, ~1.12k prompts/instances) and a Chinese subset (OneIG-Bench-ZH, ~1.32k prompts/instances), together covering ~2.44k test cases used by the paper. Metadata on the Hugging Face page lists the task category as text-to-image, the license as CC-BY-NC-4.0, and links to the paper, project page, and code repository.

No results tracked yet

Related Tasks

Get notified when these results update

New models drop weekly. We track them so you don't have to.

Something wrong or missing?

Help keep Image generation benchmarks accurate. Report outdated results, missing benchmarks, or errors.
