General

A broad category encompassing machine learning research and tasks that don't fit specifically into vision or language domains, including general ML methods, optimization, and cross-domain approaches.

11 tasks, 87 datasets, 8 results

Tasks & Benchmarks

Video-Language Models

CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding

CinePile: A Long Video Question Answering Dataset and Benchmark

EgoLife: Towards Egocentric Life Assistant

EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding

LVBench

MLVU

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

MVBench

Minimal Video Pairs (MVP)

PLM-VideoBench (PerceptionLM Video Benchmark)

67.7 (MBAcc), PLM (8B)

Perception Test

TOMATO (Temporal Reasoning Multimodal Evaluation)

TempCompass

TemporalBench (MBA-short QA): Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

Video-MME

Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark (2025)

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

Coding Agents

CRUXEval-O (CRUX-O): CRUXEval output-prediction subset

HumanEval

87.8 (Pass@1), Qwen2.5-Plus

LiveCodeBench

55.5 (Pass@1), Qwen2.5-72B-Instruct

Mostly Basic Python Problems (MBPP)

88.2 (Pass@1), Qwen2.5-72B-Instruct

MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation

77 (Pass@1), Qwen2.5-Plus

SWE-Bench Verified

SciCode: A Research Coding Benchmark Curated by Scientists

Embedding models

No datasets indexed yet. Contribute on GitHub

General

Humanity's Last Exam (HLE)

Omni models

Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Reasoning

No datasets indexed yet. Contribute on GitHub

Reinforcement Learning

No datasets indexed yet. Contribute on GitHub

Retrieval

AmsterTime: A Visual Place Recognition Benchmark Dataset for Severe Domain Shift

BEIR (Benchmarking-IR): A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models

CodeSearchNet (CSN): CodeSearchNet Corpus

INRIA Copydays (strong subset)

MLDR (Multilingual Long-Document Retrieval), English subset

Revisited Paris (RParis / R_Par / RParis6k), Medium split

StackOverflow-QA (StackQA)

Vision-Language Models

AI2D (AI2 Diagrams Dataset): “A Diagram Is Worth A Dozen Images”

GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

HallusionBench: An Advanced Diagnostic Suite for Spotting Language Hallucination

InfoVQA

IntelligentBench (BAGEL evaluation suite)

M-LongDoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework

MEGA-Bench (macro): Scaling Multimodal Evaluation to over 500 Real-World Tasks

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities ("Multimodal Veterinarian")

MMBench-CN (MMBench Chinese Test): Is Your Multi-modal Model an All-around Player?

MMBench-EN (MMBench English Test): Is Your Multi-modal Model an All-around Player?

MMBench-V1.1 (MMBench V1.1 Test)

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

MMMU

MMMU-Pro

MMStar: Are We on the Right Way for Evaluating Large Vision-Language Models?

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering

MathVision

MathVista

Meta-World MT50 (authors' collected dataset)

NIH/Multi-needle: MMNeedle (Multimodal Needle-in-a-haystack)

OlympiadBench (full): A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

OmniBench

RealWorldQA

RefCOCO (Referring Expressions COCO2016)

RefCOCO / RefCOCO+ / RefCOCOg (overall): referring-expression visual grounding datasets on MS COCO

SEED-Bench (SeedBench)

SO100 (real-world: Pick-Place, Stacking, Sorting)

SO101 (real-world: Pick-Place-Lego), lerobot/svla_so101_pickplace

TextVQA

VCR-Wiki-EN-Easy: VCR-Wiki English Easy (Visual Caption Restoration)

VCR-Wiki-ZH-Easy: VCR-Wiki Chinese Easy (Visual Caption Restoration)

VQAv2: Visual Question Answering v2.0

Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models

WISE: A World Knowledge-Informed Semantic Evaluation

ZeroBench

World Models

No datasets indexed yet. Contribute on GitHub

Computer Use Agents

BrowseComp

MMBench-GUI (MMB-GUI): Hierarchical Multi-Platform Evaluation Framework for GUI Agents

OSWorld-G (OSW-G): OSWorld desktop grounding benchmark

OSWorld

OSWorld (50 steps): Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

OSWorld-Verified

ScreenSpot-v2 (SSv2)

ScreenSpot-Pro

UI-Vision (UI-V): A Desktop-centric GUI Benchmark for Visual Perception and Interaction

WebArena

WindowsAgentArena