Computer Use Agents
Computer Use Agents are AI systems that can interact with computer interfaces, understand graphical user interfaces, execute tasks, and navigate software applications autonomously. These agents combine vision, language understanding, and action planning to perform complex computer-based tasks.
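At their core, these agents run an observe-plan-act loop over screenshots. The sketch below illustrates that structure only; `capture_screenshot`, `plan_next_action`, and `execute_action` are hypothetical placeholders for a screen-capture utility, a vision-language model call, and an OS input driver, none of which are prescribed by the benchmarks on this page.

```python
from dataclasses import dataclass

# Hypothetical action container; real agents use richer schemas
# (scrolling, dragging, key combinations, etc.).
@dataclass
class Action:
    kind: str          # e.g. "click", "type", or "done"
    x: int = 0         # screen coordinates for pointer actions
    y: int = 0
    text: str = ""     # payload for keyboard actions

def run_agent(task, capture_screenshot, plan_next_action, execute_action, max_steps=50):
    """Generic observe -> plan -> act loop for a computer-use agent.

    capture_screenshot() -> bytes            : grabs the current screen
    plan_next_action(task, image) -> Action  : vision-language model picks the next step
    execute_action(action) -> None           : sends the click/keystroke to the OS
    """
    for _ in range(max_steps):
        screenshot = capture_screenshot()            # perceive
        action = plan_next_action(task, screenshot)  # plan
        if action.kind == "done":                    # model signals task completion
            return True
        execute_action(action)                       # act
    return False  # step budget exhausted (cf. the 50-step OSWorld setting below)
```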
Computer Use Agents is a key task in the General category. Below are the standard benchmarks used to evaluate models, along with current state-of-the-art results.
Benchmarks & SOTA
OSWorld (50 steps)
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
OSWorld is a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation.
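That execution-based recipe (restore a scripted initial state, let the agent act for a bounded number of steps, then run the task's checker script) can be sketched as follows. The gym-style `env` and `agent` interface here is an assumption for illustration, not the literal OSWorld API.

```python
import json

def evaluate_osworld_style_task(env, agent, task_config_path, max_steps=50):
    """Execution-based evaluation in the OSWorld style (sketch only).

    Hypothetical interface, not the literal OSWorld API:
      env.reset(task_config) -> obs : restores the scripted initial state
      env.step(action)       -> obs : executes one mouse/keyboard action in the VM
      env.evaluate()       -> float : task-specific checker script, returns a 0-1 score
      agent.act(instruction, obs)   : returns the next action, or None when finished
    """
    with open(task_config_path) as f:
        task_config = json.load(f)       # initial-state setup + natural-language instruction

    obs = env.reset(task_config)
    for _ in range(max_steps):
        action = agent.act(task_config["instruction"], obs)
        if action is None:               # agent declares the task done
            break
        obs = env.step(action)

    return env.evaluate()                # reproducible, script-based scoring
```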
No results tracked yet
OSWorld
OSWorld is a dataset and benchmark for evaluating multimodal agents on open-ended tasks in real computer environments. It includes 369 tasks on Ubuntu and 43 tasks on Windows, and has a manually annotated version called OSWorld-Human which contains human-determined trajectories for each task to help optimize computer-use agents.
No results tracked yet
OSWorld-Verified
OSWorld-Verified is a major upgrade to the original OSWorld benchmark, providing a more stable and scalable foundation for research and development in computer-use agents. It fixes over 300 issues related to web structure changes, instruction ambiguity, and evaluation robustness, adds a public evaluation platform for comparisons, and keeps the same number of examples (369) as the original OSWorld.
No results tracked yet
BrowseComp
BrowseComp is a benchmark for measuring the ability of agents to browse the web. It comprises 1,266 questions that require persistently navigating the internet to locate hard-to-find, entangled information, making it a demanding test for computer-use agents.
No results tracked yet
ScreenSpot-Pro
ScreenSpot-Pro is a benchmark for evaluating GUI grounding models in high-resolution, professional computer-use environments. Its tasks are collected from authentic, high-resolution professional desktop environments and cover 23 applications across 5 industries and 3 operating systems.
No results tracked yet
SSv2 (ScreenSpot-v2)
ScreenSpot (ScreenSpot-v2)
ScreenSpot is a cross-platform screenshot grounding benchmark introduced alongside the SeeClick visual GUI agent (Cheng et al., 2024). It contains screenshot images from mobile, web, and desktop environments with grounding annotations that map natural-language instructions (or referring expressions) to on-screen UI elements (bounding boxes). The dataset targets GUI visual grounding / screenshot understanding, i.e., locating the UI element referred to by a text query, and has been released in Hugging Face-hosted variants (e.g., rootsautomation/ScreenSpot and ScreenSpot-v2 entries). A Hugging Face preview of a ScreenSpot-v2 variant shows ~1,272 samples with fields such as image, instruction, bbox, data_source, and data_type. The key source is the SeeClick paper (Cheng et al., 2024), which describes constructing ScreenSpot to cover mobile, desktop, and web for improving GUI grounding.
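Given the per-sample fields listed above (image, instruction, bbox), a typical grounding evaluation checks whether the model's predicted click point falls inside the annotated box. A minimal sketch, assuming a [left, top, width, height] pixel bbox format, a "test" split, and a hypothetical `predict_click` wrapper (all of which should be verified against the dataset card):

```python
from datasets import load_dataset

def point_in_bbox(x, y, bbox):
    """bbox is assumed to be [left, top, width, height] in pixels; verify against the dataset card."""
    left, top, w, h = bbox
    return left <= x <= left + w and top <= y <= top + h

def screenspot_accuracy(predict_click, split="test"):
    """predict_click(image, instruction) -> (x, y) is a hypothetical model wrapper;
    the split name is also an assumption."""
    ds = load_dataset("rootsautomation/ScreenSpot", split=split)
    hits = 0
    for sample in ds:
        x, y = predict_click(sample["image"], sample["instruction"])
        hits += point_in_bbox(x, y, sample["bbox"])
    return hits / len(ds)
```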
No results tracked yet
UI-V (UI-Vision)
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
UI-Vision (UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction) is a license-permissive benchmark for evaluating desktop GUI perception and interaction. It contains dense, high-quality annotations of human demonstrations across a wide range of real-world desktop applications (the paper reports 83 applications) including bounding boxes and UI element labels, action trajectories (clicks, drag-and-drop, and keyboard inputs), and layout information. The benchmark defines three evaluation tasks — Element Grounding, Layout Grounding, and Action Prediction — with metrics to measure fine-to-coarse agent performance in desktop environments. The dataset is hosted on Hugging Face (ServiceNow/ui-vision) under an MIT license; the HF preview shows a train split (≈1.46k rows) and the repository metadata classifies it as image-text-to-text / image modality.
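For the Action Prediction task, one common way to score a step is to require the predicted action type to match the demonstration and, for pointer actions, the predicted coordinates to land on the target element. The sketch below uses an assumed step schema and is not the official UI-Vision scoring code.

```python
def action_step_correct(pred, gold):
    """Score one action-prediction step (sketch under an assumed step schema).

    Hypothetical schema, not the official UI-Vision format:
      {"type": "click" | "drag" | "key", "x": int, "y": int, "text": str}
    with the gold step additionally carrying "bbox": [left, top, width, height].
    """
    if pred["type"] != gold["type"]:
        return False
    if gold["type"] == "key":                    # keyboard input: compare the typed text
        return pred.get("text", "") == gold.get("text", "")
    left, top, w, h = gold["bbox"]               # pointer action: spatial check
    return left <= pred["x"] <= left + w and top <= pred["y"] <= top + h
```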
No results tracked yet
OSW-G (OSWorld-G)
OSWorld-G (OSWorld desktop grounding benchmark)
OSWorld-G is a desktop GUI grounding benchmark introduced in the paper "Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis" (arXiv:2505.13227). It is designed to evaluate grounding capability for desktop applications — mapping natural language instructions to specific on-screen elements and actions. OSWorld-G comprises 564 finely annotated examples spanning diverse task types including text matching, element recognition, layout understanding, and precise manipulation. The project also releases a much larger synthetic training dataset (Jedi, ~4 million examples) and code/models; the benchmark, data pipeline, and code are open-sourced (GitHub: xlang-ai/OSWorld-G) and a Hugging Face dataset release exists (MMInstruction/OSWorld-G).
No results tracked yet
MMB-GUI (MMBench-GUI)
MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents
MMBench-GUI (MMB-GUI) is a hierarchical, multi-platform benchmark for evaluating GUI automation / computer-use agents across Windows, macOS, Linux, iOS, Android and Web. The benchmark is organized into four progressive levels: (L1) GUI Content Understanding, (L2) Element Grounding, (L3) Task Automation, and (L4) Task Collaboration, covering core capabilities from visual understanding to multi-step cross-application task completion. It provides platform-specific splits (desktop, mobile, web) and annotations for grounding (e.g., element bounding boxes and types), tasks, and instructions. The benchmark also proposes an efficiency-aware metric (Efficiency-Quality Area, EQA) to measure both success and action efficiency. The L2 (MMBench-GUI / MMB-GUI Element Grounding) configuration is explicitly intended for testing cross-platform visual grounding (mobile, web, desktop splits). Source and metadata available on Hugging Face (license: Apache-2.0) and the paper is on arXiv (arXiv:2507.19478).
No results tracked yet
WindowsAgentArena
No results tracked yet
WebArena
WebArena is a realistic web environment and benchmark for evaluating autonomous agents on web tasks. It contains 812 long-horizon tasks instantiated from 241 templates, each specified by a natural-language intent, with observations provided as HTML/DOM trees and screenshots and actions issued via keyboard and mouse.
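Agents built on such environments typically emit actions as short text commands that are parsed and then grounded against the observed DOM/accessibility tree. A minimal parser sketch, assuming a "command [arg] [arg]" grammar similar to common WebArena baselines (verify against the agent implementation you actually run):

```python
import re

def parse_action(model_output):
    """Parse an action string in the "command [arg] [arg]" style used by many
    WebArena baselines, e.g. "click [42]" or "type [17] [hello world]".
    The exact grammar is an assumption; check the agent implementation you use.
    """
    m = re.match(r"(\w+)\s*((?:\[[^\]]*\]\s*)*)$", model_output.strip())
    if not m:
        return None
    return {"action": m.group(1), "args": re.findall(r"\[([^\]]*)\]", m.group(2))}

# Example: {'action': 'type', 'args': ['17', 'hello world']}
print(parse_action("type [17] [hello world]"))
```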
No results tracked yet
Related Tasks
General
World Models
World models are internal, learned representations in AI that function like a "computational snow globe," allowing an agent to understand its environment, predict future states, and simulate the outcomes of actions before acting in the real world. They are essential for building sophisticated AI systems that can reason, make decisions, and interact with complex environments by simulating dynamics like physics, motion, and spatial relationships.
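A minimal sketch of that simulate-before-acting idea, with `world_model` and `reward_fn` as hypothetical learned components (no specific library or benchmark is implied):

```python
import random

def plan_with_world_model(world_model, reward_fn, state, candidate_actions, horizon=5):
    """Choose an action by simulating rollouts in a learned model before acting.

    Hypothetical callables:
      world_model(state, action) -> next_state   # learned dynamics / simulator
      reward_fn(state)           -> float        # task-specific score for a state
    """
    best_action, best_return = None, float("-inf")
    for first_action in candidate_actions:
        s = world_model(state, first_action)          # imagine taking the action
        total = reward_fn(s)
        for _ in range(horizon - 1):                  # continue the imagined rollout
            s = world_model(s, random.choice(candidate_actions))
            total += reward_fn(s)
        if total > best_return:
            best_action, best_return = first_action, total
    return best_action
```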
Omni models
Omni models are AI models that take multiple modalities (language, vision, audio) as input and produce multiple modalities as output. Early examples include [Qwen2.5 Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) and [BAGEL](https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT).
Video-Language Models
Video Language Models (Video LLMs) are advanced AI systems that combine large language models with video processing capabilities to understand and generate descriptive content from videos. They bridge the gap between visual and textual information by using special encoders to convert video data into a format that a standard text-based large language model (LLM) can process, enabling tasks like video analysis, content generation, and question answering about video content.