General

Omni models

Omni models are AI models that take multiple modalities (language, vision, audio) as input and produce multiple modalities as output. Early examples of omni models include [Qwen2.5 Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) and [BAGEL](https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT).
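
To make the "multiple modalities in, multiple modalities out" idea concrete, the sketch below loosely follows the Qwen2.5-Omni model card: one model consumes a video (including its audio track) and returns both text and a speech waveform. It assumes a recent transformers release with Qwen2.5-Omni support plus the `qwen_omni_utils` helper referenced on the model card; consult the card for the exact, current API.

```python
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info  # helper from the model card

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

# A multimodal conversation: video (with audio) in, text + speech out.
conversation = [
    {"role": "user", "content": [
        {"type": "video", "video": "example.mp4"},
        {"type": "text", "text": "What is happening, and what do you hear?"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(text=text, audio=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True, use_audio_in_video=True)
inputs = inputs.to(model.device)

# The model generates text tokens and a speech waveform (multi-modal output).
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
```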

Omni models are a key task within the General category. Below you will find the standard benchmarks used to evaluate these models, along with current state-of-the-art results.

Benchmarks & SOTA

WorldSense

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

WorldSense is a real-world omni-modal benchmark for evaluating multimodal LLMs on audio-visual-text video understanding. Its synchronized audio, visual, and text inputs are designed to require synergistic use of the audio and video signals rather than either alone. The benchmark comprises a diverse collection of 1,662 audio-visual synchronized videos organized into 8 primary domains and 67 fine-grained subcategories, with 3,172 multiple-choice QA pairs spanning 26 distinct task types, all produced and quality-checked by expert annotators. WorldSense targets grounded reasoning and comprehensive evaluation of models that must integrate vision, audio, and textual cues.

No results tracked yet
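
Scoring on WorldSense reduces to multiple-choice accuracy over the QA pairs. A minimal sketch of such a loop follows; the sample fields ("video", "question", "options", "answer") and the `model.answer` interface are hypothetical placeholders, since the official release defines its own schema and harness.

```python
# Hedged sketch of a WorldSense-style evaluation loop: multiple-choice
# accuracy over audio-visual videos. Field names and the model interface
# are assumptions for illustration, not the benchmark's actual API.
def mcq_accuracy(model, samples):
    correct = 0
    for s in samples:
        prompt = s["question"] + "\n" + "\n".join(
            f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(s["options"])
        )
        # The model must draw on both the visual frames and the audio track.
        pred = model.answer(video=s["video"], prompt=prompt)
        correct += pred.strip().upper().startswith(s["answer"].upper())
    return correct / len(samples)
```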

Daily-Omni

Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

Daily-Omni is an audio-visual question-answering benchmark that emphasizes temporal alignment across modalities. According to the paper and dataset repository, it comprises real-world short videos of daily-life scenarios with multiple-choice QA pairs designed to require integration of the audio and visual streams. The project provides a QA generation pipeline (automatic annotation, QA generation, and QA optimization) to scale dataset creation, validated by human evaluation, and includes a baseline agent (Daily-Omni-Agent) that combines open-source visual-language, audio-language, and ASR models with simple temporal alignment methods. The Hugging Face dataset listing and project repository report 684 videos and 1,197 multiple-choice QA pairs across six main task categories, all focused on audio-visual integration.

No results tracked yet
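
The baseline agent's core idea is that temporal alignment can be approximated in text: caption each time segment with separate visual and audio models, merge the captions by timestamp, and let a text-only LLM reason over the time-ordered transcript. The sketch below illustrates that idea; every helper name here is a hypothetical placeholder, not the project's actual API.

```python
# Hedged illustration of a Daily-Omni-Agent-style pipeline. `split_into_segments`
# and the vlm/alm/asr/llm interfaces are hypothetical placeholders.
def answer_with_temporal_alignment(video, question, vlm, alm, asr, llm, seg_s=5.0):
    timeline = []
    for seg in split_into_segments(video, seg_s):
        timeline.append({
            "t": seg.start,
            "visual": vlm.caption(seg.frames),   # what is seen
            "audio": alm.caption(seg.audio),     # what is heard
            "speech": asr.transcribe(seg.audio), # what is said
        })
    # Interleave modalities by timestamp so cross-modal order is explicit.
    context = "\n".join(
        f"[{e['t']:.0f}s] seen: {e['visual']} | heard: {e['audio']} | said: {e['speech']}"
        for e in timeline
    )
    return llm.answer(f"{context}\n\nQuestion: {question}")
```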

Related Tasks

World Models

World models are internal, learned representations in AI that function like a "computational snow globe," allowing an agent to understand its environment, predict future states, and simulate the outcomes of actions before acting in the real world. They are essential for building sophisticated AI systems that can reason, make decisions, and interact with complex environments by simulating dynamics like physics, motion, and spatial relationships.
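
The "simulate before acting" loop is easiest to see as model-predictive control over a learned transition function. The toy sketch below is purely illustrative: `dynamics` and `reward` stand in for learned networks, and nothing here corresponds to any particular system.

```python
import numpy as np

# Toy world-model planning sketch: "imagine" rollouts for candidate action
# sequences with a learned dynamics model, then act on the best opener.
def plan(state, dynamics, reward, candidates, horizon=10):
    best_seq, best_return = None, -np.inf
    for actions in candidates:          # candidate action sequences
        s, total = state, 0.0
        for a in actions[:horizon]:
            s = dynamics(s, a)          # simulate, don't act in the real world
            total += reward(s, a)
        if total > best_return:
            best_seq, best_return = actions, total
    return best_seq[0]                  # execute only the first action
```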

Video-Language Models

Video Language Models (Video LLMs) are AI systems that combine large language models with video processing capabilities to understand and generate descriptive content from videos. They bridge the gap between visual and textual information by using special encoders to convert video data into a format that a standard text-based LLM can process, enabling tasks like video analysis, content generation, and question answering about video content.
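
A schematic of that common recipe: a frozen vision encoder turns sampled frames into embeddings, a small projector maps them into the LLM's token-embedding space, and the result is prepended to the text embeddings. Dimensions and module choices below are illustrative assumptions, not any specific model.

```python
import torch
import torch.nn as nn

# Illustrative video-to-LLM projector; sizes are placeholder assumptions.
class VideoProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, frame_feats, text_embeds):
        # frame_feats: (batch, frames * patches, vision_dim) from a frozen encoder
        visual_tokens = self.proj(frame_feats)  # -> (batch, n, llm_dim)
        # The LLM consumes these "visual tokens" exactly like word embeddings.
        return torch.cat([visual_tokens, text_embeds], dim=1)
```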

Coding Agents

Coding agents are autonomous, AI-powered software development tools that understand natural language prompts and execute multi-step tasks to automate coding, bug fixing, and entire software workflows. They act as intelligent assistants within the software development lifecycle, capable of understanding code, generating new code, optimizing existing code, debugging, and handling tasks like documentation and feature scaffolding with minimal user supervision. Examples of coding agents include Claude Code and Cursor Agent.
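
At the core of most coding agents sits a propose-test-repair loop, sketched below. `llm_propose_patch`, `apply_patch`, and `run_tests` are hypothetical helpers; real agents such as Claude Code layer tool use, sandboxing, and richer planning on top of this basic cycle.

```python
# Toy coding-agent loop: propose a patch, run the tests, feed failures back.
def coding_agent(task, repo, max_iters=5):
    history = [f"Task: {task}"]
    for _ in range(max_iters):
        patch = llm_propose_patch(repo, "\n".join(history))
        apply_patch(repo, patch)
        ok, report = run_tests(repo)    # ground the agent in real feedback
        if ok:
            return patch                # tests pass: done
        history.append(f"Tests failed:\n{report}")
    raise RuntimeError("No passing patch within budget")
```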
