General Benchmarks - General - CodeSOTA

General

Task for General

1 dataset, 0 results

Below are the standard benchmarks used to evaluate models on general tasks, along with current state-of-the-art results.

Benchmarks & SOTA

Humanity's Last Exam (HLE)

0 results

A highly challenging, multi-modal benchmark with 2,500 expert-level academic questions across a broad range of disciplines, designed to test models at the absolute frontier of human knowledge and require genuine reasoning capabilities rather than simple factual recall.

No results tracked yet

Related Tasks

World Models

World models are internal, learned representations in AI that function like a "computational snow globe," allowing an agent to understand its environment, predict future states, and simulate the outcomes of actions before acting in the real world. They are essential for building sophisticated AI systems that can reason, make decisions, and interact with complex environments by simulating dynamics like physics, motion, and spatial relationships.
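The idea of simulating outcomes before acting can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the one-dimensional state, the hand-written `predict_next` function standing in for a learned model, and the exhaustive search in `plan`. Real world models are learned from data and operate on far richer states.

```python
from itertools import product
from typing import Callable, Sequence

State = float   # hypothetical 1-D position, a stand-in for a richer environment state
Action = float  # hypothetical displacement applied in one step


def predict_next(state: State, action: Action) -> State:
    """Toy stand-in for a learned world model: predicts the next state."""
    return state + action


def rollout(state: State, actions: Sequence[Action],
            model: Callable[[State, Action], State]) -> State:
    """Imagine a sequence of actions using only the model, not the real environment."""
    for action in actions:
        state = model(state, action)
    return state


def plan(state: State, goal: State, actions: Sequence[Action], horizon: int,
         model: Callable[[State, Action], State] = predict_next) -> Action:
    """Return the first action of the imagined rollout that ends closest to the goal."""
    best = min(
        product(actions, repeat=horizon),
        key=lambda seq: abs(rollout(state, seq, model) - goal),
    )
    return best[0]
```

For example, `plan(0.0, 3.0, (-1.0, 0.0, 1.0), horizon=3)` returns `1.0`: the agent evaluates every three-step sequence inside the model and only then commits to the first step of the best one.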

Omni models

Omni models are AI models that take multiple modalities (language, vision, audio) as input and produce multiple modalities as output. Early examples include [Qwen2.5 Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) and [BAGEL](https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT).

Video-Language Models

Video-Language Models (Video LLMs) are AI systems that combine large language models with video processing to understand videos and generate descriptive content about them. They bridge visual and textual information by using dedicated encoders to convert video data into a format that a standard text-based large language model (LLM) can process, enabling tasks like video analysis, content generation, and question answering about video content.

Coding Agents

Coding agents are autonomous, AI-powered software development tools that understand natural language prompts and execute multi-step tasks to automate coding, bug fixing, and entire software workflows. They act as intelligent assistants within the software development lifecycle, capable of understanding code, generating new code, optimizing existing code, debugging, and handling tasks like documentation and feature scaffolding with minimal user supervision. Examples of coding agents include Claude Code and Cursor Agent.
