
Task agents.

AI agents are autonomous software systems that use artificial intelligence to achieve goals and complete tasks on behalf of users, acting independently to perceive their environment, make decisions, and take actions without constant human intervention. They use advanced capabilities like reasoning, memory, planning, and learning, often leveraging large language models (LLMs) and other AI tools to interpret information and perform complex workflows across various industries.

Datasets: 9 · Results: 45 · Canonical metric: acc-tau-0-33
§ 02 · Canonical benchmark

The reference dataset.

Collider-Bench

A benchmark for autonomous coding and scientific agents that reproduce Large Hadron Collider analyses. The public CodeSOTA score is Acc_tau at tau = 0.33: the percentage of simulation tasks whose relative-L2 error is below 0.33, derived from Table 2 and Eq. 4 of arXiv:2605.13950.

Primary metric: acc-tau-0-33
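
The metric described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `relative_l2_error` and `acc_tau` are hypothetical, the relative-L2 error is assumed to be the L2 norm of the difference divided by the L2 norm of the reference, and a task with no usable output is assumed to count as a failure (represented here by an infinite error). Eq. 4 of the paper may differ in its details.

```python
import math
from typing import Sequence


def relative_l2_error(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Assumed definition: ||predicted - reference||_2 / ||reference||_2."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    diff_norm = math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)))
    ref_norm = math.sqrt(sum(r ** 2 for r in reference))
    if ref_norm == 0.0:
        raise ValueError("reference norm is zero; relative error is undefined")
    return diff_norm / ref_norm


def acc_tau(errors: Sequence[float], tau: float = 0.33) -> float:
    """Percentage of tasks whose error is strictly below tau."""
    if not errors:
        raise ValueError("need at least one task")
    # Assumption: a task without usable output is passed in as math.inf,
    # so it can never fall below the threshold.
    passed = sum(1 for e in errors if e < tau)
    return 100.0 * passed / len(errors)


# Made-up per-task errors, for illustration only (not benchmark data).
example_errors = [0.10, 0.25, 0.50, 0.90, 1.20]
print(acc_tau(example_errors))  # two of five tasks fall below 0.33
```

Because the score is a pass rate over a fixed set of tasks, it moves in discrete steps: with N tasks, each additional task below the threshold adds 100/N points.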
§ 03 · Top 10

Leading models.

Leading models on Collider-Bench.

#  Model                       acc-tau-0-33  Year  Source
1  Codex CLI (GPT-5.5)         30.0          2026  paper
2  Claude Code (Opus 4.7)      20.0          2026  paper
3  Claude Code (Sonnet 4.6)    10.0          2026  paper
4  Claude Code (Haiku 4.5)     0.000         2026  paper
5  Codex CLI (GPT-5.4-mini)    0.000         2026  paper
6  ForgeCode (DeepSeek-V4)     0.000         2026  paper

§ 04 · All datasets

Tracked datasets.

9 datasets tracked for this task.

Collider-Bench (canonical) · 6 results · acc-tau-0-33 · Top: Codex CLI (GPT-5.5) 30.0
AcademiClaw · 35 results · avg-score · Top: Gemini-3.1-Pro 2857
MedMemoryBench · 2 results · Top: Letta 41.5
PhysicianBench · 2 results · Top: GPT-5.5 46.3
BFCL · 0 results
Nexus · 0 results
TauBench (airline) · 0 results
TauBench (retail) · 0 results
Terminal Bench · 0 results

§ 05 · Related tasks

Other tasks in Agentic AI.

Agent Memory · Autonomous Coding · Bioinformatics Agents · HCAST · RE-Bench · SWE-bench · Time Horizon · Tool Use