Codesota · Registry · Agentic AI
The area-level register · Issue: April 22, 2026
Area hub · Agentic AI

Agents,
under test.

Tool-use, multi-step workflows, web agents. The frontier that demos best and produces the widest gap between benchmark scores and production reliability.

Agentic AI evolved from experimental prototypes to production enterprise systems in 2025. Major platforms from Anthropic, Google, Microsoft, and OpenAI achieved 50-70% success on real-world coding tasks, yet the gap between benchmark performance and production reliability remains substantial. Success depends less on raw model scale and more on orchestration, error handling, and human oversight.

§ 01 · Top tasks

Sub-tasks in Agentic AI.

Each task opens onto a leaderboard of its canonical benchmark, with the full submission history and dated scores. Tasks without an indexed result are listed elsewhere in the register; the table below is sorted by result count.

Fig 01 · All 10 tasks under Agentic AI, sorted by result count.
§ 02 · Top benchmarks

Current state of the art.

Leading scores for the headline benchmarks in this area, drawn from the registry. Shaded rows mark the top result per task; follow any row into the full leaderboard.

# · Task · Benchmark · Leading model · Score (metric)
01 · Task agents · AcademiClaw: agentic frontier tasks benchmark · Gemini 3.1 Pro · 2857.0 (avg-tokens-per-task-k)
02 · SWE-bench · SWE-bench Verified — Agentic Leaderboard · Claude Mythos Preview · 93.90 (resolve-rate)
03 · Autonomous Coding · Terminal-Bench 2.0 · Codex / GPT-5.5 · 82.0% (accuracy)
04 · Tool Use · Tau2-Bench: Agentic Tool-Use Benchmark · Claude Opus 4.5 · 79.00 (pass_rate)
05 · Web & Desktop Agents · WebArena: A Realistic Web Environment for Building Autonomous Agents · Agent-E (GPT-4o) · 73.00 (success-rate)
06 · Time Horizon · METR Autonomy Evaluation: Time Horizon · Claude Opus 4 · 60.00 (task-horizon-minutes)
07 · HCAST · Human-Calibrated Autonomy Software Tasks · Claude Opus 4 · 55.00 (success-rate)
08 · Bioinformatics Agents · BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology · GPT-4o · 17.0% (accuracy)
09 · RE-Bench · Research Engineering Benchmark · o3 · 0.380 (normalized-score)
Fig 02 · Headline benchmarks for Agentic AI. Full leaderboards, dated history and reproduction status live on the task pages.
Side note

State of the Field (2025)

  • 01 · Leading models: Gemini 3 Pro (76.2% SWE-bench), Claude 3.5 Sonnet (49% SWE-bench), and o3 (87% GPQA Diamond) demonstrate PhD-level expertise on academic benchmarks
  • 02 · Key benchmarks: SWE-bench for coding agents, GAIA for multi-capability reasoning, AgentBench for interactive decision-making, Terminal-Bench for operational workflows, and AMB-style suites for persistent agent memory
  • 03 · Integration standards: Model Context Protocol (MCP) and Agent-to-Agent (A2A) emerged as universal protocols enabling tool connectivity and multi-agent coordination across platforms
  • 04 · Reality check: 62% of enterprises are experimenting with agents, but most remain in the pilot phase. Hybrid human-AI teams outperform autonomous agents by 69%, despite being slower and more expensive
Picks by use-case

What to reach for.

Editorial picks · not vendor rankings
Production Coding Agents
Gemini 3 Flash or Claude 3.5 Sonnet

Gemini 3 Flash balances strong performance (competitive with Gemini 3 Pro on many tasks) with low cost and latency. Claude 3.5 Sonnet offers 49% SWE-bench with minimal scaffolding requirements. Reserve Gemini 3 Pro (76.2% SWE-bench) for genuinely complex tasks.

Multi-Agent Orchestration
Microsoft AutoGen or LangGraph

AutoGen excels at multi-agent collaboration with strong team coordination features. LangGraph provides explicit state management for complex workflows. Both offer production-grade observability and enterprise deployment patterns.
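
To make "explicit state management" concrete, here is a minimal, framework-agnostic Python sketch of the pattern LangGraph formalizes: a typed state object threaded through named nodes, with each node's return value deciding which node runs next. All names here (State, plan, execute, review) are illustrative stand-ins, not LangGraph's actual API.

    from dataclasses import dataclass, field

    @dataclass
    class State:
        task: str
        steps_done: list[str] = field(default_factory=list)

    def plan(state: State) -> str:
        state.steps_done.append("plan")       # record progress in shared state
        return "execute"                      # name of the next node

    def execute(state: State) -> str:
        state.steps_done.append("execute")
        return "review"

    def review(state: State) -> str:
        state.steps_done.append("review")
        return "END"                          # terminal edge

    NODES = {"plan": plan, "execute": execute, "review": review}

    def run(state: State, entry: str = "plan") -> State:
        node = entry
        while node != "END":                  # walk edges until a terminal node
            node = NODES[node](state)
        return state

    print(run(State(task="triage a bug report")).steps_done)
    # ['plan', 'execute', 'review']

The point of the explicit graph is that every transition is inspectable and resumable, which is what makes these workflows observable in production.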

Mathematical and Scientific Reasoning
OpenAI o3 or o4-mini

o3 achieved 87% on GPQA Diamond (exceeding PhD experts). o4-mini delivers 99.5% on AIME 2025 with tool use at a fraction of o3's cost. Inference-time scaling enables variable compute allocation based on problem difficulty.

Cost-Constrained Deployments
Llama 3.3 (70B) or Qwen 3

Open-weight models now score within 1.7% of proprietary systems on benchmarks. They enable local deployment, avoid vendor lock-in, and support custom fine-tuning. Llama 3.3 and Qwen 3 offer strong reasoning with full control over infrastructure.

Enterprise Integration
Google ADK or Anthropic Claude + MCP

Google ADK provides enterprise-grade infrastructure with tight Vertex AI integration. Anthropic's Model Context Protocol (MCP) offers universal tool connectivity across platforms. Both support governance, compliance, and security requirements for regulated industries.
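
On the MCP side, a minimal tool-server sketch, assuming the official Python SDK's FastMCP interface (verify names against the current SDK docs); the server name and tool are hypothetical:

    # Minimal MCP tool server, assuming the official Python SDK's
    # FastMCP interface; check the SDK docs for the current API.
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("ticket-tools")             # hypothetical server name

    @mcp.tool()
    def lookup_ticket(ticket_id: str) -> str:
        """Return the status of a support ticket (stubbed for illustration)."""
        return f"ticket {ticket_id}: open"

    if __name__ == "__main__":
        mcp.run()                             # serves the tool over stdio by default

Any MCP-compatible client can then discover and call lookup_ticket without bespoke glue code, which is what "universal tool connectivity" means in practice.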

Persistent agent memory
Track AMB first; treat Audrey artifacts as supporting evidence

Use the Agent Memory Benchmark for comparable provider claims. Audrey's public benchmark report and raw artifacts are useful as local, deterministic regression and performance evidence, but they do not count as a Codesota leaderboard score until an official AMB harness run exists.

Domain-Specific Applications
Fine-tuned Llama 3.3 or Mistral Large

Domain specialization (finance, healthcare, legal) justifies an 8-12 week fine-tuning investment. Llama 3.3 provides a strong foundation for customization. Mistral Large offers European data residency for GDPR compliance.

Rapid Prototyping
OpenAI Agents SDK or Anthropic Claude

The OpenAI Agents SDK offers the simplest implementation path with strong GPT integration. Claude provides excellent documentation and developer experience. Both enable fast iteration without framework complexity.

High-Volume Automation
Tiered routing: Gemini 3 Flash → Claude 3.5 Sonnet → Gemini 3 Pro

Route simple tasks to fast, cheap models and escalate complex cases to premium ones. This tiering balances cost and success rate in high-volume deployments where running everything on a premium model is economically infeasible.
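
A minimal sketch of that ladder; call_model and passes_checks are hypothetical stand-ins for your provider client and output validators:

    TIERS = ["gemini-3-flash", "claude-3.5-sonnet", "gemini-3-pro"]  # cheap -> premium

    def call_model(model: str, task: str) -> str:
        # Stand-in for a real provider call.
        return f"[{model}] draft answer for: {task}"

    def passes_checks(answer: str) -> bool:
        # Stand-in validator: schema checks, unit tests, self-consistency votes.
        return bool(answer.strip())

    def route(task: str) -> str:
        for model in TIERS:                   # escalate only when checks fail
            answer = call_model(model, task)
            if passes_checks(answer):
                return answer
        raise RuntimeError("all tiers failed; queue for human review")

    print(route("extract line items from this invoice"))

The design choice that matters is the validator: escalation is only as good as your ability to detect a failed cheap-tier answer.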

Editor's note

Honest takes.

Benchmarks lie about production readiness

Models achieving 70%+ on SWE-bench still hallucinate confidently in multi-step workflows. Academic benchmarks optimize single-task accuracy while ignoring cost, latency, error propagation, and security. Production success requires guardrails, observability, and human oversight that benchmarks don't measure.

Most agents don't need reasoning models

o3 and similar reasoning models deliver impressive results but cost 5-10x more than standard models. For 80% of enterprise use cases, Gemini 3 Flash or Claude 3.5 Sonnet offers better cost-performance. Save reasoning models for genuinely hard problems and route simple tasks to efficient models, as in the routing sketch above.

Frameworks are overrated, infrastructure matters more

LangChain, AutoGen, and CrewAI provide value, but teams often succeed with minimal scaffolding. The hard parts are observability, guardrails, memory management, and human-in-the-loop workflows. Don't let framework complexity distract from production fundamentals.
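
One of those fundamentals sketched without any framework: a thin decorator that logs every tool call's outcome and latency, so traces exist from day one. The tool body and logger name are illustrative.

    import functools, json, logging, time

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("agent.tools")

    def observed(tool):
        """Log each call's tool name, outcome, and latency; re-raise failures."""
        @functools.wraps(tool)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = tool(*args, **kwargs)
                log.info(json.dumps({"tool": tool.__name__, "ok": True,
                                     "ms": round((time.perf_counter() - start) * 1000)}))
                return result
            except Exception as exc:
                log.error(json.dumps({"tool": tool.__name__, "ok": False,
                                      "error": repr(exc)}))
                raise
        return wrapper

    @observed
    def search_docs(query: str) -> list[str]:
        return [f"result for {query}"]        # stand-in tool body

    search_docs("refund policy")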

Memory needs benchmarks before marketing claims

Agent memory should be tested on recall, update/delete behavior, contradictions, valid-time beliefs, and evidence-backed preflight decisions. Local artifacts are useful regression evidence, but provider leaderboards need a shared harness such as Agent Memory Benchmark before scores are comparable.
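
A sketch of the kind of local, deterministic regression check meant here; MemoryStore is a hypothetical in-process stand-in for a real memory backend, not the AMB harness:

    class MemoryStore:
        """Hypothetical stand-in for an agent memory backend."""
        def __init__(self):
            self._facts: dict[str, str] = {}
        def write(self, key: str, value: str):
            self._facts[key] = value          # updates overwrite: the newer belief wins
        def delete(self, key: str):
            self._facts.pop(key, None)
        def recall(self, key: str):
            return self._facts.get(key)

    def test_update_overwrites_rather_than_duplicates():
        m = MemoryStore()
        m.write("user.plan", "starter")
        m.write("user.plan", "enterprise")    # contradicting fact arrives later
        assert m.recall("user.plan") == "enterprise"

    def test_delete_actually_forgets():
        m = MemoryStore()
        m.write("user.email", "a@example.com")
        m.delete("user.email")
        assert m.recall("user.email") is None

    test_update_overwrites_rather_than_duplicates()
    test_delete_actually_forgets()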

Full autonomy is a trap for high-stakes decisions

Research shows human-AI collaboration beats pure automation by 69% despite being slower. For consequential decisions (contracts, customer communication, financial transactions), keep humans in approval loops. Speed optimization that sacrifices accuracy destroys business value.
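
A minimal approval-gate sketch: high-stakes actions are queued for a human decision instead of executed directly. The action names and risk list are placeholders.

    from dataclasses import dataclass

    HIGH_STAKES = {"send_contract", "issue_refund", "email_customer"}  # placeholder names

    @dataclass
    class Action:
        name: str
        payload: dict

    def execute(action: Action) -> str:
        return f"executed {action.name}"      # stand-in for the real side effect

    def dispatch(action: Action, approval_queue: list) -> str:
        if action.name in HIGH_STAKES:        # a human approves consequential actions
            approval_queue.append(action)
            return f"queued {action.name} for human approval"
        return execute(action)

    queue = []
    print(dispatch(Action("lookup_order", {"id": 7}), queue))   # runs immediately
    print(dispatch(Action("issue_refund", {"id": 7}), queue))   # waits for a human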

Domain-specific beats general-purpose for production

Generic models applied off-the-shelf consistently underperform domain-specialized agents. Finance, healthcare, and legal applications justify 8-12 week fine-tuning investments. Half of enterprise AI will be domain-specific by 2028.

§ 03 · Method

How this area is tracked

Every row in this register is dated and sourced.

The benchmarks above come from the same Postgres registry that powers the wider Codesota index. Each task has exactly one canonical dataset. Each score carries a metric direction, a date and — where possible — a reproduction status.
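
As a hypothetical illustration of that shape (a Python sketch, not the actual Postgres schema):

    from dataclasses import dataclass
    from datetime import date
    from enum import Enum

    class Direction(Enum):
        HIGHER_IS_BETTER = "higher"           # which way "better" points
        LOWER_IS_BETTER = "lower"

    class Repro(Enum):
        REPRODUCED = "reproduced"
        UNVERIFIED = "unverified"
        CONTESTED = "contested"               # marked, never deleted

    @dataclass(frozen=True)
    class Score:
        task: str                             # one canonical dataset per task
        benchmark: str
        model: str
        value: float
        metric: str
        direction: Direction
        recorded: date
        repro: Repro = Repro.UNVERIFIED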

When a score regresses, the prior record stays visible. When a benchmark is contested, we mark it rather than delete it. The goal is a register that argues in public.

Full methodology · The unified task index
§ Final · Related

Neighbouring registers.

Sibling area hubs, the unified task index and the methodology that binds them.