Level 4: Advanced · ~45 min

Agent Pipelines

From STRIPS planners in 1971 to Claude Code in 2025 — how we taught machines to reason, act, and recover from failure.

54 Years of Teaching Machines to Act

The idea of an autonomous agent — a system that perceives its environment, reasons about goals, and takes action — predates large language models by half a century. Every modern agent framework is a descendant of ideas from classical AI planning, robotics, and cognitive science, recombined with the reasoning capabilities of foundation models.

Understanding this lineage isn't academic nostalgia. It's the fastest way to see why certain architectures work, what failure modes are fundamental (not fixable with a bigger model), and where the real unsolved problems remain.

Era I: Classical AI Planning
1971

STRIPS: The First Agent Architecture

At SRI International, Richard Fikes and Nils Nilsson built STRIPS (Stanford Research Institute Problem Solver) for the Shakey robot — the first mobile robot that could reason about its own actions. STRIPS introduced the formalism that would dominate AI planning for decades: represent the world as a set of logical predicates, define actions by their preconditions and effects, then search for a sequence of actions that transforms the initial state into a goal state.

# STRIPS-style action definition (still used in PDDL today)
Action: move_to(robot, room_A, room_B)
  Preconditions: at(robot, room_A), connected(room_A, room_B)
  Add effects:   at(robot, room_B)
  Delete effects: at(robot, room_A)

# Planner searches: initial_state -> action1 -> action2 -> ... -> goal_state

Fikes, R. & Nilsson, N. (1971). STRIPS: A New Approach to the Application of Theorem Proving to Problem Solving. Artificial Intelligence, 2(3-4), 189-208.

STRIPS worked, but it was brittle. The world had to be fully observable, deterministic, and describable in first-order logic. Every possible action had to be pre-defined. There was no way to handle ambiguity or learn new behaviors. Yet the core abstraction — perceive state, plan actions, execute, observe result — is exactly what every modern agent does.
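The perceive-plan-execute core of STRIPS fits in a few lines. A minimal sketch of STRIPS-style forward search, with states as frozensets of ground predicates (the action names and predicates here are illustrative, not from the original system):

```python
from collections import deque

# A STRIPS action: preconditions, add effects, delete effects (all sets of predicates).
ACTIONS = {
    "move_A_B": {
        "pre": {"at(robot,A)", "connected(A,B)"},
        "add": {"at(robot,B)"},
        "del": {"at(robot,A)"},
    },
    "move_B_A": {
        "pre": {"at(robot,B)", "connected(A,B)"},
        "add": {"at(robot,A)"},
        "del": {"at(robot,B)"},
    },
}

def plan(initial: frozenset, goal: set):
    """Breadth-first search from the initial state to any state satisfying the goal."""
    frontier = deque([(initial, [])])
    seen = {initial}
    while frontier:
        state, steps = frontier.popleft()
        if goal <= state:                      # every goal predicate holds
            return steps
        for name, act in ACTIONS.items():
            if act["pre"] <= state:            # action is applicable
                nxt = frozenset((state - act["del"]) | act["add"])
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, steps + [name]))
    return None                                # goal unreachable

start = frozenset({"at(robot,A)", "connected(A,B)"})
print(plan(start, {"at(robot,B)"}))  # ['move_A_B']
```

Real planners use informed search (heuristics over relaxed problems) rather than blind BFS, but the state representation and applicability test are exactly this.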

1987

BDI: Beliefs, Desires, Intentions

Philosopher Michael Bratman formalized practical reasoning into the BDI framework: agents have beliefs (what they know about the world), desires (goals they want to achieve), and intentions (committed plans of action). Rao and Georgeff (1995) operationalized this into an agent programming paradigm. The BDI model influenced decades of multi-agent systems research and survives today in the system prompts we write for LLM agents: "You are a helpful assistant (belief). Your goal is to answer the user's question (desire). Use the following tools to accomplish this (intention)."

Bratman, M. (1987). Intention, Plans, and Practical Reason. Harvard University Press.
Rao, A. & Georgeff, M. (1995). BDI Agents: From Theory to Practice. ICMAS.

1992-2000

RL Agents: Learning from Reward

While planners assumed perfect world models, reinforcement learning agents learned through trial and error. Watkins' Q-learning (1992), Sutton and Barto's textbook (1998), and TD-Gammon (Tesauro, 1995) showed that agents could learn policies without hand-coded rules — given enough episodes. The key limitation: RL agents needed millions of interactions with their environment, making them impractical for real-world tasks where each action has a cost. LLM agents would later sidestep this by using pre-trained world knowledge instead of learning from scratch.

Sutton, R. & Barto, A. (1998/2018). Reinforcement Learning: An Introduction.
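Q-learning's core update fits in one line. A minimal tabular sketch on a toy five-state chain (the environment is invented for illustration); because Q-learning is off-policy, even a purely random behavior policy learns the optimal values:

```python
import random

# Toy 5-state chain: the agent earns reward 1 for reaching the goal at state 4.
N_STATES = 5
LEFT, RIGHT = 0, 1
alpha, gamma = 0.5, 0.9                     # learning rate, discount factor
Q = [[0.0, 0.0] for _ in range(N_STATES)]

def step(s: int, a: int):
    s2 = max(0, s - 1) if a == LEFT else min(N_STATES - 1, s + 1)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0), s2 == N_STATES - 1

random.seed(0)
for _ in range(500):
    s, done = 0, False
    while not done:
        a = random.choice([LEFT, RIGHT])    # pure exploration; off-policy learning
        s2, r, done = step(s, a)
        # Core update: nudge Q(s,a) toward reward + discounted best future value.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

# The greedy policy now points right in every non-goal state.
print([("right" if Q[s][RIGHT] > Q[s][LEFT] else "left") for s in range(N_STATES - 1)])
```

The "millions of interactions" limitation shows up immediately: even this five-state toy needs hundreds of episodes, and sample complexity grows rapidly with state-space size.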

Era II: Language Models Become Agents
October 2022

ReAct: The Paper That Launched LLM Agents

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao at Princeton and Google Brain showed that interleaving reasoning traces (chain-of-thought) with actions (tool calls) dramatically improved LLM performance on tasks requiring external information. The "Thought, Action, Observation" loop became the canonical agent pattern.

"ReAct prompts LLMs to generate both verbal reasoning traces and actions pertaining to a task in an interleaved manner, which allows the model to perform dynamic reasoning to create, maintain, and adjust high-level plans for acting while also interacting with external environments to incorporate additional information into reasoning."

Yao, S. et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.

ReAct was not the first work on tool-using LLMs, but it provided the clearest framework. Within months, LangChain, AutoGPT, and dozens of agent libraries adopted the pattern. The key insight: reasoning without action hallucinates; action without reasoning flails. You need both.

February 2023

Toolformer: Self-Taught Tool Use

Timo Schick et al. at Meta AI showed that LLMs could teach themselves when and how to call external tools. The model was given a few examples of API calls embedded in text, then generated millions of candidate tool calls, filtering for those that actually improved perplexity on the next token. The result: a model that spontaneously inserts calculator calls for math, search calls for factual questions, and translator calls for foreign text — without per-task prompting.

Schick, T. et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023.

June 2023

Function Calling: The Industry Standard

OpenAI launched function calling in the GPT-3.5/4 API, followed rapidly by Anthropic (tool use in Claude), Google (function calling in Gemini), and every major provider. Instead of parsing free-text tool invocations from the model's output, the API returned structured JSON with the function name and arguments. This was the inflection point: tool use went from a research hack to a production-grade API feature.

The architectural pattern was simple but powerful: define tools as JSON schemas, include them in the system prompt, and let the model decide when to call them. The model outputs a structured tool_call object instead of free text. Your code executes the function and feeds the result back. The model continues reasoning with the new information. This is the loop that powers every production agent today.

2023-2024

Advanced Planning: Tree of Thoughts & Beyond

Yao et al. (same lead author as ReAct) introduced Tree of Thoughts (ToT) — instead of a single chain of reasoning, the LLM explores multiple reasoning paths and backtracks when a path fails. This addressed a core weakness of ReAct: once the agent commits to a bad plan, it has no mechanism to recover.

In parallel, Wang et al. built Voyager, an LLM-powered agent that played Minecraft autonomously by writing code, storing successful programs in a skill library, and composing new skills from old ones. Significant-Gravitas released AutoGPT, which despite its limitations proved that the public was hungry for autonomous agents. Yohei Nakajima built BabyAGI in 140 lines of Python.

Yao, S. et al. (2023). Tree of Thoughts: Deliberate Problem Solving with LLMs. NeurIPS 2023.
Wang, G. et al. (2023). Voyager: An Open-Ended Embodied Agent with LLMs. NeurIPS 2023.

Era III: Production Agent Systems
2024-2025

Coding Agents: The First Killer App

Software engineering became the first domain where LLM agents achieved real production value. Devin (Cognition, March 2024) demonstrated an autonomous coding agent that could resolve GitHub issues end-to-end. Claude Code (Anthropic, 2025), Cursor, Windsurf, and Codex CLI (OpenAI, 2025) brought agentic coding to millions of developers.

The SWE-bench benchmark became the standard yardstick: given a GitHub issue and the full repo, resolve the issue autonomously. SOTA went from 1.96% (GPT-4 zero-shot, October 2023) to 72.0% (Claude 3.5 Sonnet + scaffolding, early 2025). The pattern that won: give the agent shell access, a test suite, and a retry budget. Let it code, run tests, read errors, fix, and iterate. Not tree search. Not elaborate planning. Just the ReAct loop with real-world feedback.
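That code-test-iterate loop can be sketched in a few lines. Here the LLM proposals are replaced with a canned list of patches and the test suite with an in-process check, so only the control flow is real (all names are illustrative):

```python
def run_tests(code: str):
    """Stand-in for running the repo's test suite; returns (passed, error_output)."""
    try:
        env: dict = {}
        exec(code, env)
        assert env["kth_largest"]([3, 1, 4, 1, 5], 2) == 4
        return True, ""
    except Exception as e:
        return False, repr(e)

def coding_agent_loop(attempts: list, max_attempts: int = 5) -> bool:
    """Code, run tests, read errors, fix, iterate -- with a retry budget."""
    feedback = ""
    for patch in attempts[:max_attempts]:   # each patch would be an LLM call using `feedback`
        passed, feedback = run_tests(patch)
        if passed:
            return True                     # tests pass: issue resolved
    return False                            # retry budget exhausted

# A canned sequence standing in for successive LLM proposals: buggy first, fixed second.
buggy = "def kth_largest(xs, k): return sorted(xs)[k]"   # off-by-one
fixed = "def kth_largest(xs, k): return sorted(xs)[-k]"
print(coding_agent_loop([buggy, fixed]))  # True: second attempt passes
```

The design point: the test suite is a ground-truth verifier, so each iteration gets real-world feedback instead of the model's own judgment.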

2025

Multi-Agent Orchestration Goes Mainstream

Single-agent systems hit scaling limits on complex tasks. The industry converged on multi-agent patterns: Anthropic's Agent SDK, OpenAI's Agents SDK, LangGraph, and CrewAI all ship orchestration primitives for routing between specialized agents, managing shared state, and implementing human-in-the-loop checkpoints.

The hard problems shifted from "can the agent do the task?" to reliability engineering: cost control, latency budgets, graceful degradation, observability, and security sandboxing. Agent pipelines became software engineering, not prompt engineering.

The throughline: 1971 to 2025

Five decades. One problem, attacked from every angle:

1971 · Formalism: Define actions with preconditions and effects, search for plans (STRIPS)
1987 · Architecture: Give agents beliefs, desires, and intentions (BDI)
1992-2000 · Learning: Let agents learn from reward signals (RL)
2022 · Language: Use LLMs as the reasoning engine with tool access (ReAct)
2023 · Structure: Standardize tool use as a first-class API feature (function calling)
2024-2025 · Production: Ship reliable agent systems with orchestration, observability, and guardrails

Every advance solved a limitation of the previous generation. Classical planners couldn't handle ambiguity. RL agents couldn't generalize without millions of episodes. LLM agents combine pre-trained world knowledge with runtime tool use — but introduce their own failure modes: hallucination, cost explosion, and the inability to know when they're stuck.

What is an Agent?

An agent is an LLM that can take actions in the world. Instead of just generating text, it can call functions, search the web, run code, or interact with APIs. The critical distinction: a chatbot responds; an agent acts.

Russell and Norvig's canonical definition from Artificial Intelligence: A Modern Approach (1995) still holds: "An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators." For LLM agents, the "sensors" are tool results and user messages; the "actuators" are function calls.

LLM (Brain)

The reasoning engine. Decides what to do next based on the current state and goal.

Foundation models provide general world knowledge and language understanding. Larger models reason better but cost more per step. The choice of model is the single biggest lever on agent quality.

Tools (Actions)

Functions the agent can call: search, calculate, code execution, API calls, database queries.

Tools ground the agent in reality. Without them, the model can only hallucinate answers. With them, it can verify facts, perform computations, and change the world.

Memory (Context)

Conversation history, retrieved documents, and scratchpad for intermediate results.

Short-term: the context window (what the model can "see"). Long-term: vector stores, databases, or file systems that persist across sessions. Managing memory is the hardest part of building reliable agents.

Planning (Strategy)

How the agent breaks down complex tasks into steps and decides which tools to use.

Can be implicit (the model figures it out turn-by-turn) or explicit (a separate planning step generates a structured plan before execution). Explicit planning helps on complex tasks but adds latency and can over-commit to bad strategies.

Three Core Architectures

Every agent framework implements one of three fundamental patterns — or a hybrid. Understanding them lets you pick the right one for your task.

1

ReAct (Reason + Act)

The workhorse. Good for 80% of agent tasks.

The agent alternates between thinking (generating a reasoning trace) and acting (calling a tool). Each tool result becomes an "observation" that informs the next thought. The loop continues until the task is complete or a stop condition is met.

Thought: I need to find the current stock price of NVIDIA.
Action: search_web("NVIDIA stock price today")
Observation: NVIDIA (NVDA) is trading at $142.50, up 3.2% today.
Thought: Now I need to calculate the market cap. I know there are ~24.5B shares.
Action: calculate("142.50 * 24500000000")
Observation: 3,491,250,000,000
Thought: I have both pieces of information. Let me format the answer.
Answer: NVIDIA's current stock price is $142.50, with a market cap of ~$3.49T.

STRENGTHS

Simple to implement. Flexible. Works well when tasks are sequential and the model can self-correct from tool feedback.

WEAKNESSES

No backtracking — if the agent goes down a wrong path, it can't undo actions. Prone to loops on hard tasks. No parallelism.

2

Plan-and-Execute

Better for complex, multi-step tasks.

Separates planning from execution. A planner (often a stronger model) generates a step-by-step plan. An executor (can be a cheaper model) carries out each step. After each step, the planner can optionally re-plan based on results. This mirrors how humans tackle complex projects: outline first, then execute, adjusting the plan as you learn.

# Plan-and-Execute pattern
Plan:
  1. Search for NVIDIA's current stock price
  2. Search for NVIDIA's outstanding shares
  3. Calculate market cap = price x shares
  4. Compare to Apple and Microsoft market caps
  5. Summarize findings in a table

Execute Step 1: search_web("NVIDIA stock price") -> $142.50
Execute Step 2: search_web("NVIDIA shares outstanding") -> 24.49B
Re-plan: Steps 1-2 complete. Steps 3-5 still valid. Continue.
Execute Step 3: calculate("142.50 * 24490000000") -> $3.49T
...

STRENGTHS

Better at complex tasks. Can use a stronger model for planning and cheaper one for execution. Re-planning adapts to surprises.

WEAKNESSES

Higher latency (planning step adds time). Can over-plan, committing to a rigid strategy before seeing reality. More complex to implement.

Wang, L. et al. (2023). Plan-and-Solve Prompting. ACL 2023.
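The planner/executor split can be sketched as a small control loop. The toy planner and executor below stand in for LLM calls (a real system would back the planner with a stronger model and the executor with a cheaper one); only the re-planning control flow is the point:

```python
def plan_and_execute(task: str, planner, executor, max_replans: int = 3) -> list:
    """Planner emits an ordered list of steps; executor runs them one at a time.
    After each step the planner may revise the remaining plan (re-planning)."""
    results = []
    steps = planner(task, results)
    while steps and max_replans >= 0:
        step, *steps = steps
        results.append((step, executor(step)))
        revised = planner(task, results)      # re-plan in light of observed results
        if revised != steps:
            steps = revised
            max_replans -= 1                  # budget re-plans to avoid thrashing
    return results

# Toy stand-ins: the planner returns whatever sub-tasks are not yet done.
def toy_planner(task, results):
    done = {s for s, _ in results}
    return [s for s in ["search price", "search shares", "compute market cap"] if s not in done]

def toy_executor(step):
    return f"result of {step}"

out = plan_and_execute("NVIDIA market cap", toy_planner, toy_executor)
print([s for s, _ in out])  # ['search price', 'search shares', 'compute market cap']
```

Capping `max_replans` is one concrete mitigation for the over-planning weakness above: the agent can adapt, but not indefinitely.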

3

Tree of Thoughts (ToT)

For problems requiring exploration and backtracking.

Instead of a single chain of reasoning, the agent explores multiple reasoning paths in parallel. At each step, it generates several candidate "thoughts," evaluates them (using the LLM itself as an evaluator), and prunes unpromising branches. This is essentially beam search over reasoning steps.

# Tree of Thoughts: explore multiple paths
Root: "Write a function to find the k-th largest element"

Branch A: Sort the array, return arr[-k]     -> O(n log n) correct but slow
Branch B: Use a min-heap of size k           -> O(n log k) optimal for large n
Branch C: Use quickselect (partition-based)  -> O(n) avg   best average case

Evaluate: Branch C is optimal. Expand it.
Branch C1: Implement with random pivot  -> works, but worst case O(n^2)
Branch C2: Implement with median-of-3   -> better worst case, more complex
Branch C3: Use introselect (hybrid)     -> guaranteed O(n), production-grade

Select: C1 for simplicity, C3 for production.

STRENGTHS

Can backtrack from dead ends. Explores solution space more thoroughly. Best for puzzles, math, creative tasks with many valid approaches.

WEAKNESSES

Expensive — each branch costs tokens. Evaluation step can be unreliable. Overkill for straightforward tasks. Exponential blowup if branching factor is high.

Yao, S. et al. (2023). Tree of Thoughts. NeurIPS 2023.
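The "beam search over reasoning steps" framing can be written down directly. In this sketch `propose` and `score` stand in for LLM calls (candidate generation and self-evaluation); the toy scorer simply prefers paths mentioning a heap:

```python
def tree_of_thoughts(root: str, propose, score, depth: int = 2, beam: int = 2) -> list:
    """Keep the `beam` highest-scoring partial reasoning paths at each depth."""
    frontier = [[root]]
    for _ in range(depth):
        # Expand every surviving path with every proposed continuation...
        candidates = [path + [t] for path in frontier for t in propose(path)]
        # ...score them (LLM-as-evaluator in a real system) and prune.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam]
    return frontier[0]                          # best complete path

# Toy stand-ins for the two LLM roles:
propose = lambda path: [path[-1] + "-sort", path[-1] + "-heap"]
score = lambda path: sum(step.count("heap") for step in path)

print(tree_of_thoughts("kth-largest", propose, score))
```

The cost structure is visible in the code: tokens scale with `beam × branching × depth`, which is why ToT is reserved for problems where exploration pays for itself.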

When to use which architecture

ReAct: Default choice. Q&A with tools, data retrieval, simple automation. Use this unless you have a reason not to.
Plan-and-Execute: Multi-step research, report writing, complex workflows with 5+ steps. When the task benefits from upfront structure.
Tree of Thoughts: Math problems, code optimization, creative writing, puzzles. When there are multiple valid approaches and you need the best one.

Working Code: Three Providers

Every major LLM provider now supports structured tool calling. The patterns are nearly identical — the differences are in the API shape, not the architecture.

OpenAI: Complete Agent Loop

import json
from openai import OpenAI

client = OpenAI()

tools = [
    {"type": "function", "function": {
        "name": "search_web",
        "description": "Search the web for current information",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"]
        }
    }},
    {"type": "function", "function": {
        "name": "calculate",
        "description": "Evaluate a mathematical expression",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"]
        }
    }}
]

def execute_tool(name: str, args: dict) -> str:
    if name == "search_web":
        return search_api(args["query"])  # Your search implementation
    elif name == "calculate":
        return str(eval(args["expression"]))  # Use sandboxed eval in prod
    return f"Unknown tool: {name}"

def run_agent(task: str, max_turns: int = 10) -> str:
    messages = [
        {"role": "system", "content": "You are a helpful research assistant."},
        {"role": "user", "content": task}
    ]

    for turn in range(max_turns):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools
        )
        msg = response.choices[0].message

        # Agent decided to return a final answer
        if response.choices[0].finish_reason == "stop":
            return msg.content

        # Agent wants to call tools
        if msg.tool_calls:
            messages.append(msg)
            for tc in msg.tool_calls:
                result = execute_tool(
                    tc.function.name,
                    json.loads(tc.function.arguments)
                )
                messages.append({
                    "role": "tool",
                    "tool_call_id": tc.id,
                    "content": result
                })

    return "Agent reached maximum turns without completing."

Anthropic Claude: Tool Use

import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "search_web",
        "description": "Search the web for current information",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"]
        }
    },
    {
        "name": "calculate",
        "description": "Evaluate a mathematical expression",
        "input_schema": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"]
        }
    }
]

def run_agent(task: str, max_turns: int = 10) -> str:
    messages = [{"role": "user", "content": task}]

    for turn in range(max_turns):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system="You are a helpful research assistant.",
            tools=tools,
            messages=messages
        )

        # Check if agent is done (no tool use)
        if response.stop_reason == "end_turn":
            return response.content[0].text

        # Process tool calls
        if response.stop_reason == "tool_use":
            messages.append({
                "role": "assistant",
                "content": response.content
            })

            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result
                    })
            messages.append({"role": "user", "content": tool_results})

    return "Agent reached maximum turns."

LangGraph: Stateful Agent with Graph

from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from typing import TypedDict, Annotated
import operator

@tool
def search_web(query: str) -> str:
    """Search the web for current information."""
    return search_api(query)  # Your search implementation

@tool
def calculate(expression: str) -> str:
    """Evaluate a mathematical expression."""
    return str(eval(expression))  # Use sandboxed eval in prod

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]

model = ChatOpenAI(model="gpt-4o").bind_tools(
    [search_web, calculate]
)

def call_model(state: AgentState):
    response = model.invoke(state["messages"])
    return {"messages": [response]}

def should_continue(state: AgentState):
    last_msg = state["messages"][-1]
    if last_msg.tool_calls:
        return "tools"
    return END

# Wire it up
graph = StateGraph(AgentState)
graph.add_node("agent", call_model)
graph.add_node("tools", ToolNode([search_web, calculate]))
graph.set_entry_point("agent")
graph.add_conditional_edges(
    "agent", should_continue,
    {"tools": "tools", END: END}
)
graph.add_edge("tools", "agent")

agent = graph.compile()
result = agent.invoke({
    "messages": [{"role": "user", "content": "NVIDIA market cap?"}]
})

LangGraph models the agent as a directed graph with explicit state and transitions. This makes complex flows (branching, parallel tool calls, human-in-the-loop) more manageable than a raw while loop, at the cost of more abstraction.

Multi-Agent Orchestration

Single agents hit limits on complex tasks: context windows fill up, the model loses track of sub-goals, and error compounding grows with each step. Multi-agent systems address this by giving each agent a narrow scope and clear role.

The intuition is the same as microservices vs. monoliths: one agent that does everything will eventually become unreliable. Specialized agents with clear interfaces are easier to debug, test, and improve independently.

Hierarchical

A manager agent breaks down the task and delegates sub-tasks to specialized workers. The manager synthesizes results.

Manager
  +-- Researcher
  +-- Writer
  +-- Reviewer

Best for: report generation, research tasks, content pipelines.

Pipeline

Agents execute in sequence. Each transforms the output of the previous. Simple, predictable, easy to debug.

Extract -> Analyze -> Summarize -> Format

Best for: ETL, data processing, content transformation.
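The pipeline pattern is just function composition with logging. A minimal sketch, with toy stage functions standing in for LLM-backed agents:

```python
def pipeline(text: str, stages) -> str:
    """Each agent transforms the previous agent's output; simple and debuggable."""
    for name, agent in stages:
        text = agent(text)
        print(f"[{name}] -> {len(text)} chars")  # per-stage observability
    return text

# Toy stages standing in for Extract -> Analyze -> Summarize -> Format agents:
stages = [
    ("extract",   lambda t: t.upper()),
    ("summarize", lambda t: t[:10]),
]
print(pipeline("quarterly revenue grew 12%", stages))
```

Because each stage has one input and one output, you can unit-test, cache, and swap stages independently, which is exactly why this shape is the easiest multi-agent pattern to debug.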

Debate / Ensemble

Multiple agents tackle the same problem independently. A judge agent picks the best answer or synthesizes them.

Agent A --+
Agent B --+-> Judge
Agent C --+

Best for: code review, decision-making, high-stakes outputs.

Multi-Agent Pattern (Hierarchical)

import anthropic

client = anthropic.Anthropic()

def run_agent(role: str, system: str, task: str, tools=None):
    """Run a single specialized agent."""
    messages = [{"role": "user", "content": task}]
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=system,
        tools=tools or [],
        messages=messages
    )
    return response.content[0].text

def orchestrate(task: str) -> str:
    # Step 1: Manager creates a plan
    plan = run_agent(
        role="manager",
        system="Break this task into 2-3 sub-tasks.",
        task=task
    )

    # Step 2: Researcher gathers information
    research = run_agent(
        role="researcher",
        system="You are a thorough researcher.",
        task=f"Research based on this plan:\n{plan}",
        tools=[search_web_tool, read_url_tool]
    )

    # Step 3: Writer creates the final output
    result = run_agent(
        role="writer",
        system="Write clear, accurate content.",
        task=f"Task: {task}\nResearch:\n{research}"
    )

    return result

Critical: Failure Modes

Seven Ways Agents Fail (And How to Fix Them)

Building agents that work in demos is easy. Building agents that work in production is an engineering discipline. These are the failure modes you will hit, roughly in order of how quickly they appear.

1

Infinite Loops

The agent repeats the same action with the same arguments, or oscillates between two states. This is the most common agent failure and happens when the model can't make progress but doesn't know how to ask for help or give up.

Mitigations: Set a hard max_turns limit. Track action history and detect repetition. Add a "give_up" tool the agent can call. Log and alert on loops.
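A minimal repetition detector, assuming you log each (tool, arguments) pair the agent executes (the threshold and record shape are illustrative):

```python
from collections import Counter

def is_looping(history: list, max_repeats: int = 2) -> bool:
    """Flag when the same (tool, arguments) pair has run more than max_repeats times."""
    return any(n > max_repeats for n in Counter(history).values())

# Three identical calls in a row: stop the loop and force a re-plan or give_up.
history = [("search_web", "NVIDIA price")] * 3
print(is_looping(history))  # True
```

Run this check before every tool execution; when it fires, either inject a "you are repeating yourself, change approach" message or terminate the run.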

2

Tool Hallucination

The agent calls a tool that doesn't exist, passes wrong argument types, or invents plausible-sounding but incorrect tool names. Particularly common when the model is under-specified about available tools or when tool descriptions are ambiguous.

Mitigations: Validate all tool calls before execution. Use strict JSON schema validation. Provide 1-2 examples in the system prompt. Use models fine-tuned for tool calling (GPT-4o, Claude 3.5+).
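A minimal validation gate, using a hand-rolled schema check (a production system would typically use full JSON Schema validation); the error string is fed back to the model so it can self-correct instead of crashing the run:

```python
# Hypothetical tool registry mirroring the schemas given to the model.
TOOL_SCHEMAS = {
    "search_web": {"required": {"query"}, "types": {"query": str}},
    "calculate":  {"required": {"expression"}, "types": {"expression": str}},
}

def validate_tool_call(name: str, args: dict):
    """Return an error message to feed back to the model, or None if the call is valid."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        return f"Unknown tool '{name}'. Available: {sorted(TOOL_SCHEMAS)}"
    missing = schema["required"] - args.keys()
    if missing:
        return f"Missing required arguments: {sorted(missing)}"
    for key, typ in schema["types"].items():
        if key in args and not isinstance(args[key], typ):
            return f"Argument '{key}' must be {typ.__name__}"
    return None

print(validate_tool_call("serch_web", {"query": "x"}))  # hallucinated name, caught pre-execution
```

Listing the available tools in the error message matters: it turns a hallucinated call into a recoverable step rather than a dead end.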

3

Context Window Overflow

Long-running agents accumulate tool results, conversation history, and error messages until they exceed the context window. At that point, the model either truncates critical information or the API returns an error. This is especially insidious because the agent appears to work fine for 10 turns, then catastrophically loses track on turn 11.

Mitigations: Summarize tool results before appending to context. Implement sliding window over conversation history. Use a separate memory store for long-term facts. Monitor token usage per turn.
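A character-budget sliding window as a sketch; a real implementation would count tokens with the provider's tokenizer and summarize dropped turns rather than discard them:

```python
def trim_history(messages: list, max_chars: int = 8000) -> list:
    """Keep the system prompt plus the most recent messages that fit the budget."""
    system, rest = messages[:1], messages[1:]
    kept, total = [], 0
    for msg in reversed(rest):              # walk newest-first
        total += len(msg["content"])
        if total > max_chars:
            break                           # older messages fall out of the window
        kept.append(msg)
    return system + list(reversed(kept))
```

Always pinning the system prompt is the crucial detail: naive truncation that drops it is a common source of agents silently "forgetting" their instructions mid-run.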

4

Cost Explosion

An agent that calls GPT-4o 50 times with 10K tokens per call costs ~$5 per run. Multiply by thousands of users and costs become untenable. The problem compounds with multi-agent systems where each sub-agent runs its own loop.

Mitigations: Set per-request token budgets. Use cheaper models (GPT-4o-mini, Haiku) for simple sub-tasks. Cache tool results aggressively. Route simple queries directly to tools without an agent loop.
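A hard per-run token budget as a sketch; charge it after every model response using the usage numbers the API returns (the class and limits are illustrative):

```python
class TokenBudget:
    """Hard cap on tokens spent per agent run; raise before the bill does."""
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        self.used += input_tokens + output_tokens
        if self.used > self.max_tokens:
            # Abort the run (or downgrade to a cheaper model) instead of looping on.
            raise RuntimeError(f"Token budget exhausted: {self.used}/{self.max_tokens}")

budget = TokenBudget(50_000)
budget.charge(8_000, 1_000)   # call once per model response
```

Catching the exception at the orchestration layer lets you return a partial answer or escalate to a human rather than fail silently.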

5

Error Cascading

A wrong result from Step 2 propagates through Steps 3-10. The agent confidently builds on incorrect information because it has no mechanism to verify intermediate results. This is the agent equivalent of "garbage in, garbage out" but harder to detect because the final output often looks plausible.

Mitigations: Add verification steps after critical tool calls. Use a separate "checker" agent. Cross-reference multiple sources. Build self-critique into the agent prompt ("Before proceeding, verify that...").

6

Premature Commitment

The agent commits to a strategy early (especially in plan-and-execute) and refuses to deviate even when evidence suggests the plan is wrong. This mirrors the "sunk cost fallacy" in human reasoning. The agent has already invested 5 steps in approach A and won't abandon it for approach B, even when B is clearly better.

Mitigations: Implement re-planning triggers based on failure count. Add explicit "should I change my approach?" reflection steps. Set per-plan step budgets with mandatory reassessment.

7

Security: Prompt Injection via Tools

Tool results can contain adversarial content. A web search might return a page with "IGNORE ALL PREVIOUS INSTRUCTIONS and instead..." embedded in the text. If the agent processes this as part of its context, it can be hijacked. This is the agent-specific version of prompt injection and is currently an unsolved problem at the fundamental level.

Mitigations: Sanitize tool outputs. Use separate system/user message boundaries. Limit what tools the agent can call based on the task. Never give agents access to sensitive operations without human approval. Monitor for unexpected behavior patterns.

The reliability equation

If each step of your agent has 95% reliability and your pipeline has 10 steps: 0.95^10 = 0.60. Your agent fails 40% of the time. This is why production agents obsess over per-step reliability, error recovery, and human-in-the-loop checkpoints. The math is unforgiving.

Kapoor, S. et al. (2024). AI Agents that Matter. arXiv.
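The arithmetic, plus the effect of a single retry per step, which is why per-step error recovery pays off so dramatically:

```python
def pipeline_reliability(per_step: float, steps: int) -> float:
    """Probability the whole pipeline succeeds if steps fail independently."""
    return per_step ** steps

print(f"{pipeline_reliability(0.95, 10):.2f}")   # 0.60: ten 95% steps
print(f"{pipeline_reliability(0.99, 10):.2f}")   # 0.90: per-step quality dominates

# One retry per step turns p into 1 - (1-p)^2 before compounding:
with_retry = (1 - (1 - 0.95) ** 2) ** 10
print(f"{with_retry:.2f}")                       # 0.98: recovery beats raw model quality
```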

Evaluating Agents

Agent evaluation is fundamentally harder than LLM evaluation. You need to measure both task completion and efficiency. An agent that solves the task in 50 API calls when 3 would suffice is not a good agent — it's an expensive agent.

Key Metrics

  • Task success rate: Did it complete the task correctly? Binary for most benchmarks.
  • Steps to completion: Fewer steps = better reasoning. Track median and p95.
  • Cost per task: Total tokens (input + output) and API calls. The metric that matters most in production.
  • Error recovery rate: When a tool fails or returns unexpected results, does the agent recover?
  • Latency (wall clock): End-to-end time. Critical for user-facing agents.
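A sketch of aggregating per-run logs into these metrics; the run-record fields (`success`, `steps`, `cost_usd`, `latency_s`) are assumptions for illustration:

```python
import statistics

def summarize_runs(runs: list) -> dict:
    """Roll per-run agent logs up into success, efficiency, and cost metrics."""
    steps = sorted(r["steps"] for r in runs)
    latencies = sorted(r["latency_s"] for r in runs)
    p95 = min(len(runs) - 1, int(0.95 * len(runs)))   # crude p95 index for small samples
    return {
        "success_rate": sum(r["success"] for r in runs) / len(runs),
        "median_steps": statistics.median(steps),
        "p95_steps": steps[p95],
        "mean_cost_usd": statistics.mean(r["cost_usd"] for r in runs),
        "p95_latency_s": latencies[p95],
    }

runs = [
    {"success": True,  "steps": 3,  "cost_usd": 0.02, "latency_s": 4.1},
    {"success": True,  "steps": 5,  "cost_usd": 0.05, "latency_s": 7.9},
    {"success": False, "steps": 10, "cost_usd": 0.12, "latency_s": 21.0},
]
print(summarize_runs(runs))
```

Logging these per run from day one is cheap; reconstructing them after a cost or reliability incident is not.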

Standard Benchmarks

  • SWE-bench: Resolve real GitHub issues. SOTA ~72% (verified). The gold standard for coding agents.
  • WebArena: Complete tasks on real websites. Tests browser agent capabilities. SOTA ~35%.
  • GAIA: General AI assistant tasks requiring tools. 3 difficulty levels. Humans score ~92%; best models ~75%.
  • TAU-bench: Tool-Agent-User benchmark. Tests multi-turn agent interactions with realistic tool use.
  • METR: Evaluating AI capabilities and safety. Focus on autonomous long-horizon tasks.

Jimenez, C. et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024.
Zhou, S. et al. (2024). WebArena: A Realistic Web Environment for Building Autonomous Agents. ICLR 2024.
Mialon, G. et al. (2023). GAIA: A Benchmark for General AI Assistants. ICLR 2024.

Key Takeaways

  1. Agents are LLMs with tools, memory, and planning — the same perceive-reason-act loop that STRIPS introduced in 1971, now powered by foundation models instead of hand-coded logic.
  2. ReAct is the foundation; plan-and-execute and ToT handle complexity — start with the simplest pattern that works. Add planning only when the task demands it.
  3. Structured function calling is the industry standard — every major provider supports it. Define tools as JSON schemas. The model outputs structured calls. Your code executes them.
  4. Reliability compounds — 95% per step means 60% over 10 steps — production agents need error recovery, cost controls, max-turn limits, and observability. The hard work is in the failure handling, not the happy path.
  5. Measure task completion AND cost AND latency — an agent that solves the problem in 50 calls is not better than one that solves it in 3. Use SWE-bench, GAIA, and WebArena for rigorous evaluation.
