TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents
100 executable tasks, 27 MCP servers, 324 tools, and closed-loop multimodal verification for end-to-end tool use.
This day in AI
CodeSOTA scans each day's arXiv listings for papers that could become useful benchmark rows, model-selection evidence, or product-facing research notes. This page keeps the daily trail visible instead of burying it in one-off reports.
Calendar
Tuesday
The day was dominated by agent runtime security, process-aware benchmarks, self-improving agent systems, and sparse reasoning credit assignment.
900 entries, including 142 new submissions
Monday
Monday's useful signal was practical agent infrastructure: SaaS and shopping environments, formal monitoring, research-agent scaffolding, and vertical medical systems.
200 entries, sampled LLM scout plus full deterministic screen
Friday
Friday's signal was about making agent behavior legible: orchestration graphs, memory cold starts, design patterns, education agents, and benchmarked governance workflows.
445 entries, including 100 new submissions
The day was dominated by agent runtime security, process-aware benchmarks, self-improving agent systems, and sparse reasoning credit assignment.
Papers and links
100 executable tasks, 27 MCP servers, 324 tools, and closed-loop multimodal verification for end-to-end tool use.
Production-style monitoring for MCP agent activity with ADR-Bench covering 302 tasks and 17 attack techniques.
Evaluates delivered, browser-accessible games, separating a minimum working delivery from excellent requirement satisfaction.
660 SymPy-certified linear algebra problems plus a failure taxonomy for diagnosing mathematical reasoning.
Claims that reasoning failures concentrate in a small number of early tokens, a finding useful for intervention and evaluation design.
Turns successes and failures into graph memory, giving self-evolving agents a more inspectable substrate.
Method note
The full arXiv /new batch was collected on May 19. The LLM scout covered all new submissions; deterministic benchmark detection covered the full batch. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
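This page does not say how the deterministic benchmark detection works. As a rough sketch only, assuming it amounts to keyword matching over abstracts, a screen of that kind might look like the following. The cue list, the `Entry` type, and the `screen` function are hypothetical illustrations, not CodeSOTA's actual pipeline.

```python
import re
from dataclasses import dataclass

# Hypothetical cue list: the real screen's rules are not given on this page.
BENCHMARK_CUES = re.compile(
    r"\b(benchmarks?|leaderboards?|evaluation suite|\d+\s+tasks)\b",
    re.IGNORECASE,
)


@dataclass
class Entry:
    """One listing entry; field names are assumptions for illustration."""
    entry_id: str
    abstract: str


def screen(entries):
    """Return the entries whose abstract contains at least one benchmark cue.

    The same input always yields the same output, which is what makes
    this kind of pass cheap enough to run over a full batch.
    """
    return [e for e in entries if BENCHMARK_CUES.search(e.abstract)]
```

A pass like this only flags candidates. It cannot confirm that a paper reports usable tables or rankings, which is why flagged papers still need PDF/source extraction.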
Monday's useful signal was practical agent infrastructure: SaaS and shopping environments, formal monitoring, research-agent scaffolding, and vertical medical systems.
Papers and links
A practical benchmark direction for agents operating across SaaS workflows, useful for procurement-style agent evaluation.
High-priority e-commerce agent environment surfaced by both the scout and deterministic screen.
Agent architecture signal around designing and evaluating practical multi-step systems.
Runtime monitoring and formal constraints as a concrete control layer for LLM agents.
Deep-research agent work that matches CodeSOTA's interest in paper-to-evidence workflows.
Open medical AI system surfaced as a practical extraction target for model, data, and benchmark claims.
Method note
The dated May 18 arXiv /recent section was collected in full. The full LLM run was stopped in favor of a subsample: a 60-paper LLM scout plus deterministic screening across all 200 summaries. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
Friday's signal was about making agent behavior legible: orchestration graphs, memory cold starts, design patterns, education agents, and benchmarked governance workflows.
Papers and links
Agent orchestration and reliability signal around graph-structured coordination.
Agent memory cold-start work; useful for understanding reusable memory in deployed agents.
High-priority benchmark candidate from the Friday benchmark report.
Education-agent evaluation target with obvious CodeSOTA task-page value.
Entity-centered benchmark surfaced as a strong extraction candidate.
Useful framing for separating operational task success from oversight and governance quality.
Method note
This day's summary uses the existing May 15 batch and the reports already in the local paper pipeline. Claims here are abstract- and report-level triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
Next extraction pass
The valuable follow-up is to pull benchmark tables, model lists, task definitions, and failure taxonomies from the strongest papers. That gives CodeSOTA rows users can compare, not merely links they can browse.
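To make "rows users can compare" concrete, here is a minimal sketch of what an extracted evidence row could look like. The `EvidenceRow` fields and the `rank` helper are assumptions for illustration; the page does not describe CodeSOTA's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceRow:
    """One extracted result; all field names are hypothetical."""
    benchmark: str
    model: str
    metric: str
    score: float
    source: str  # where the number came from, e.g. a table in the PDF


def rank(rows, benchmark, metric):
    """Return rows for one benchmark and metric, best score first.

    Assumes higher scores are better; an error-style metric would need
    the sort order reversed.
    """
    matching = [r for r in rows if r.benchmark == benchmark and r.metric == metric]
    return sorted(matching, key=lambda r: r.score, reverse=True)
```

Keeping a `source` field on every row ties each number back to the table it was extracted from, which is the difference between an evidence row and a link.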