This day in AI

A calendar of recent papers for people tracking useful AI shifts.

CodeSOTA scans each day's arXiv listings for papers that could become useful benchmark rows, model-selection evidence, or product-facing research notes. This page keeps the daily trail visible instead of burying it in one-off reports.

The May 19 batch was dominated by agent runtime security, process-aware benchmarks, self-improving agent systems, and sparse reasoning credit assignment.

Big picture

  • Runtime agent safety is moving from prompt policies to action interception, MCP monitoring, and host-side controls.
  • Evaluation is shifting toward process-aware tasks: tool trajectories, delivered artifacts, multimodal verification, and human-validated rubrics.
  • Self-improving agents are becoming governed systems with rollback, canary tests, experience graphs, and explicit lifecycle controls.
  • Reasoning work is converging on sparse credit assignment: find the decision tokens or reasoning steps that actually steer the answer.

Benchmarks to extract

  • TOBench for tool-using agent rows
  • ADR-Bench and SLEIGHT-Bench for agent security
  • WebGameBench for coding-agent delivery
  • LinAlg-Bench, CAM-Bench, and GIM for reasoning diagnostics

Papers and links

Benchmark | arXiv:2605.16909

TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

100 executable tasks, 27 MCP servers, 324 tools, and closed-loop multimodal verification for end-to-end tool use.

Safety | arXiv:2605.17380

ADR: An Agentic Detection and Response System

Production-style monitoring for MCP agent activity with ADR-Bench covering 302 tasks and 17 attack techniques.

Benchmark | arXiv:2605.17637

WebGameBench: Requirement-to-Application Evaluation for Coding Agents

Evaluates browser-accessible delivered games, separating minimum working delivery from excellent requirement satisfaction.

Benchmark | arXiv:2605.16675

LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes

660 SymPy-certified linear algebra problems plus a failure taxonomy for diagnosing mathematical reasoning.

Reasoning | arXiv:2605.16874

Reasoning Can Be Restored by Correcting a Few Decision Tokens

Claims reasoning failures concentrate in a small number of early tokens, useful for intervention and evaluation design.

Agent | arXiv:2605.17721

EXG: Self-Evolving Agents with Experience Graphs

Turns successes and failures into graph memory, giving self-evolving agents a more inspectable substrate.

Method note

The full arXiv /new batch was collected on May 19. The LLM scout covered all new submissions; deterministic benchmark detection covered the full batch. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
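
This page does not specify how the deterministic benchmark detection works. As a minimal sketch, assuming it amounts to a fixed pattern match over titles and abstracts, a screen could look like the following. The regex, the function name `detect_benchmarks`, and the input shape (paper ID mapped to text) are illustrative assumptions, not CodeSOTA's actual pipeline.

```python
import re

# Hypothetical pattern: a capitalized name ending in "Bench", with an optional
# hyphen before it (e.g. "TOBench", "ADR-Bench"). The plain word "Benchmark"
# does not match, because a word boundary must directly follow "Bench".
BENCH_NAME = re.compile(r"\b[A-Z][A-Za-z0-9]*-?Bench\b")


def detect_benchmarks(abstracts: dict[str, str]) -> dict[str, list[str]]:
    """Map each paper ID to the sorted, de-duplicated benchmark-like names in its text.

    Papers with no match are omitted, so the result is the list of
    candidates that would go on to manual or PDF-level extraction.
    """
    hits: dict[str, list[str]] = {}
    for paper_id, text in abstracts.items():
        names = sorted(set(BENCH_NAME.findall(text)))
        if names:
            hits[paper_id] = names
    return hits
```

Because the rule is a fixed pattern, rerunning it on the same batch always flags the same papers, which is what makes this kind of pass cheap to apply to every submission. It will miss benchmarks whose names do not follow the "...Bench" convention, so it complements the LLM scout and does not replace it.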

The useful signal on Monday, May 18 was practical agent infrastructure: SaaS and shopping environments, formal monitoring, research-agent scaffolding, and vertical medical systems.

Big picture

  • Real-world web agents are getting eval environments that look more like actual SaaS and commerce workflows.
  • Agent architecture work is becoming cost-aware: context, hierarchy, reasoning depth, and monitoring are treated as budgeted design choices.
  • Formal methods are appearing as runtime guardrails for LLM systems rather than only offline verification work.
  • Medical and robotics papers are packaging open systems around concrete downstream workflows instead of generic model releases.

Benchmarks to extract

  • ShopGym and SaaS-Bench for web-agent task pages
  • PAGER/PAGE Bench for long-form or page-level agent evaluation
  • ToxiAlert-Bench and RoadmapBench from the deterministic screen
  • VLA-AD and RTL-BenchMT for embodied and hardware-facing rows

Papers and links

Benchmark | arXiv:2605.15777

SaaS-Bench

A practical benchmark direction for agents operating across SaaS workflows, useful for procurement-style agent evaluation.

Benchmark | arXiv:2605.16116

ShopGym

High-priority e-commerce agent environment surfaced by both the scout and deterministic screen.

Safety | arXiv:2605.16198

Formal Methods Meet LLMs

Runtime monitoring and formal constraints as a concrete control layer for LLM agents.

Agent | arXiv:2605.16217

Argus Deep Research Agents

Deep-research agent work that matches CodeSOTA's interest in paper-to-evidence workflows.

System | arXiv:2605.16215

Fully Open Meditron

Open medical AI system surfaced as a practical extraction target for model, data, and benchmark claims.

Method note

The dated May 18 arXiv /recent section was collected in full. The full LLM run was stopped in favor of a subsample: a 60-paper scout, plus deterministic screening across all 200 summaries. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
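
The split between a 60-paper scout and a full deterministic screen can be sketched as follows. How the 60 papers were actually chosen is not stated here; the seeded random sample, the function names, and the "scout"/"screen" tags are all assumptions made for illustration.

```python
import random


def pick_scout_subsample(paper_ids, k=60, seed=0):
    """Return a reproducible k-item subsample, preserving the input order.

    Assumption: a seeded uniform random sample. The real selection rule
    for the scout is not documented on this page.
    """
    ids = list(paper_ids)
    if k >= len(ids):
        return ids
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(ids)), k))
    return [pid for i, pid in enumerate(ids) if i in chosen]


def merge_candidates(scout_hits, screen_hits):
    """Tag each flagged paper with the pass (or passes) that surfaced it."""
    merged = {}
    for pid in scout_hits:
        merged.setdefault(pid, set()).add("scout")
    for pid in screen_hits:
        merged.setdefault(pid, set()).add("screen")
    return merged
```

Keeping the source tags makes it easy to prioritize papers flagged by both passes, as in the ShopGym entry above, over papers surfaced by only one.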

The signal on Friday, May 15 was about making agent behavior legible: orchestration graphs, memory cold starts, design patterns, education agents, and benchmarked governance workflows.

Big picture

  • Agent orchestration papers are moving toward explicit graphs, patterns, and inspectable memory instead of one-off prompt chains.
  • Education and industrial benchmarks are becoming stronger examples of domain-specific agent evaluation.
  • Safety and governance papers are trying to separate the task being performed from the governance process wrapped around it.
  • Reasoning papers continue to probe symbolic structure, attributes, and limits of model-based inference.

Benchmarks to extract

  • EntityBench, ClawForge, EduAgentBench, and Herculean
  • PDI-Bench, Collider-Bench, and XDomainBench
  • EduFrameTrap for sycophancy and education-agent failure modes
  • SimPersona for persona or simulation-agent evaluation

Papers and links

Safety | arXiv:2605.14744

Governance-task decoupling

Useful framing for separating operational task success from oversight and governance quality.

Method note

This entry uses the existing May 15 batch and the reports already in the local paper pipeline. Claims here are abstract- or report-level until tables are extracted from individual PDFs. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.

Next extraction pass

Turn the calendar into benchmark evidence, not just reading notes.

The valuable follow-up is to pull benchmark tables, model lists, task definitions, and failure taxonomies from the strongest papers. That gives CodeSOTA rows users can compare, not merely links they can browse.
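
One way to picture a comparable row is as a small typed record that carries the reported number together with where it came from. The sketch below is hypothetical: the `EvidenceRow` field names, the `verified_from_pdf` flag, and the "higher is better" sort are assumptions for illustration, not CodeSOTA's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceRow:
    """One extracted result (hypothetical schema)."""
    arxiv_id: str       # paper the number was taken from
    benchmark: str      # e.g. a benchmark name from the lists above
    model: str          # model name as written in the paper
    metric: str         # metric name as written in the paper
    value: float        # reported score
    source: str         # location in the paper, such as a table label
    verified_from_pdf: bool = False  # False while the row is abstract-level triage


def comparable_rows(rows, benchmark, metric):
    """Keep PDF-verified rows for one benchmark and metric, highest value first.

    Assumption: higher is better for the chosen metric. A real pipeline
    would need a per-metric direction.
    """
    keep = [
        r for r in rows
        if r.benchmark == benchmark and r.metric == metric and r.verified_from_pdf
    ]
    return sorted(keep, key=lambda r: r.value, reverse=True)
```

Filtering on `verified_from_pdf` mirrors the rule in the method notes: abstract-level summaries stay as triage, and only numbers extracted from the PDF or source become rows users can compare.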