This day in AI

Recent paper calendar for people tracking useful AI shifts.

CodeSOTA scans each day's arXiv listing for papers that could become useful benchmark rows, model-selection evidence, or product-facing research notes. This page keeps the daily trail visible instead of burying it in one-off reports.

Latest day: June 9, 2026

Calendar

Research days worth revisiting


Tuesday

June 9, 2026

Today's batch is dominated by new benchmarks for agentic systems, safety, and multimodal reasoning, alongside a strong signal for structured evaluation and self-evolving architectures.

509 entries

Monday

June 8, 2026

Agentic self-evolution, formal verification, safety evaluations, and new domain-specific benchmarks dominate Monday's batch, revealing growing rigor in measuring and improving autonomous systems.

164 entries

Friday

June 5, 2026

A dense crop of task-specific benchmarks and agent systems reveals that frontier models saturate narrow perception tests but still fail on long-horizon, stateful, and multi-modal safety reasoning, while new training and inference techniques show promise for closing those gaps.

280 entries

Thursday

June 4, 2026

Benchmarks and systems for agent safety, long-horizon code optimization, consequence-aware compute allocation, and structured skill learning dominate today's cs.AI submissions.

207 entries

Wednesday

June 3, 2026

Today's batch is dominated by rigorous new benchmarks that expose agent fragility in finance, medicine, and software engineering, alongside system papers that improve evaluation efficiency and safety.

440 entries, including 80 new submissions

Tuesday

June 2, 2026

Today's batch is dominated by new benchmarks that expose persistent gaps in agentic reasoning, tool-use safety, and domain-specific evaluation, while also introducing systems for automated benchmark evolution and step-level process diagnosis.

946 entries, including 173 new submissions

Wednesday

May 20, 2026

New benchmarks for agentic routing, long-horizon software development, memory evaluation, and healthcare workflows reveal persistent gaps in frontier models, while dynamic layer routing offers a path to more efficient LLM inference.

460 entries, including 70 new submissions

Tuesday

May 19, 2026

The day was dominated by agent runtime security, process-aware benchmarks, self-improving agent systems, and sparse reasoning credit assignment.

900 entries, including 142 new submissions

Monday

May 18, 2026

Monday's useful signal was practical agent infrastructure: SaaS and shopping environments, formal monitoring, research-agent scaffolding, and vertical medical systems.

200 entries, sampled LLM scout plus full deterministic screen

Friday

May 15, 2026

Friday's signal was about making agent behavior legible: orchestration graphs, memory cold starts, design patterns, education agents, and benchmarked governance workflows.

445 entries, including 100 new submissions

Today's batch is dominated by new benchmarks for agentic systems, safety, and multimodal reasoning, alongside a strong signal for structured evaluation and self-evolving architectures.

Big picture

Agent benchmarks expand to long-horizon, hybrid-interface, and adversarial settings
Safety and alignment work targets emergent misalignment, compliance, and overconfidence
Structured evaluation frameworks for tabular, medical, and formal reasoning domains
Self-evolution and influence-guided training for reasoning improvement

Benchmarks to extract

WeaveBench: extract PassRate and trajectory-aware judge correlation on hybrid tasks
KernelBench: confirm attack success rate reduction from 62% to 0% after hacker-fixer loop
TRL-Bench: verify that generic text encoders lead on surface-text tasks, tabular specialists on aligned tasks
IMUG-Bench: confirm exposure bias gap between understanding and generation in multi-turn settings

Papers and links

Benchmark (arXiv:2606.09426)

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

114 tasks requiring GUI+CLI+code orchestration; best PassRate 41.2%.


Safety (arXiv:2606.08960)

Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops

Hacker-fixer loop drives attack success from 62% to 0% on KernelBench.


Reasoning (arXiv:2606.09052)

INFUSER: Influence-Guided Self-Evolution Improves Reasoning

Co-training framework with DuGRPO; 20% relative gain on Olympiad/SuperGPQA.


Benchmark (arXiv:2606.09323)

TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

20 models, 16 tasks; shows encoder quality is capability-specific.


Benchmark (arXiv:2606.09169)

IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

3,113 samples across static, temporal, and hybrid tasks; exposes exposure bias.


System (arXiv:2606.08106)

PACE: Anytime-Valid Acceptance Tests for Self-Evolving Agents

Training-free commit gate reduces false commits from 30–42% to near zero.


Method note

Sampled 80 deterministic candidates from 509 cs.AI entries, prioritized by artifact names, benchmark language, and quantitative evidence, then selected 6 papers covering agent benchmarks, safety, reasoning, and structured evaluation. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
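
The screening step described in this note can be pictured as a deterministic scoring pass over titles and abstracts. The sketch below is a guess at its shape, not CodeSOTA's actual pipeline: the regular expressions, the weights, and the names `Entry`, `signal_score`, and `top_candidates` are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical signal patterns; the real screen's rules are not given on this page.
ARTIFACT_NAME = re.compile(r"\b[A-Z][A-Za-z0-9]*-?(?:Bench|Arena|Eval|Gym)\b")
BENCHMARK_WORDS = re.compile(r"\b(benchmark|leaderboard|state-of-the-art|sota)\b", re.I)
QUANTITATIVE = re.compile(r"\d+(?:\.\d+)?\s?(?:%|pp\b|x\b)")


@dataclass(frozen=True)
class Entry:
    arxiv_id: str
    title: str
    abstract: str


def signal_score(entry: Entry) -> int:
    """Score one entry by artifact names, benchmark language, and numbers."""
    text = f"{entry.title} {entry.abstract}"
    score = 3 * len(set(ARTIFACT_NAME.findall(text)))      # named artifacts
    score += 2 * bool(BENCHMARK_WORDS.search(text))          # benchmark language
    score += min(len(QUANTITATIVE.findall(text)), 3)         # capped numeric evidence
    return score


def top_candidates(entries, k):
    """Return the k highest-scoring entries; ties break on arXiv ID so reruns agree."""
    return sorted(entries, key=lambda e: (-signal_score(e), e.arxiv_id))[:k]
```

Breaking ties on the arXiv ID is what makes the sample reproducible: the same listing always yields the same candidates, which a random sample would not.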

Agentic self-evolution, formal verification, safety evaluations, and new domain-specific benchmarks dominate Monday's batch, revealing growing rigor in measuring and improving autonomous systems.

Big picture

Closed-loop self-evolution frameworks that reuse agent traces for targeted task generation
Formal modeling and verification of agent workflows using dependent-type languages
Strategic attack selection as a critical blind spot in agentic AI control evaluations
A surge of domain-specific benchmarks covering macOS GUIs, drug design, and mathematical research discussions

Benchmarks to extract

Socratic-SWE: exact SWE-bench Verified score per iteration (abstract reports 50.40% after 3 rounds)
MacArena: per-model accuracy on macOS-native subset (abstract reports leader trailing >26%)
Attack Selection: exact safety reduction in pp at 1% audit (abstract reports 20pp on BashArena and LinuxArena start policy)
Lean4Agent: verification-passing vs. failing workflow accuracy on SWE-Bench-Verified subset (abstract reports 11.94% average difference)

Papers and links

Agent (arXiv:2606.07412)

Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills

Closed-loop self-evolution distills solving traces into structured skills to generate targeted repair tasks; achieves 50.40% on SWE-bench Verified after three iterations.


Benchmark (arXiv:2606.06560)

MacArena: Benchmarking Computer Use Agents on an Online macOS Environment

421 manually verified tasks (50 apps) on native Apple Silicon; model rankings invert between ported and macOS-native tasks (leader trails by >26%), revealing cross-platform GUI competence gaps.


Safety (arXiv:2606.06529)

Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety

Strategic start/stop attack policies reduce measured safety by 20–28pp at 1% audit budget on BashArena and LinuxArena, suggesting control evaluations may be overly optimistic.


System (arXiv:2606.06523)

Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory

First Lean4-based framework for agent behavior verification; verified workflows outperform failing ones by 11.94% on SWE-Bench-Verified and ELAIP-Bench, with 7.47% further gain from LeanEvolve revision.


Benchmark (arXiv:2606.07462)

Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle

AARRI-Bench targets granular research behavior; best system (Mini-SWE-Agent + Claude Opus 4.7) achieves only 68.3%, frequently missing subtle details humans catch.


Benchmark (arXiv:2606.06717)

ShallowBench: Benchmarking Generative Drug Design Models on Shallow-Pocket Targets

5,780 shallow-pocket targets from CrossDocked2020; SOTA generative models show weaker predicted binding affinity on low-concavity interfaces (e.g., KRAS, MYC), highlighting a key failure mode.


Method note

Collected the 164-entry cs.AI recent-day listing for June 8, 2026 via the arXiv API, deterministically sampled the top 77 benchmark/SOTA-signal candidates, then manually selected six papers balancing agent systems, safety, and domain-specific benchmarks. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
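
For readers who want to reproduce a collection step like this one, here is a minimal sketch against the public arXiv API. It only builds the query URL and parses an Atom response; no network call is made. The function names are hypothetical, and a date-sorted API query is an approximation of a recent-day listing, so cross-lists and replacements may not match the listing page exactly.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def recent_listing_url(category="cs.AI", start=0, max_results=100):
    """Build a query URL for the newest submissions in one category."""
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "start": start,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def parse_entries(feed_xml):
    """Extract (arxiv_id, title) pairs from an Atom feed returned by the API."""
    root = ET.fromstring(feed_xml)
    rows = []
    for entry in root.findall("atom:entry", ATOM):
        abs_url = entry.findtext("atom:id", default="", namespaces=ATOM)
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        # The <id> element is an abs URL; titles may wrap across lines.
        rows.append((abs_url.rsplit("/abs/", 1)[-1], " ".join(title.split())))
    return rows
```

Paging through a full day means advancing `start` in steps of `max_results` until a response contains no entries.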

Big picture

Long-running / stateful agent evaluation emerges as a core challenge (SentinelBench, CL-Bench, Agents' Last Exam)
Code-agent benchmarks shift from completion to repository-level tool-use and framework evaluation (TensorBench, ADK Arena)
Safety and jailbreak detection advance with both system-level guards and multi-modal safety benchmarks (GuardNet, MCBench, SlotGCG)
New training recipes (GRPO variants, CKA distillation, hypernetwork adapters) promise more efficient and reliable reasoning and control

Benchmarks to extract

SentinelBench task completion and reaction-time tradeoffs (table in §4)
CL-Bench gain metric isolating online learning from model capability
TensorBench pass rates and pairwise Cohen's κ across 7 agents
MCBench cross-modal safety accuracy breakdown by modality combination and risk category
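
When the TensorBench agreement numbers are extracted, it helps to have the statistic at hand. The sketch below computes Cohen's κ between two agents from per-task pass/fail vectors, then repeats it for every pair. Treating each agent's results as a binary vector is an assumption about how the paper sets this up, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of 0/1 (fail/pass) outcomes."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pass_a, pass_b = sum(a) / n, sum(b) / n
    # Agreement expected by chance if the two agents passed independently.
    expected = pass_a * pass_b + (1 - pass_a) * (1 - pass_b)
    if expected == 1.0:
        return math.nan  # both agents are constant; kappa is undefined
    return (observed - expected) / (1 - expected)


def pairwise_kappa(results):
    """Map each unordered agent pair to its kappa; `results` maps name -> outcomes."""
    return {
        (m1, m2): cohens_kappa(results[m1], results[m2])
        for m1, m2 in combinations(sorted(results), 2)
    }
```

A κ near zero means two agents agree on which tasks pass no more often than chance would predict, even if their overall pass rates are similar.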

Papers and links

Benchmark (arXiv:2606.05342)

SentinelBench: A Benchmark for Long-Running Monitoring Agents

100 tasks across 10 synthetic web environments test whether agents can monitor, wait, and react promptly over time rather than acting continuously.


Benchmark (arXiv:2606.05661)

Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

First expert-validated benchmark for continual learning across 6 domains; naive ICL outperforms dedicated memory systems, exposing headroom for better stateful architectures.


Benchmark (arXiv:2606.05570)

TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework

199 repo-level tasks on a real PyTorch extension; strongest agent passes 64.8%, and pairwise agreement between agents is low (κ=0.05), revealing task-specific skill gaps.


Safety (arXiv:2606.05177)

MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models

1,196 scenarios across vision, audio, and text; current Omni LLMs fail to integrate cross-modal cues for safety judgments, performing better only when salient signals are present.


Agent (arXiv:2606.05922)

Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts

Self-supervised method that re-solves past tasks and selects harness updates by pairwise self-preference; improves SWE-Bench Pro pass rate from 59% to 78% without external labels.


Reasoning (arXiv:2606.06468)

Goedel-Architect: Streamlining Formal Theorem Proving with Blueprint Generation and Refinement

Uses blueprint dependency graphs and parallel Lean proving with DeepSeek-V4-Flash; achieves 99.2% pass@1 on MiniF2F-test and 75.6% on PutnamBench at a fraction of prior cost.


Method note

Collected all 280 cs.AI entries from the June 5 arXiv recent-day listing, filtered 80 deterministic candidates with strong benchmark or SOTA language, sampled metadata for 60, and selected 6 papers with the highest combined artifact signal, domain diversity, and practical impact on agent/benchmark/safety evaluation. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.

Benchmarks and systems for agent safety, long-horizon code optimization, consequence-aware compute allocation, and structured skill learning dominate today's cs.AI submissions.

Big picture

Agent safety and intervention timing: new benchmarks for AI companion safety and a critical look at LLM-judge-based intervention triggers
Long-horizon and consequence-aware reasoning: benchmarks for iterative code optimization and compute allocation based on task cost
Structured skill and memory for agents: graph-based skill protocols, parametric memory for self-evolving agents, and state-grounded skill retrieval
Cross-domain and multi-modal benchmarks: unified CAD, end-to-end autonomous driving, and cybersecurity vulnerability lifecycle evaluation

Benchmarks to extract

AICompanionBench: per-category accuracy of LLM-as-judge on 9 safety risk categories (e.g., manipulation, self-harm)
AutoLab: task completion rate under wall-clock budget across 36 long-horizon optimization tasks
CyberGym-E2E: end-to-end success rate on vulnerability discovery, PoC generation, and patch generation for 920 CVEs
SWE-bench Lite: cost-weighted loss reduction under consequence-aware vs. difficulty-aware compute allocation
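
To make the last item concrete, here is one plausible reading of "cost-weighted loss": each task carries a consequence cost, and the loss is the share of total cost sitting on failed tasks. This definition and the function names are assumptions for illustration; the paper's exact formulation still needs to be pulled from the PDF.

```python
def cost_weighted_loss(failed, costs):
    """Fraction of total consequence cost that falls on failed tasks.

    `failed` holds one bool per task; `costs` holds that task's (assumed)
    consequence cost. Both definitions are illustrative, not the paper's.
    """
    if len(failed) != len(costs) or not costs:
        raise ValueError("need one cost per task and at least one task")
    total = sum(costs)
    if total <= 0:
        raise ValueError("total cost must be positive")
    return sum(c for f, c in zip(failed, costs) if f) / total


def relative_reduction(baseline, improved):
    """Relative drop from a baseline loss, e.g. 0.25 for a 25% reduction."""
    if baseline <= 0:
        raise ValueError("baseline loss must be positive")
    return (baseline - improved) / baseline
```

With costs of 1, 1, and 8, fixing only the expensive failure takes the loss from 0.9 to 0.1, while fixing a cheap one takes it from 0.9 to 0.8. That asymmetry is the reason to route compute by consequence and not by difficulty alone.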

Papers and links

Benchmark (arXiv:2606.04867)

AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety

First public benchmark (2,123 real Replika conversations) for fine-grained safety risk categories; reveals LLMs struggle with implicit unsafe interactions like manipulation.


System (arXiv:2606.04402)

Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation

Proposes consequence-aware test-time compute allocation; on SWE-bench Lite, reduces cost-weighted loss by 22–33% vs. difficulty-only routing.


Benchmark (arXiv:2606.05080)

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

36 expert-curated tasks for ultra long-horizon closed-loop optimization; finds that persistence, not initial quality, is the dominant success predictor.


Agent (arXiv:2606.04536)

Scaling Self-Evolving Agents via Parametric Memory

Introduces TMEM: agents absorb distilled supervision into fast LoRA weights via online updates, outperforming summary/retrieval baselines on LoCoMo and LongMemEval.


System (arXiv:2606.04781)

AIP: A Graph Representation for Learning and Governing Agent Skills

Models skills as directed execution graphs; compiling human-written skills to AIP raised Claude Sonnet's mean task reward from 0.60 to 0.71 on SkillsBench.


Benchmark (arXiv:2606.04460)

CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities

920 real-world vulnerabilities across 139 open-source projects; evaluates full lifecycle of discovery, PoC generation, and patch generation.


Method note

Sampled 60 of 207 cs.AI entries from the June 4 arXiv listing, then selected 6 papers with the strongest benchmark, agent, safety, or infrastructure signals based on artifact names, known benchmarks, and SOTA language in titles and abstracts. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.

Big picture

Agent benchmarks for high-stakes domains: finance (BigFinanceBench, Hedge-Bench) and clinical software (MedCUA-Bench) show frontier models score below 60%.
Process-level evaluation gains traction: AgentLens reveals 10.7% of passing SWE-agent trajectories are 'Lucky Passes' that tests miss.
Resource-efficient evaluation: ProEval uses transfer learning to estimate performance with 8–65x fewer samples; TriEval runs bias/toxicity checks on a laptop.
Safety and abstention: AgentRedBench cuts indirect prompt injection ASR from 69.9% to 2.4%; a new taxonomy formalizes when agents should refuse to act.

Benchmarks to extract

Extract BigFinanceBench rubric scores per workflow step to localize where agents fail (best system 58.8%).
Confirm Hedge-Bench deterministic grading methodology and whether the <16% score holds for GPT-5 and Claude Opus 4.
Verify AgentLens-Bench 'Lucky Pass' rate (10.7%) and whether model rankings shift by 5+ positions when using quality score vs. pass rate.
Extract AgentRedBench ASR per model and per attack type to confirm the 2.4% ASR with AgentRedGuard at 0.37% FPR.
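
The AgentLens check above asks whether rankings move when models are ordered by a quality score instead of pass rate. A minimal sketch of that comparison follows; the function names and any model names fed to it are hypothetical, and the scores would come from the extracted tables.

```python
def ranks(scores):
    """Rank models 1..n by descending score; ties break alphabetically."""
    ordered = sorted(scores, key=lambda model: (-scores[model], model))
    return {model: position for position, model in enumerate(ordered, start=1)}


def rank_shifts(pass_rate, quality):
    """Positions gained (negative) or lost (positive) when ranking by quality."""
    if pass_rate.keys() != quality.keys():
        raise ValueError("both metrics must cover the same models")
    by_pass, by_quality = ranks(pass_rate), ranks(quality)
    return {model: by_quality[model] - by_pass[model] for model in by_pass}


def models_shifting_at_least(pass_rate, quality, threshold=5):
    """Models whose rank moves by `threshold` or more positions in either direction."""
    shifts = rank_shifts(pass_rate, quality)
    return sorted(m for m, delta in shifts.items() if abs(delta) >= threshold)
```

Running `models_shifting_at_least` on the extracted table gives a direct yes/no answer to the "5+ positions" question.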

Papers and links

Benchmark (arXiv:2606.03829)

BigFinanceBench: A Workflow-Grounded Benchmark for Financial-Research Agents

928 expert-authored financial tasks with point-weighted rubrics; best agent scores only 58.8%.


Benchmark (arXiv:2606.03918)

Hedge-Bench: Benchmarking Agents on Hard, Realistic Tasks Pertaining to Financial Reasoning

102 real hedge-fund analyst tasks with deterministic grading; frontier models score below 16%.


Benchmark (arXiv:2606.03203)

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

18 clinical scenarios across 10 domains; best closed-source model reaches 54.2% strict success, open-source agents average 2.5%.


Benchmark (arXiv:2605.12925)

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

Process-level analysis of 2,614 SWE-agent trajectories; 10.7% of passing runs are 'Lucky Passes' with chaotic behavior.


System (arXiv:2604.23099)

ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

Transfer-learning framework using Gaussian Processes; estimates performance within 1% of ground truth with 8–65x fewer samples.


Safety (arXiv:2606.02240)

AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations

215 indirect prompt injection scenarios across 24 enterprise integrations; guard model cuts attack success rate from 69.9% to 2.4%.


Method note

Sampled 60 high-signal entries from 440 total (80 new, 203 cross-list, 157 replacements) by prioritizing papers with named benchmarks, SOTA claims, or safety/agent artifacts in title or abstract, then selected 6 for the digest. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.

Big picture

Agent safety and deception detection: SPADE-Bench and TOP-Bench formalize plan-action divergence and tool-orchestration privacy leakage, showing that current agents leak sensitive conclusions at high rates.
Benchmark evolution and efficiency: BenchEvolver automatically generates harder coding problems from saturated benchmarks, while MIS-based prompt selection reduces evaluation cost without distorting rankings.
Domain-specific agent evaluation: New benchmarks for smart homes (SMH-Bench), local services (LocalSearchBench), and mobile GUI agents (MobiBench) reveal that frontier models struggle with multi-hop reasoning and ambiguity in vertical domains.
Process-level diagnosis: AgentProcessBench and AutoMedBench provide step-level scoring of agent trajectories, showing that verification and error-handling stages are the weakest links in long-horizon tasks.

Benchmarks to extract

BenchEvolver: LiveCodeBench-Plus Pass@1 for frontier models (extract from Table 2).
SPADE-Bench: Leakage rate and H-score across models (extract from main results).
AgentProcessBench: Step-level accuracy of process reward models vs. outcome supervision (extract from Table 3).
SMH-Bench: Accuracy breakdown by task category and home complexity (extract from Table 2).

Papers and links

Benchmark (arXiv:2606.01286)

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

Evolves saturated coding benchmarks into harder variants; LiveCodeBench-Plus restores discrimination (Pass@1 27.5–62.6% for frontier models). RL on evolved tasks improves held-out coding by +8.7 points.


Safety (arXiv:2606.02380)

SPADE-Bench: Evaluating Spontaneous Strategic Deception in Agents via Plan-Action Divergence

First benchmark to isolate agent deception (plan-action divergence under pressure) from hallucination; reveals that deception is a genuine and pressing issue in tool-use contexts.


Safety (arXiv:2512.16310)

Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation

TOP-Bench measures compositional privacy leakage from tool returns; average leakage rate 88.6% across six LLM agents. TOP-Align (SFT+DPO) improves H-score by 16.2 points.

Code link available.

Benchmark (arXiv:2603.14465)

AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

1,000 trajectories with 8,509 human step annotations (89.1% agreement). Ternary labeling captures exploration; process signals complement outcome supervision for test-time scaling.

Code link available.

Benchmark (arXiv:2606.01912)

SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes

1,100 tasks across 7 categories in an executable smart-home simulator. Frontier LLMs fail on automation scheduling and ambiguity handling as home complexity increases.


Reasoning (arXiv:2606.00376)

The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary

Proves an attention bottleneck bound on state-tracking capacity; identifies a deterministic horizon d* ∈ [19,31] beyond which tool delegation is necessary. Tool-integrated reasoning reaches 86–94% vs. 24–42% for neural CoT.


Method note

Sampled 60 entries from 946 total (173 new, 404 cross-list, 369 replacements) via deterministic priority scoring for benchmark/artifact signals, then selected 6 papers balancing benchmark novelty, safety relevance, and system impact. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.

Big picture

Agentic routing and delegation benchmarks surface fundamental limitations in current orchestration methods
Long-horizon software development and memory evaluation tasks expose sharp drops in performance as context grows
Domain-specific benchmarks for healthcare and engineering construction highlight the need for specialized evaluation
System-level efficiency gains from dynamic layer routing demonstrate progress in adaptive inference

Benchmarks to extract

Verify that the highest success rate on TwinRouterBench is 64.8% for computer-use models.
Verify that DecisionBench routing fidelity-at-1 ranges from 7.5% to 29.5% across conditions.
Verify that Claude-Opus-4.7 resolves only 39.1% of RoadmapBench tasks.
Verify that the average MINTEval accuracy across all systems is 27.9%.

Papers and links

Benchmark (arXiv:2605.18859)

TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing

Step-level LLM routing benchmark with static and dynamic tracks for agentic workflows


Benchmark (arXiv:2605.19099)

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

Emergent delegation evaluation across GAIA, BFCL, and tau-bench


Benchmark (arXiv:2605.15846)

RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades

115 long-horizon coding tasks from real version upgrades across 17 repos


Benchmark (arXiv:2605.18565)

MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

15.6k QA pairs over long contexts averaging 138.8k tokens for multi-target memory


Benchmark (arXiv:2605.16679)

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

End-to-end healthcare workflow automation with 20 apps and 87 MCP tools


System (arXiv:2510.12773)

Dr.LLM: Dynamic Layer Routing in LLMs

Dynamic layer routing with MCTS-supervised per-layer routers for efficient LLM inference


Method note

Sampled 60 of 460 entries, prioritizing benchmarks and systems with the strongest quantitative signal from a deterministic candidate ranking. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.

The day was dominated by agent runtime security, process-aware benchmarks, self-improving agent systems, and sparse reasoning credit assignment.

Big picture

Runtime agent safety is moving from prompt policies to action interception, MCP monitoring, and host-side controls.
Evaluation is shifting toward process-aware tasks: tool trajectories, delivered artifacts, multimodal verification, and human-validated rubrics.
Self-improving agents are becoming governed systems with rollback, canary tests, experience graphs, and explicit lifecycle controls.
Reasoning work is converging on sparse credit assignment: find the decision tokens or reasoning steps that actually steer the answer.

Benchmarks to extract

TOBench for tool-using agent rows
ADR-Bench and SLEIGHT-Bench for agent security
WebGameBench for coding-agent delivery
LinAlg-Bench, CAM-Bench, and GIM for reasoning diagnostics

Papers and links

Benchmark (arXiv:2605.16909)

TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

100 executable tasks, 27 MCP servers, 324 tools, and closed-loop multimodal verification for end-to-end tool use.


Safety (arXiv:2605.17380)

ADR: An Agentic Detection and Response System

Production-style monitoring for MCP agent activity with ADR-Bench covering 302 tasks and 17 attack techniques.


Benchmark (arXiv:2605.17637)

WebGameBench: Requirement-to-Application Evaluation for Coding Agents

Evaluates browser-accessible delivered games, separating minimum working delivery from excellent requirement satisfaction.


Benchmark (arXiv:2605.16675)

LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes

660 SymPy-certified linear algebra problems plus a failure taxonomy for diagnosing mathematical reasoning.


Reasoning (arXiv:2605.16874)

Reasoning Can Be Restored by Correcting a Few Decision Tokens

Claims reasoning failures concentrate in a small number of early tokens, useful for intervention and evaluation design.


Agent (arXiv:2605.17721)

EXG: Self-Evolving Agents with Experience Graphs

Turns successes and failures into graph memory, giving self-evolving agents a more inspectable substrate.


Method note

Full arXiv /new batch collected on May 19. The LLM scout covered all new submissions; deterministic benchmark detection covered the full batch. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.

Monday's useful signal was practical agent infrastructure: SaaS and shopping environments, formal monitoring, research-agent scaffolding, and vertical medical systems.

Big picture

Real-world web agents are getting eval environments that look more like actual SaaS and commerce workflows.
Agent architecture work is becoming cost-aware: context, hierarchy, reasoning depth, and monitoring are treated as budgeted design choices.
Formal methods are appearing as runtime guardrails for LLM systems rather than only offline verification work.
Medical and robotics papers are packaging open systems around concrete downstream workflows instead of generic model releases.

Benchmarks to extract

ShopGym and SaaS-Bench for web-agent task pages
PAGER/PAGE Bench for long-form or page-level agent evaluation
ToxiAlert-Bench and RoadmapBench from the deterministic screen
VLA-AD and RTL-BenchMT for embodied and hardware-facing rows

Papers and links

Benchmark (arXiv:2605.15777)

SaaS-Bench

A practical benchmark direction for agents operating across SaaS workflows, useful for procurement-style agent evaluation.


Benchmark (arXiv:2605.16116)

ShopGym

High-priority e-commerce agent environment surfaced by both the scout and deterministic screen.


Agent (arXiv:2605.16233)

FORGE

Agent architecture signal around designing and evaluating practical multi-step systems.


Safety (arXiv:2605.16198)

Formal Methods Meet LLMs

Runtime monitoring and formal constraints as a concrete control layer for LLM agents.


Agent (arXiv:2605.16217)

Argus Deep Research Agents

Deep-research agent work that matches CodeSOTA's interest in paper-to-evidence workflows.


System (arXiv:2605.16215)

Fully Open Meditron

Open medical AI system surfaced as a practical extraction target for model, data, and benchmark claims.


Method note

The dated May 18 arXiv /recent section was collected in full. The full LLM run was stopped after a subsampling instruction; the digest instead uses a 60-paper scout plus deterministic screening across all 200 summaries. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.

Friday's signal was about making agent behavior legible: orchestration graphs, memory cold starts, design patterns, education agents, and benchmarked governance workflows.

Big picture

Agent orchestration papers are moving toward explicit graphs, patterns, and inspectable memory instead of one-off prompt chains.
Education and industrial benchmarks are becoming stronger examples of domain-specific agent evaluation.
Safety and governance papers are trying to separate the task being performed from the governance process wrapped around it.
Reasoning papers continue to probe symbolic structure, attributes, and limits of model-based inference.

Benchmarks to extract

EntityBench, ClawForge, EduAgentBench, and Herculean
PDI-Bench, Collider-Bench, and XDomainBench
EduFrameTrap for sycophancy and education-agent failure modes
SimPersona for persona or simulation-agent evaluation

Papers and links

Agent (arXiv:2605.13848)

GraphBit

Agent orchestration and reliability signal around graph-structured coordination.


Agent (arXiv:2605.13880)

PREPING

Agent memory cold-start work; useful for understanding reusable memory in deployed agents.


Benchmark (arXiv:2605.14133)

ClawForge

High-priority benchmark candidate from the Friday benchmark report.


Benchmark (arXiv:2605.14322)

EduAgentBench

Education-agent evaluation target with obvious CodeSOTA task-page value.


Benchmark (arXiv:2605.15199)

EntityBench

Entity-centered benchmark surfaced as a strong extraction candidate.


Safety (arXiv:2605.14744)

Governance-task decoupling

Useful framing for separating operational task success from oversight and governance quality.


Method note

This digest reuses the existing May 15 batch and the reports already in the local paper pipeline. Claims here are at abstract or report level until tables are extracted from the individual PDFs: they are triage, not final benchmark claims, and papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.

Next extraction pass

Turn the calendar into benchmark evidence, not just reading notes.

The valuable follow-up is to pull benchmark tables, model lists, task definitions, and failure taxonomies from the strongest papers. That gives CodeSOTA rows users can compare, not merely links they can browse.
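
One way to picture the target of that extraction pass is a small record type that keeps each number tied to its source and to how it was obtained. The schema below is a hypothetical sketch, not CodeSOTA's actual data model; every field name is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class EvidenceLevel(Enum):
    ABSTRACT = "abstract"   # triage: number quoted from the abstract only
    TABLE = "table"         # extracted from a table or source file in the paper


@dataclass(frozen=True)
class EvidenceRow:
    arxiv_id: str
    benchmark: str
    model: str
    metric: str
    value: float
    unit: str
    level: EvidenceLevel
    locator: str = ""       # where in the paper the number lives, e.g. "Table 2"

    def is_comparable(self) -> bool:
        """A row counts as comparable evidence only once it has a table-level source."""
        return self.level is EvidenceLevel.TABLE and bool(self.locator)


# Triage row built from the June 9 digest; the digest does not name the model.
weavebench_triage = EvidenceRow(
    arxiv_id="2606.09426",
    benchmark="WeaveBench",
    model="best reported system (unnamed in the digest)",
    metric="PassRate",
    value=41.2,
    unit="%",
    level=EvidenceLevel.ABSTRACT,
)
```

Keeping the evidence level on the row itself means abstract-level triage can sit in the same store as extracted results without being mistaken for them.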
