This day in AI

Recent paper calendar for people tracking useful AI shifts.

CodeSOTA scans each day's arXiv listings for papers that could become useful benchmark rows, model-selection evidence, or product-facing research notes. This page keeps the daily trail visible instead of burying it in one-off reports.

Calendar

Research days worth revisiting

Tuesday

June 9, 2026

Today's batch is dominated by new benchmarks for agentic systems, safety, and multimodal reasoning, alongside a strong signal for structured evaluation and self-evolving architectures.

509 entries

Monday

June 8, 2026

Agentic self-evolution, formal verification, safety evaluations, and new domain-specific benchmarks dominate Monday's batch, revealing growing rigor in measuring and improving autonomous systems.

164 entries

Friday

June 5, 2026

A dense crop of task-specific benchmarks and agent systems reveals that frontier models saturate narrow perception tests but still fail on long-horizon, stateful, and multi-modal safety reasoning, while new training and inference techniques show promise for closing those gaps.

280 entries

Thursday

June 4, 2026

Benchmarks and systems for agent safety, long-horizon code optimization, consequence-aware compute allocation, and structured skill learning dominate today's cs.AI submissions.

207 entries

Wednesday

June 3, 2026

Today's batch is dominated by rigorous new benchmarks that expose agent fragility in finance, medicine, and software engineering, alongside system papers that improve evaluation efficiency and safety.

440 entries, including 80 new submissions

Tuesday

June 2, 2026

Today's batch is dominated by new benchmarks that expose persistent gaps in agentic reasoning, tool-use safety, and domain-specific evaluation, while also introducing systems for automated benchmark evolution and step-level process diagnosis.

946 entries, including 173 new submissions

Wednesday

May 20, 2026

New benchmarks for agentic routing, long-horizon software development, memory evaluation, and healthcare workflows reveal persistent gaps in frontier models, while dynamic layer routing offers a path to more efficient LLM inference.

460 entries, including 70 new submissions

Tuesday

May 19, 2026

The day was dominated by agent runtime security, process-aware benchmarks, self-improving agent systems, and sparse reasoning credit assignment.

900 entries, including 142 new submissions

Monday

May 18, 2026

Monday's useful signal was practical agent infrastructure: SaaS and shopping environments, formal monitoring, research-agent scaffolding, and vertical medical systems.

200 entries, sampled LLM scout plus full deterministic screen

Friday

May 15, 2026

Friday's signal was about making agent behavior legible: orchestration graphs, memory cold starts, design patterns, education agents, and benchmarked governance workflows.

445 entries, including 100 new submissions

Today's batch is dominated by new benchmarks for agentic systems, safety, and multimodal reasoning, alongside a strong signal for structured evaluation and self-evolving architectures.

Big picture

  • Agent benchmarks expand to long-horizon, hybrid-interface, and adversarial settings
  • Safety and alignment work targets emergent misalignment, compliance, and overconfidence
  • Structured evaluation frameworks for tabular, medical, and formal reasoning domains
  • Self-evolution and influence-guided training for reasoning improvement

Benchmarks to extract

  • WeaveBench: extract PassRate and trajectory-aware judge correlation on hybrid tasks
  • KernelBench: confirm attack success rate reduction from 62% to 0% after hacker-fixer loop
  • TRL-Bench: verify that generic text encoders lead on surface-text tasks and tabular specialists lead on aligned tasks
  • IMUG-Bench: confirm exposure bias gap between understanding and generation in multi-turn settings

Papers and links

Method note

Sampled 80 deterministic candidates from 509 cs.AI entries, prioritized by artifact names, benchmark language, and quantitative evidence, then selected 6 papers covering agent benchmarks, safety, reasoning, and structured evaluation. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
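
The deterministic screen described in this note can be sketched roughly as below. This is a minimal illustration, not the actual CodeSOTA pipeline: the regex patterns, the weights, the `Entry` type, and the placeholder arXiv ids are all assumptions chosen to mirror the three signals named above (artifact names, benchmark language, quantitative evidence).

```python
import re
from dataclasses import dataclass

# Hypothetical signal patterns; the real screen's rules are not documented here.
ARTIFACT_NAME = re.compile(r"\b[A-Z][A-Za-z0-9]*-?(?:Bench|Arena|Eval|Gym)\b")
BENCHMARK_WORDS = re.compile(r"\b(?:benchmark|leaderboard|state-of-the-art|sota)\b", re.IGNORECASE)
QUANTITATIVE = re.compile(r"\d+(?:\.\d+)?\s?(?:%|pp\b|x\b)")


@dataclass(frozen=True)
class Entry:
    arxiv_id: str
    title: str
    abstract: str


def signal_score(entry: Entry) -> int:
    """Score one listing entry from its title and abstract only."""
    text = f"{entry.title}. {entry.abstract}"
    score = 3 * len(set(ARTIFACT_NAME.findall(text)))      # named artifacts
    score += 2 * bool(BENCHMARK_WORDS.search(text))         # benchmark language
    score += min(len(QUANTITATIVE.findall(text)), 3)        # numbers, capped at 3
    return score


def top_candidates(entries: list[Entry], k: int) -> list[Entry]:
    """Return the k highest-scoring entries; ties break on arXiv id so the
    same input always yields the same candidate list."""
    ranked = sorted(entries, key=lambda e: (-signal_score(e), e.arxiv_id))
    return ranked[:k]


if __name__ == "__main__":
    # Placeholder entries, invented for illustration.
    batch = [
        Entry("0000.00001", "A survey of prompting", "We discuss prompting practices."),
        Entry("0000.00002", "ExampleBench: a benchmark for agents",
              "Attack success rate drops from 62% to 0%."),
        Entry("0000.00003", "Notes on evaluation",
              "We report a 5% gain on a public leaderboard."),
    ]
    print([e.arxiv_id for e in top_candidates(batch, 2)])
```

Because no randomness is involved, rerunning the screen on the same listing reproduces the same candidates, which is what makes a later manual selection auditable.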

Agentic self-evolution, formal verification, safety evaluations, and new domain-specific benchmarks dominate Monday's batch, revealing growing rigor in measuring and improving autonomous systems.

Big picture

  • Closed-loop self-evolution frameworks that reuse agent traces for targeted task generation
  • Formal modeling and verification of agent workflows using dependent-type languages
  • Strategic attack selection as a critical blind spot in agentic AI control evaluations
  • A surge of domain-specific benchmarks covering macOS GUIs, drug design, and mathematical research discussions

Benchmarks to extract

  • Socratic-SWE: exact SWE-bench Verified score per iteration (abstract reports 50.40% after 3 rounds)
  • MacArena: per-model accuracy on macOS-native subset (abstract reports leader trailing >26%)
  • Attack Selection: exact safety reduction in pp at 1% audit (abstract reports 20pp on BashArena and LinuxArena start policy)
  • Lean4Agent: verification-passing vs. failing workflow accuracy on SWE-bench Verified subset (abstract reports 11.94% average difference)

Papers and links

Method note

Collected the 164-entry cs.AI recent-day listing for June 8, 2026 via the arXiv API, deterministically sampled the top 77 benchmark/SOTA-signal candidates, then manually selected six papers balancing agent systems, safety, and domain-specific benchmarks. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.

A dense crop of task-specific benchmarks and agent systems reveals that frontier models saturate narrow perception tests but still fail on long-horizon, stateful, and multi-modal safety reasoning, while new training and inference techniques show promise for closing those gaps.

Big picture

  • Long-running / stateful agent evaluation emerges as a core challenge (SentinelBench, CL-Bench, Agents' Last Exam)
  • Code-agent benchmarks shift from completion to repository-level tool-use and framework evaluation (TensorBench, ADK Arena)
  • Safety and jailbreak detection advance with both system-level guards and multi-modal safety benchmarks (GuardNet, MCBench, SlotGCG)
  • New training recipes (GRPO variants, CKA distillation, hypernetwork adapters) promise more efficient and reliable reasoning and control

Benchmarks to extract

  • SentinelBench task completion and reaction-time tradeoffs (table in §4)
  • CL-Bench gain metric isolating online learning from model capability
  • TensorBench pass rates and pairwise Cohen's κ across 7 agents
  • MCBench cross-modal safety accuracy breakdown by modality combination and risk category
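
For the TensorBench item, the agreement statistic can be recomputed from per-task outcomes once they are extracted. The sketch below shows the standard Cohen's κ formula applied to every agent pair; the agent names and pass/fail vectors are invented for illustration, and whether the paper computes κ over pass/fail outcomes in exactly this way is an assumption to check against the PDF.

```python
from itertools import combinations
from typing import Sequence


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two equal-length label sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(a)
    # Observed agreement: share of tasks where both agents got the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each label's marginal frequencies.
    labels = set(a) | set(b)
    expected = sum((list(a).count(l) / n) * (list(b).count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (observed - expected) / (1 - expected)


def pairwise_kappa(outcomes: dict[str, Sequence[int]]) -> dict[tuple[str, str], float]:
    """Kappa for every unordered pair of agents (7 agents give 21 pairs)."""
    return {
        (x, y): cohens_kappa(outcomes[x], outcomes[y])
        for x, y in combinations(sorted(outcomes), 2)
    }


if __name__ == "__main__":
    # Hypothetical pass (1) / fail (0) outcomes on four tasks.
    outcomes = {
        "agent_a": [1, 1, 0, 0],
        "agent_b": [1, 1, 0, 0],
        "agent_c": [1, 0, 1, 0],
    }
    for pair, kappa in pairwise_kappa(outcomes).items():
        print(pair, round(kappa, 3))
```

A κ near 1 means two agents pass and fail the same tasks; a κ near 0 means their overlap is about what chance would give, even if their overall pass rates match.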

Papers and links

Method note

Collected all 280 cs.AI entries from the June 5 arXiv recent-day listing, filtered 80 deterministic candidates with strong benchmark or SOTA language, sampled metadata for 60, and selected 6 papers with the highest combined artifact signal, domain diversity, and practical impact on agent/benchmark/safety evaluation. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.

Benchmarks and systems for agent safety, long-horizon code optimization, consequence-aware compute allocation, and structured skill learning dominate today's cs.AI submissions.

Big picture

  • Agent safety and intervention timing: new benchmarks for AI companion safety and a critical look at LLM-judge-based intervention triggers
  • Long-horizon and consequence-aware reasoning: benchmarks for iterative code optimization and compute allocation based on task cost
  • Structured skill and memory for agents: graph-based skill protocols, parametric memory for self-evolving agents, and state-grounded skill retrieval
  • Cross-domain and multi-modal benchmarks: unified CAD, end-to-end autonomous driving, and cybersecurity vulnerability lifecycle evaluation

Benchmarks to extract

  • AICompanionBench: per-category accuracy of LLM-as-judge on 9 safety risk categories (e.g., manipulation, self-harm)
  • AutoLab: task completion rate under wall-clock budget across 36 long-horizon optimization tasks
  • CyberGym-E2E: end-to-end success rate on vulnerability discovery, PoC generation, and patch generation for 920 CVEs
  • SWE-bench Lite: cost-weighted loss reduction under consequence-aware vs. difficulty-aware compute allocation

Papers and links

Method note

Sampled 60 of 207 cs.AI entries from the June 4 arXiv listing, then selected 6 papers with the strongest benchmark, agent, safety, or infrastructure signals based on artifact names, known benchmarks, and SOTA language in titles and abstracts. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.

Today's batch is dominated by rigorous new benchmarks that expose agent fragility in finance, medicine, and software engineering, alongside system papers that improve evaluation efficiency and safety.

Big picture

  • Agent benchmarks for high-stakes domains: finance (BigFinanceBench, Hedge-Bench) and clinical software (MedCUA-Bench) show frontier models score below 60%.
  • Process-level evaluation gains traction: AgentLens reveals 10.7% of passing SWE-agent trajectories are 'Lucky Passes' that tests miss.
  • Resource-efficient evaluation: ProEval uses transfer learning to estimate performance with 8-65x fewer samples; TriEval runs bias/toxicity checks on a laptop.
  • Safety and abstention: AgentRedBench cuts indirect prompt injection ASR from 69.9% to 2.4%; a new taxonomy formalizes when agents should refuse to act.

Benchmarks to extract

  • Extract BigFinanceBench rubric scores per workflow step to localize where agents fail (best system 58.8%).
  • Confirm Hedge-Bench deterministic grading methodology and whether the <16% score holds for GPT-5 and Claude Opus 4.
  • Verify AgentLens-Bench 'Lucky Pass' rate (10.7%) and whether model rankings shift by 5+ positions when using quality score vs. pass rate.
  • Extract AgentRedBench ASR per model and per attack type to confirm the 2.4% ASR with AgentRedGuard at 0.37% FPR.
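
The AgentLens-Bench check above asks whether rankings move when quality score replaces pass rate. Once per-model numbers are extracted, the shift is a short calculation. The sketch below assumes higher is better for both metrics; the model names and values are placeholders, not figures from the paper.

```python
def rank_positions(scores: dict[str, float]) -> dict[str, int]:
    """Map each model to its 1-based rank, best score first.
    Ties break alphabetically so the ranking is reproducible."""
    ordered = sorted(scores, key=lambda name: (-scores[name], name))
    return {name: position + 1 for position, name in enumerate(ordered)}


def rank_shifts(pass_rate: dict[str, float], quality: dict[str, float]) -> dict[str, int]:
    """Positions each model moves when ranked by quality instead of pass rate.
    Positive means the model drops; negative means it climbs."""
    if pass_rate.keys() != quality.keys():
        raise ValueError("both metrics must cover the same models")
    by_pass = rank_positions(pass_rate)
    by_quality = rank_positions(quality)
    return {model: by_quality[model] - by_pass[model] for model in pass_rate}


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    pass_rate = {"model_a": 0.70, "model_b": 0.65, "model_c": 0.60}
    quality = {"model_a": 0.50, "model_b": 0.62, "model_c": 0.55}
    shifts = rank_shifts(pass_rate, quality)
    threshold = 2  # the digest item uses 5; lowered here for a 3-model example
    print({m: s for m, s in shifts.items() if abs(s) >= threshold})
```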

Papers and links

Method note

Sampled 60 high-signal entries from 440 total (80 new, 203 cross-list, 157 replacements) by prioritizing papers with named benchmarks, SOTA claims, or safety/agent artifacts in title or abstract, then selected 6 for the digest. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
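
The batch composition in this note can be sanity-checked with a few lines. The sketch below uses only the counts stated above; the `BatchCounts` type and `sample_fraction` helper are illustrative names, not part of any CodeSOTA tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchCounts:
    new: int
    cross_list: int
    replacements: int

    @property
    def total(self) -> int:
        return self.new + self.cross_list + self.replacements


def sample_fraction(sampled: int, counts: BatchCounts) -> float:
    """Share of the full batch that the sample covers."""
    return sampled / counts.total


if __name__ == "__main__":
    june_3 = BatchCounts(new=80, cross_list=203, replacements=157)
    assert june_3.total == 440  # matches the stated batch size
    print(f"{sample_fraction(60, june_3):.1%} of the batch sampled")  # 13.6%
```

New submissions make up 80 of the 440 entries, so a sample that favors named benchmarks is drawn from a pool that is mostly cross-lists and replacements.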

Today's batch is dominated by new benchmarks that expose persistent gaps in agentic reasoning, tool-use safety, and domain-specific evaluation, while also introducing systems for automated benchmark evolution and step-level process diagnosis.

Big picture

  • Agent safety and deception detection: SPADE-Bench and TOP-Bench formalize plan-action divergence and tool-orchestration privacy leakage, showing that current agents leak sensitive conclusions at high rates.
  • Benchmark evolution and efficiency: BenchEvolver automatically generates harder coding problems from saturated benchmarks, while MIS-based prompt selection reduces evaluation cost without distorting rankings.
  • Domain-specific agent evaluation: New benchmarks for smart homes (SMH-Bench), local services (LocalSearchBench), and mobile GUI agents (MobiBench) reveal that frontier models struggle with multi-hop reasoning and ambiguity in vertical domains.
  • Process-level diagnosis: AgentProcessBench and AutoMedBench provide step-level scoring of agent trajectories, showing that verification and error-handling stages are the weakest links in long-horizon tasks.

Benchmarks to extract

  • BenchEvolver: LiveCodeBench-Plus Pass@1 for frontier models (extract from Table 2).
  • SPADE-Bench: Leakage rate and H-score across models (extract from main results).
  • AgentProcessBench: Step-level accuracy of process reward models vs. outcome supervision (extract from Table 3).
  • SMH-Bench: Accuracy breakdown by task category and home complexity (extract from Table 2).

Papers and links

Method note

Sampled 60 entries from 946 total (173 new, 404 cross-list, 369 replacements) via deterministic priority scoring for benchmark/artifact signals, then selected 6 papers balancing benchmark novelty, safety relevance, and system impact. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.

New benchmarks for agentic routing, long-horizon software development, memory evaluation, and healthcare workflows reveal persistent gaps in frontier models, while dynamic layer routing offers a path to more efficient LLM inference.

Big picture

  • Agentic routing and delegation benchmarks surface fundamental limitations in current orchestration methods
  • Long-horizon software development and memory evaluation tasks expose sharp drops in performance as context grows
  • Domain-specific benchmarks for healthcare and engineering construction highlight the need for specialized evaluation
  • System-level efficiency gains from dynamic layer routing demonstrate progress in adaptive inference

Benchmarks to extract

  • Verify that the highest success rate on TwinRouterBench is 64.8% for computer-use models.
  • Verify that DecisionBench routing fidelity-at-1 ranges from 7.5% to 29.5% across conditions.
  • Verify that Claude-Opus-4.7 resolves only 39.1% of RoadmapBench tasks.
  • Verify that the MINTEval average accuracy across all systems is 27.9%.

Papers and links

Method note

Sampled 60 of 460 entries, prioritizing benchmarks and systems with the strongest quantitative signal from a deterministic candidate ranking. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.

The day was dominated by agent runtime security, process-aware benchmarks, self-improving agent systems, and sparse reasoning credit assignment.

Big picture

  • Runtime agent safety is moving from prompt policies to action interception, MCP monitoring, and host-side controls.
  • Evaluation is shifting toward process-aware tasks: tool trajectories, delivered artifacts, multimodal verification, and human-validated rubrics.
  • Self-improving agents are becoming governed systems with rollback, canary tests, experience graphs, and explicit lifecycle controls.
  • Reasoning work is converging on sparse credit assignment: find the decision tokens or reasoning steps that actually steer the answer.

Benchmarks to extract

  • TOBench for tool-using agent rows
  • ADR-Bench and SLEIGHT-Bench for agent security
  • WebGameBench for coding-agent delivery
  • LinAlg-Bench, CAM-Bench, and GIM for reasoning diagnostics

Papers and links

Method note

Full arXiv /new batch collected on May 19. The LLM scout covered all new submissions; deterministic benchmark detection covered the full batch. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.

Monday's useful signal was practical agent infrastructure: SaaS and shopping environments, formal monitoring, research-agent scaffolding, and vertical medical systems.

Big picture

  • Real-world web agents are getting eval environments that look more like actual SaaS and commerce workflows.
  • Agent architecture work is becoming cost-aware: context, hierarchy, reasoning depth, and monitoring are treated as budgeted design choices.
  • Formal methods are appearing as runtime guardrails for LLM systems rather than only offline verification work.
  • Medical and robotics papers are packaging open systems around concrete downstream workflows instead of generic model releases.

Benchmarks to extract

  • ShopGym and SaaS-Bench for web-agent task pages
  • PAGER/PAGE Bench for long-form or page-level agent evaluation
  • ToxiAlert-Bench and RoadmapBench from the deterministic screen
  • VLA-AD and RTL-BenchMT for embodied and hardware-facing rows

Papers and links

Method note

The dated May 18 arXiv /recent section was collected in full. The full LLM run was stopped in favor of a subsample, so this digest uses a 60-paper scout plus deterministic screening across all 200 summaries. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.

Friday's signal was about making agent behavior legible: orchestration graphs, memory cold starts, design patterns, education agents, and benchmarked governance workflows.

Big picture

  • Agent orchestration papers are moving toward explicit graphs, patterns, and inspectable memory instead of one-off prompt chains.
  • Education and industrial benchmarks are becoming stronger examples of domain-specific agent evaluation.
  • Safety and governance papers are trying to separate the task being performed from the governance process wrapped around it.
  • Reasoning papers continue to probe symbolic structure, attributes, and limits of model-based inference.

Benchmarks to extract

  • EntityBench, ClawForge, EduAgentBench, and Herculean
  • PDI-Bench, Collider-Bench, and XDomainBench
  • EduFrameTrap for sycophancy and education-agent failure modes
  • SimPersona for persona or simulation-agent evaluation

Papers and links

Method note

This uses the existing May 15 batch and reports already in the local paper pipeline. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.

Next extraction pass

Turn the calendar into benchmark evidence, not just reading notes.

The valuable follow-up is to pull benchmark tables, model lists, task definitions, and failure taxonomies from the strongest papers. That gives CodeSOTA rows users can compare, not merely links they can browse.
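
One way to keep triage claims separate from extracted evidence is to carry a status on every row. The sketch below is a hypothetical data model, not the CodeSOTA schema: the class names, fields, and `promote` method are assumptions. The example row reuses the abstract-level RoadmapBench figure from the May 20 notes, which is still unverified.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EvidenceStatus(Enum):
    TRIAGE = "triage"        # abstract-level claim only
    EXTRACTED = "extracted"  # confirmed against the paper's table or source


@dataclass
class EvidenceRow:
    benchmark: str
    metric: str
    model: Optional[str]
    value: Optional[float]
    source_location: Optional[str] = None  # filled in on extraction
    status: EvidenceStatus = EvidenceStatus.TRIAGE

    def promote(self, value: float, source_location: str) -> None:
        """Record a value confirmed against the PDF/source and mark the row extracted."""
        self.value = value
        self.source_location = source_location
        self.status = EvidenceStatus.EXTRACTED


def comparable(rows: list[EvidenceRow]) -> list[EvidenceRow]:
    """Only extracted rows are safe to show side by side."""
    return [row for row in rows if row.status is EvidenceStatus.EXTRACTED]


if __name__ == "__main__":
    row = EvidenceRow("RoadmapBench", "tasks resolved (%)", "Claude-Opus-4.7", 39.1)
    print(comparable([row]))  # [] because the figure is still abstract-level
```

Defaulting every new row to triage means a number can only reach a comparison table after someone has recorded where in the paper it was confirmed.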