WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces
114 tasks requiring GUI+CLI+code orchestration; best PassRate 41.2%.
This day in AI
CodeSOTA scans each day's arXiv listings for papers that could become useful benchmark rows, model-selection evidence, or product-facing research notes. This page keeps the daily trail visible instead of burying it in one-off reports.
Calendar
Tuesday
Today's batch is dominated by new benchmarks for agentic systems, safety, and multimodal reasoning, alongside a strong signal for structured evaluation and self-evolving architectures.
509 entries
Monday
Agentic self-evolution, formal verification, safety evaluations, and new domain-specific benchmarks dominate Monday's batch, revealing growing rigor in measuring and improving autonomous systems.
164 entries
Friday
A dense crop of task-specific benchmarks and agent systems reveals that frontier models saturate narrow perception tests but still fail on long-horizon, stateful, and multi-modal safety reasoning, while new training and inference techniques show promise for closing those gaps.
280 entries
Thursday
Benchmarks and systems for agent safety, long-horizon code optimization, consequence-aware compute allocation, and structured skill learning dominate today's cs.AI submissions.
207 entries
Wednesday
Today's batch is dominated by rigorous new benchmarks that expose agent fragility in finance, medicine, and software engineering, alongside system papers that improve evaluation efficiency and safety.
440 entries, including 80 new submissions
Tuesday
Today's batch is dominated by new benchmarks that expose persistent gaps in agentic reasoning, tool-use safety, and domain-specific evaluation, while also introducing systems for automated benchmark evolution and step-level process diagnosis.
946 entries, including 173 new submissions
Wednesday
New benchmarks for agentic routing, long-horizon software development, memory evaluation, and healthcare workflows reveal persistent gaps in frontier models, while dynamic layer routing offers a path to more efficient LLM inference.
460 entries, including 70 new submissions
Tuesday
The day was dominated by agent runtime security, process-aware benchmarks, self-improving agent systems, and sparse reasoning credit assignment.
900 entries, including 142 new submissions
Monday
Monday's useful signal was practical agent infrastructure: SaaS and shopping environments, formal monitoring, research-agent scaffolding, and vertical medical systems.
200 entries, sampled LLM scout plus full deterministic screen
Friday
Friday's signal was about making agent behavior legible: orchestration graphs, memory cold starts, design patterns, education agents, and benchmarked governance workflows.
445 entries, including 100 new submissions
Today's batch is dominated by new benchmarks for agentic systems, safety, and multimodal reasoning, alongside a strong signal for structured evaluation and self-evolving architectures.
Big picture
Benchmarks to extract
Papers and links
114 tasks requiring GUI+CLI+code orchestration; best PassRate 41.2%.
Hacker-fixer loop drives attack success from 62% to 0% on KernelBench.
Co-training framework with DuGRPO; 20% relative gain on Olympiad/SuperGPQA.
20 models, 16 tasks; shows encoder quality is capability-specific.
3,113 samples across static, temporal, and hybrid tasks; exposes exposure bias.
Training-free commit gate reduces false commits from 30–42% to near zero.
Method note
Sampled 80 deterministic candidates from 509 cs.AI entries, prioritized by artifact names, benchmark language, and quantitative evidence, then selected 6 papers covering agent benchmarks, safety, reasoning, and structured evaluation. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
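The prioritization described in this method note can be sketched as a reproducible scoring pass. This is a minimal sketch, not the CodeSOTA pipeline: the regular expressions, the 3/2/1 weights, the `Entry` fields, and the `top_candidates` helper are all assumptions chosen for illustration. It shows the one property the note relies on: the same entries always yield the same candidates in the same order.

```python
import re
from dataclasses import dataclass

# Hypothetical signal patterns; the real screening rules are not given in the source.
ARTIFACT_RE = re.compile(r"\b[A-Z][A-Za-z0-9]*(?:-?Bench|Arena|Eval)\b")
BENCHMARK_RE = re.compile(r"\b(?:benchmarks?|state-of-the-art|sota|leaderboard)\b", re.IGNORECASE)
QUANT_RE = re.compile(r"\d+(?:\.\d+)?\s*(?:%|pp\b|x\b)")


@dataclass(frozen=True)
class Entry:
    entry_id: str   # any stable identifier for the listing entry
    title: str
    abstract: str


def score(entry: Entry) -> int:
    """Weighted count of artifact names, benchmark language, and quantitative evidence."""
    text = f"{entry.title} {entry.abstract}"
    return (
        3 * len(ARTIFACT_RE.findall(text))      # assumed weight for named artifacts
        + 2 * len(BENCHMARK_RE.findall(text))   # assumed weight for benchmark language
        + 1 * len(QUANT_RE.findall(text))       # assumed weight for reported numbers
    )


def top_candidates(entries, k):
    """Return the k highest-scoring entries; ties are broken by entry_id."""
    return sorted(entries, key=lambda e: (-score(e), e.entry_id))[:k]
```

Breaking ties on a stable key such as the entry ID is what makes the sample deterministic; without it, entries with equal scores would come out in whatever order the listing happened to be fetched.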
Agentic self-evolution, formal verification, safety evaluations, and new domain-specific benchmarks dominate Monday's batch, revealing growing rigor in measuring and improving autonomous systems.
Big picture
Benchmarks to extract
Papers and links
Closed-loop self-evolution distills solving traces into structured skills to generate targeted repair tasks; achieves 50.40% on SWE-bench Verified after three iterations.
421 manually verified tasks (50 apps) on native Apple Silicon; model rankings invert between ported and macOS-native tasks (leader trails by >26%), revealing cross-platform GUI competence gaps.
Strategic start/stop attack policies reduce measured safety by 20–28pp at 1% audit budget on BashArena and LinuxArena, suggesting control evaluations may be overly optimistic.
First Lean4-based framework for agent behavior verification; verified workflows outperform failing ones by 11.94% on SWE-Bench-Verified and ELAIP-Bench, with 7.47% further gain from LeanEvolve revision.
AARRI-Bench targets granular research behavior; best system (Mini-SWE-Agent + Claude Opus 4.7) achieves only 68.3%, frequently missing subtle details humans catch.
5,780 shallow-pocket targets from CrossDocked2020; SOTA generative models show weaker predicted binding affinity on low-concavity interfaces (e.g., KRAS, MYC), highlighting a key failure mode.
Method note
Collected the 164-entry cs.AI recent-day listing for June 8, 2026 via the arXiv API, deterministically sampled the top 77 benchmark/SOTA-signal candidates, then manually selected six papers balancing agent systems, safety, and domain-specific benchmarks. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
A dense crop of task-specific benchmarks and agent systems reveals that frontier models saturate narrow perception tests but still fail on long-horizon, stateful, and multi-modal safety reasoning, while new training and inference techniques show promise for closing those gaps.
Big picture
Benchmarks to extract
Papers and links
100 tasks across 10 synthetic web environments test whether agents can monitor, wait, and react promptly over time rather than acting continuously.
First expert-validated benchmark for continual learning across 6 domains; naive ICL outperforms dedicated memory systems, exposing headroom for better stateful architectures.
199 repo-level tasks on a real PyTorch extension; strongest agent passes 64.8%, and pairwise agreement between agents is low (κ=0.05), revealing task-specific skill gaps.
1,196 scenarios across vision, audio, and text; current Omni LLMs fail to integrate cross-modal cues for safety judgments, performing better only when salient signals are present.
Self-supervised method that re-solves past tasks and selects harness updates by pairwise self-preference; improves SWE-Bench Pro pass rate from 59% to 78% without external labels.
Uses blueprint dependency graphs and parallel Lean proving with DeepSeek-V4-Flash; achieves 99.2% pass@1 on MiniF2F-test and 75.6% on PutnamBench at a fraction of prior cost.
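The pairwise agreement figure in the PyTorch-extension entry above (κ=0.05) is easier to read next to a worked definition. The sketch below computes Cohen's kappa for two agents' pass/fail outcomes on the same tasks. It is an assumption that the paper uses this variant of kappa; the function name and the toy data are illustrative only.

```python
import math


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of pass/fail booleans."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of tasks where both agents had the same outcome.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each agent's own pass rate.
    pass_a = sum(a) / n
    pass_b = sum(b) / n
    expected = pass_a * pass_b + (1 - pass_a) * (1 - pass_b)
    if expected == 1.0:
        return math.nan  # undefined when both agents always give the same single outcome
    return (observed - expected) / (1 - expected)


# Toy data (not from the paper): both agents pass half the tasks, but different halves overlap by chance.
agent_a = [True, True, False, False]
agent_b = [True, False, True, False]
kappa = cohens_kappa(agent_a, agent_b)
```

A kappa near zero means the two agents agree about as often as their individual pass rates would predict by chance, which is why a low value suggests they succeed on different tasks rather than sharing one difficulty ordering.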
Method note
Collected all 280 cs.AI entries from the June 5 arXiv recent-day listing, filtered 80 deterministic candidates with strong benchmark or SOTA language, sampled metadata for 60, and selected 6 papers with the highest combined artifact signal, domain diversity, and practical impact on agent/benchmark/safety evaluation. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
Benchmarks and systems for agent safety, long-horizon code optimization, consequence-aware compute allocation, and structured skill learning dominate today's cs.AI submissions.
Big picture
Benchmarks to extract
Papers and links
First public benchmark (2,123 real Replika conversations) for fine-grained safety risk categories; reveals LLMs struggle with implicit unsafe interactions like manipulation.
Proposes consequence-aware test-time compute allocation; on SWE-bench Lite, reduces cost-weighted loss by 22–33% vs. difficulty-only routing.
36 expert-curated tasks for ultra long-horizon closed-loop optimization; finds that persistence, not initial quality, is the dominant success predictor.
Introduces TMEM: agents absorb distilled supervision into fast LoRA weights via online updates, outperforming summary/retrieval baselines on LoCoMo and LongMemEval.
Models skills as directed execution graphs; compiling human-written skills to AIP raised Claude Sonnet's mean task reward from 0.60 to 0.71 on SkillsBench.
920 real-world vulnerabilities across 139 open-source projects; evaluates full lifecycle of discovery, PoC generation, and patch generation.
Method note
Sampled 60 of 207 cs.AI entries from the June 4 arXiv listing, then selected 6 papers with the strongest benchmark, agent, safety, or infrastructure signals based on artifact names, known benchmarks, and SOTA language in titles and abstracts. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
Today's batch is dominated by rigorous new benchmarks that expose agent fragility in finance, medicine, and software engineering, alongside system papers that improve evaluation efficiency and safety.
Big picture
Benchmarks to extract
Papers and links
928 expert-authored financial tasks with point-weighted rubrics; best agent scores only 58.8%.
102 real hedge-fund analyst tasks with deterministic grading; frontier models score below 16%.
18 clinical scenarios across 10 domains; best closed-source model reaches 54.2% strict success, open-source agents average 2.5%.
Process-level analysis of 2,614 SWE-agent trajectories; 10.7% of passing runs are 'Lucky Passes' with chaotic behavior.
Transfer-learning framework using Gaussian Processes; estimates performance within 1% of ground truth with 8–65x fewer samples.
215 indirect prompt injection scenarios across 24 enterprise integrations; guard model cuts attack success rate from 69.9% to 2.4%.
Method note
Sampled 60 high-signal entries from 440 total (80 new, 203 cross-list, 157 replacements) by prioritizing papers with named benchmarks, SOTA claims, or safety/agent artifacts in title or abstract, then selected 6 for the digest. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
Today's batch is dominated by new benchmarks that expose persistent gaps in agentic reasoning, tool-use safety, and domain-specific evaluation, while also introducing systems for automated benchmark evolution and step-level process diagnosis.
Big picture
Benchmarks to extract
Papers and links
Evolves saturated coding benchmarks into harder variants; LiveCodeBench-Plus restores discrimination (Pass@1 27.5–62.6% for frontier models). RL on evolved tasks improves held-out coding by +8.7 points.
First benchmark to isolate agent deception (plan-action divergence under pressure) from hallucination; reveals that deception is a genuine and pressing issue in tool-use contexts.
TOP-Bench measures compositional privacy leakage from tool returns; average leakage rate 88.6% across six LLM agents. TOP-Align (SFT+DPO) improves H-score by 16.2 points.
1,000 trajectories with 8,509 human step annotations (89.1% agreement). Ternary labeling captures exploration; process signals complement outcome supervision for test-time scaling.
1,100 tasks across 7 categories in an executable smart-home simulator. Frontier LLMs fail on automation scheduling and ambiguity handling as home complexity increases.
Proves an attention bottleneck bound on state-tracking capacity; identifies a deterministic horizon d* ∈ [19,31] beyond which tool delegation is necessary. Tool-integrated reasoning reaches 86–94% vs. 24–42% for neural CoT.
Method note
Sampled 60 entries from 946 total (173 new, 404 cross-list, 369 replacements) via deterministic priority scoring for benchmark/artifact signals, then selected 6 papers balancing benchmark novelty, safety relevance, and system impact. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
New benchmarks for agentic routing, long-horizon software development, memory evaluation, and healthcare workflows reveal persistent gaps in frontier models, while dynamic layer routing offers a path to more efficient LLM inference.
Big picture
Benchmarks to extract
Papers and links
Step-level LLM routing benchmark with static and dynamic tracks for agentic workflows
Emergent delegation evaluation across GAIA, BFCL, and tau-bench
115 long-horizon coding tasks from real version upgrades across 17 repos
15.6k QA pairs over long contexts averaging 138.8k tokens for multi-target memory
End-to-end healthcare workflow automation with 20 apps and 87 MCP tools
Dynamic layer routing with MCTS-supervised per-layer routers for efficient LLM inference
Method note
Sampled 60 of 460 entries, prioritizing benchmarks and systems with the strongest quantitative signal from a deterministic candidate ranking. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
The day was dominated by agent runtime security, process-aware benchmarks, self-improving agent systems, and sparse reasoning credit assignment.
Big picture
Benchmarks to extract
Papers and links
100 executable tasks, 27 MCP servers, 324 tools, and closed-loop multimodal verification for end-to-end tool use.
Production-style monitoring for MCP agent activity with ADR-Bench covering 302 tasks and 17 attack techniques.
Evaluates browser-accessible delivered games, separating minimum working delivery from excellent requirement satisfaction.
660 SymPy-certified linear algebra problems plus a failure taxonomy for diagnosing mathematical reasoning.
Claims reasoning failures concentrate in a small number of early tokens, useful for intervention and evaluation design.
Turns successes and failures into graph memory, giving self-evolving agents a more inspectable substrate.
Method note
Full arXiv /new batch collected on May 19. The LLM scout covered all new submissions; deterministic benchmark detection covered the full batch. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
Monday's useful signal was practical agent infrastructure: SaaS and shopping environments, formal monitoring, research-agent scaffolding, and vertical medical systems.
Big picture
Benchmarks to extract
Papers and links
A practical benchmark direction for agents operating across SaaS workflows, useful for procurement-style agent evaluation.
High-priority e-commerce agent environment surfaced by both the scout and deterministic screen.
Agent architecture signal around designing and evaluating practical multi-step systems.
Runtime monitoring and formal constraints as a concrete control layer for LLM agents.
Deep-research agent work that matches CodeSOTA's interest in paper-to-evidence workflows.
Open medical AI system surfaced as a practical extraction target for model, data, and benchmark claims.
Method note
The dated May 18 arXiv /recent section was collected in full. The full LLM run was stopped in favor of a subsample: a 60-paper scout plus deterministic screening across all 200 summaries. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
Friday's signal was about making agent behavior legible: orchestration graphs, memory cold starts, design patterns, education agents, and benchmarked governance workflows.
Big picture
Benchmarks to extract
Papers and links
Agent orchestration and reliability signal around graph-structured coordination.
Agent memory cold-start work; useful for understanding reusable memory in deployed agents.
High-priority benchmark candidate from the Friday benchmark report.
Education-agent evaluation target with obvious CodeSOTA task-page value.
Entity-centered benchmark surfaced as a strong extraction candidate.
Useful framing for separating operational task success from oversight and governance quality.
Method note
This uses the existing May 15 batch and reports already in the local paper pipeline. Claims here are at abstract or report level: they are triage, not final benchmark claims, and papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
Next extraction pass
The valuable follow-up is to pull benchmark tables, model lists, task definitions, and failure taxonomies from the strongest papers. That gives CodeSOTA rows users can compare, not merely links they can browse.
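The distinction between a triage note and a comparable row can be sketched as a small data structure. The schema below is hypothetical: the `Claim` fields, the `source_level` values, and the `is_evidence_row` rule are assumptions for illustration, not CodeSOTA's actual format. The two example claims reuse figures quoted in the digest above, with unknown fields left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    benchmark: str
    metric: Optional[str]   # None when the abstract does not name the metric
    value: float
    unit: str
    model: Optional[str]    # None when the abstract does not name the model
    source_level: str       # assumed values: "abstract" or "table"


def is_evidence_row(claim: Claim) -> bool:
    """A claim is comparable only if it was extracted from a table and is fully specified."""
    return (
        claim.source_level == "table"
        and claim.model is not None
        and claim.metric is not None
    )


claims = [
    # Abstract-level triage taken from the digest entries above.
    Claim("WeaveBench", "PassRate", 41.2, "%", None, "abstract"),
    Claim("AARRI-Bench", None, 68.3, "%", "Mini-SWE-Agent + Claude Opus 4.7", "abstract"),
]

ready = [c for c in claims if is_evidence_row(c)]
pending = [c for c in claims if not is_evidence_row(c)]
```

Under this rule both example claims stay in `pending`: each is missing either the model or the metric name, and neither has been checked against the paper's tables.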