arXiv:2606.05342
SentinelBench: A Benchmark for Long-Running Monitoring Agents
Authors pending
Abstract
SentinelBench comprises 100 tasks across 10 synthetic web environments, testing whether agents can monitor, wait, and react promptly over time rather than act continuously.
Results
No benchmark results recorded yet.
CodeSOTA extraction
Benchmark evidence
- SentinelBench task completion and reaction-time tradeoffs (table in §4)