arXiv:2606.06529
Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety
Authors pending
Abstract
Strategic start/stop attack policies reduce measured safety by 20–28 percentage points (pp) at a 1% audit budget on BashArena and LinuxArena, suggesting that control evaluations may be overly optimistic.
Tasks
Results
No benchmark results recorded yet.
CodeSOTA extraction
Benchmark evidence
- Attack Selection: exact safety reduction in pp at a 1% audit budget (the abstract reports 20pp on BashArena and LinuxArena for the start policy)
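
For readers unfamiliar with the units, the following is a minimal sketch of how a percentage-point drop at a fixed audit budget can be computed. It is a toy illustration only: the function names, the suspicion scores, and the use of catch rate as a stand-in for safety are all assumptions made here, not the paper's method or data.

```python
def audit_threshold(honest_scores, audit_budget):
    """Return the suspicion score at which the top `audit_budget` fraction
    of honest trajectories would be audited (hypothetical helper)."""
    ranked = sorted(honest_scores, reverse=True)
    k = max(1, round(audit_budget * len(ranked)))  # number of audits allowed
    return ranked[k - 1]


def catch_rate(attack_scores, threshold):
    """Fraction of attack trajectories scoring at or above the threshold.
    Used here as a simplified proxy for safety (an assumption)."""
    caught = sum(1 for s in attack_scores if s >= threshold)
    return caught / len(attack_scores)


def pp_drop(baseline, selective):
    """Difference between two rates, expressed in percentage points."""
    return (baseline - selective) * 100


# Made-up scores, for illustration only.
honest = list(range(100))                  # honest suspicion scores 0..99
always_attack = [99, 99.5, 50, 100]        # attacker attacks every task
selective_attack = [99, 50, 40, 100]       # attacker picks when to attack

threshold = audit_threshold(honest, audit_budget=0.01)
baseline = catch_rate(always_attack, threshold)
selective = catch_rate(selective_attack, threshold)
drop = pp_drop(baseline, selective)
print(f"threshold={threshold}, drop={drop:.1f}pp")
```

Note that a drop from 75% to 50% is 25 percentage points, which is not the same as a 25% relative decrease; the figures quoted on this page are in percentage points.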