arXiv:2606.09426
WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces
Authors pending
Abstract
WeaveBench comprises 114 tasks that require orchestrating GUI, CLI, and code interfaces; the best PassRate is 41.2%.
Tasks
Results
No benchmark results recorded yet.
Benchmark results referencing this paper haven't been added to the registry yet.
CodeSOTA extraction
Benchmark evidence
- WeaveBench: extract PassRate and trajectory-aware judge correlation on hybrid tasks
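
As a rough illustration of the two quantities named in the extraction item above, the sketch below computes a pass rate and a correlation between judge scores and reference scores. The record layout, the field names (`passed`, `judge_score`, `human_score`), and the choice of Pearson correlation are assumptions made for this example; the paper's own metric definitions are not given on this page, and the sample records are made up.

```python
from math import sqrt


def pass_rate(outcomes):
    """Fraction of tasks marked as passed (assumed definition of PassRate)."""
    if not outcomes:
        raise ValueError("no outcomes given")
    return sum(1 for passed in outcomes if passed) / len(outcomes)


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-task records; these are NOT results from the paper.
records = [
    {"passed": True, "judge_score": 1.0, "human_score": 1.0},
    {"passed": False, "judge_score": 0.0, "human_score": 0.0},
    {"passed": False, "judge_score": 1.0, "human_score": 0.0},
    {"passed": True, "judge_score": 1.0, "human_score": 1.0},
]

rate = pass_rate([r["passed"] for r in records])
corr = pearson(
    [r["judge_score"] for r in records],
    [r["human_score"] for r in records],
)
print(f"PassRate: {rate:.1%}, judge/human correlation: {corr:.3f}")
```

A rank correlation or an agreement statistic could be swapped in for `pearson` if the benchmark reports judge quality that way.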