arXiv:2606.06560
MacArena: Benchmarking Computer Use Agents on an Online macOS Environment
Authors pending
Abstract
MacArena comprises 421 manually verified tasks across 50 apps, run on native Apple Silicon. Model rankings invert between ported and macOS-native tasks, with the leader trailing by more than 26%, which reveals gaps in cross-platform GUI competence.
Tasks
Results
No benchmark results recorded yet.
Benchmark results referencing this paper haven't been added to the registry yet.
CodeSOTA extraction
Benchmark evidence
- MacArena: per-model accuracy on macOS-native subset (abstract reports leader trailing >26%)