MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

arXiv:2606.03203 · Submitted Jun 3, 2026 · 0 benchmark results

Authors pending

Abstract

MedCUA-Bench comprises 18 clinical scenarios across 10 domains. The best closed-source model reaches 54.2% strict success, whereas open-source agents average 2.5%.


Results

No benchmark results recorded yet.

