Online-Mind2Web is a benchmark of 300 diverse, realistic tasks spanning 136 websites, introduced to assess the current state of web agents. It also includes WebJudge, an automatic evaluation method based on LLM-as-a-judge, to facilitate future agent development and evaluation.
No results indexed yet — be the first to submit a score.
Submit a checkpoint and a reproduction script. We will run the script, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.