AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

arXiv:2603.14465 · Submitted Jun 2, 2026 · 0 benchmark results

Shengda Fan, Xuyan Ye, Yupeng Huo, Zhi-Yuan Chen, Yiju Guo, Shenzhi Yang, Wenkai Yang, Shuqi Ye, Jingwen Chen, Haotian Chen et al.

Abstract

AgentProcessBench contains 1,000 trajectories with 8,509 human step annotations (89.1% agreement).

Ternary labeling captures exploration; process signals complement outcome supervision for test-time scaling.

Results

No benchmark results recorded yet.


CodeSOTA extraction

Benchmark evidence

  • AgentProcessBench: Step-level accuracy of process reward models vs. outcome supervision (extract from Table 3).
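As a rough illustration of what a step-level accuracy figure measures, the sketch below compares predicted step labels against human step annotations and reports the fraction that match. The ternary encoding (+1, 0, -1) and the function name `step_accuracy` are assumptions made for this example; the paper's actual label definitions and metric may differ.

```python
from typing import Sequence

# Hypothetical ternary encoding, assumed for illustration only:
# +1 = step judged helpful, 0 = neutral/exploratory, -1 = step judged harmful.
POSITIVE, NEUTRAL, NEGATIVE = 1, 0, -1


def step_accuracy(predicted: Sequence[int], human: Sequence[int]) -> float:
    """Return the fraction of steps where the predicted label equals the human label."""
    if len(predicted) != len(human):
        raise ValueError("predicted and human label sequences must have equal length")
    if not human:
        raise ValueError("at least one annotated step is required")
    matches = sum(p == h for p, h in zip(predicted, human))
    return matches / len(human)


# Four steps of one trajectory; the judge disagrees with the annotator on step 3.
judge_labels = [POSITIVE, NEUTRAL, NEGATIVE, POSITIVE]
human_labels = [POSITIVE, NEUTRAL, POSITIVE, POSITIVE]
print(step_accuracy(judge_labels, human_labels))
```

Keeping a separate neutral label means an exploratory step is scored as its own class instead of being forced into correct or incorrect.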