Hedge-Bench: Benchmarking Agents on Hard, Realistic Tasks Pertaining to Financial Reasoning

arXiv:2606.03918 · Submitted Jun 3, 2026 · 0 benchmark results

Authors pending

Abstract

102 real hedge-fund analyst tasks with deterministic grading; frontier models score below 16%.

Tasks

Results

No benchmark results recorded yet.

Benchmark results referencing this paper haven't been added to the registry yet. If you have a reproduction, submit it.

CodeSOTA extraction

Benchmark evidence

Confirm Hedge-Bench deterministic grading methodology and whether the <16% score holds for GPT-5 and Claude Opus 4.
