Arena-Hard is a human-aligned benchmark of challenging, open-ended prompts sourced from live crowdsourcing platforms (notably Chatbot Arena), designed to separate models by capability while reflecting human preference. It was introduced in the paper “From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline” (arXiv:2406.11939). The Arena-Hard-Auto variant (published on Hugging Face as Arena-Hard-Auto, v0.1) is an automatic evaluation suite of 500 challenging user queries extracted from Chatbot Arena; it uses an LLM as judge (the dataset authors report prompting GPT-4-Turbo to compare each model's responses against a baseline such as GPT-4-0314). The BenchBuilder pipeline described in the paper automates extracting high-quality prompts from crowdsourced data and produces an automatically judged benchmark with high agreement with the live Chatbot Arena and strong separability between models. Common uses: automatic, human-aligned evaluation of instruction-tuned LLMs and benchmarking alignment, safety, and helpfulness.
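For illustration, here is a minimal sketch of the LLM-as-a-judge step described above, assuming the OpenAI Python SDK and simplified, unofficial judge instructions. The model names, prompt wording, and the `judge_pair` helper are placeholders; the official Arena-Hard-Auto repository uses its own judge templates and judges each pair twice with the answer order swapped to reduce position bias.

```python
# Minimal, unofficial sketch of an Arena-Hard-Auto-style pairwise judgment.
# Assumes the OpenAI Python SDK; prompt wording and helper names are placeholders,
# not the benchmark's own code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_MODEL = "gpt-4-turbo"      # judge reported by the dataset authors
BASELINE_MODEL = "gpt-4-0314"    # baseline the candidate model is compared against

JUDGE_INSTRUCTIONS = (
    "You are an impartial judge. Compare the two assistant answers to the user "
    "question and output exactly one verdict label: A>>B, A>B, A=B, B>A, or B>>A."
)

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge model for a pairwise verdict between answers A and B."""
    response = client.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_INSTRUCTIONS},
            {
                "role": "user",
                "content": (
                    f"[Question]\n{question}\n\n"
                    f"[Answer A]\n{answer_a}\n\n"
                    f"[Answer B]\n{answer_b}"
                ),
            },
        ],
    )
    return response.choices[0].message.content.strip()
```

A full run would loop `judge_pair` over all 500 prompts, judging each prompt twice with the candidate model's answer placed first and then second.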
The reported metric for Arena-Hard is the model's score: the judged win rate against the baseline model, expressed as a percentage. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
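To make the scoring concrete, the sketch below (a hypothetical `arena_hard_score` helper, not the benchmark's own code) turns pairwise verdicts of the form produced above into a win-rate percentage against the baseline. The official pipeline also reports bootstrapped confidence intervals; this simplified version returns only a point estimate and treats strong and weak preferences alike.

```python
# Simplified sketch of turning pairwise judge verdicts into an Arena-Hard-style score.
# Assumes verdict labels like "A>>B", where A is the candidate model and B the baseline.
from collections import Counter

# Weight each verdict by how favorable it is to the candidate model (A).
# This sketch does not distinguish strong from weak preferences.
VERDICT_WEIGHT = {
    "A>>B": 1.0,  # candidate strongly preferred
    "A>B": 1.0,   # candidate preferred
    "A=B": 0.5,   # tie counts as half a win
    "B>A": 0.0,
    "B>>A": 0.0,
}

def arena_hard_score(verdicts: list[str]) -> float:
    """Return the candidate's win rate (%) against the baseline over all prompts."""
    counts = Counter(v for v in verdicts if v in VERDICT_WEIGHT)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    wins = sum(VERDICT_WEIGHT[v] * n for v, n in counts.items())
    return 100.0 * wins / total

print(arena_hard_score(["A>>B", "A=B", "B>A", "A>B"]))  # -> 62.5
```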
Higher is better
| Rank | Model | Trust | Score (%) | Year | Source |
|---|---|---|---|---|---|
| 01 | Qwen2.5-Plus | paper | 81.4 | N/A | Source ↗ |